Open KarelVesely84 opened 10 months ago
The -1 to 1 check in icefall is to avoid the issue when users pass samples in the range -32768 to 32767 to the model, which is the default behavior in Kaldi.
I think it is safe to enlarge the range as long as we can achieve the same goal.
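As a rough illustration of what an enlarged range might look like, here is a minimal sketch assuming NumPy float audio. The function name `check_sample_range` and the tolerance `1.5` are hypothetical and not taken from icefall; any bound far below 32768 would still catch int16-scaled input while tolerating a small overshoot.

```python
import numpy as np


def check_sample_range(audio: np.ndarray, max_abs: float = 1.5) -> None:
    """Raise if samples look un-normalized (e.g. int16-scaled floats).

    `max_abs` is a hypothetical tolerance, not a value used by icefall.
    """
    # An empty array has no peak; treat it as silent.
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak > max_abs:
        raise ValueError(
            f"Peak sample magnitude {peak:.4f} exceeds {max_abs}; "
            "expected audio normalized to roughly [-1, 1]."
        )
```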
Hi Karel! We had another issue related to this somewhere. Technically we could either add conditional rescaling (if np.max(np.abs(audio)) > 1.0, then divide audio by maxabs value) or a limiter (I have one in a separate pip package https://github.com/pzelasko/cylimiter), but I'm just not sure if it's worth paying the runtime cost. If it's not a strict requirement in Icefall I think it's OK to leave it as it is.
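The conditional rescaling idea could be sketched roughly as below, assuming NumPy float audio. The helper name `rescale_if_needed` is hypothetical and is not part of Lhotse or Icefall.

```python
import numpy as np


def rescale_if_needed(audio: np.ndarray) -> np.ndarray:
    """Divide by the peak magnitude only when that peak exceeds 1.0."""
    # Finding the peak takes a full pass over the samples.
    peak = np.max(np.abs(audio)) if audio.size else 0.0
    if peak > 1.0:
        return audio / peak
    # In-range audio is returned untouched (same array object).
    return audio
```

The peak search runs on every call, even when no rescaling turns out to be needed, which is the runtime cost in question.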
Hello @pzelasko, @csukuangfj, I just identified an open question related to the audio transforms.
In lhotse, there is the `Resample` class wrapping the `torchaudio.transforms.Resample()`.
When resampling 32kHz->16kHz common_voice_cs_26209290, the `audio.max()` becomes 1.0079.
In streaming_decode.py in Icefall, there is a check that max audio sample must be `s<=1.0`.
What would be the cleanest solution to this?
a) Stop checking for `audio.abs().max()<=1.0` in Icefall.
b) Introduce audio clipping `AudioTransform` to lhotse.
c) Introduce audio Limiter `AudioTransform` (sth. like: `if audio.abs().max() > 0.99: rescale_to_099(...)`) to lhotse.
d) Try to add a check to `torchaudio`, into `torchaudio.transforms.Resample()`.
I guess a similar issue would appear also for volume perturbation, but I did not check that specifically.
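For option b), the clipping step itself could look something like the sketch below. It assumes NumPy float audio and is a standalone function with the hypothetical name `clip_audio`; it does not implement Lhotse's actual `AudioTransform` interface.

```python
import numpy as np


def clip_audio(samples: np.ndarray, limit: float = 1.0) -> np.ndarray:
    """Hard-clip samples to [-limit, limit]; in-range samples are unchanged."""
    return np.clip(samples, -limit, limit)
```

Hard clipping only alters the few samples that overshoot, whereas rescaling changes the level of the whole recording.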
Best regards Karel
// Ps: All the best in the new "western" year !!