lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Audio range out of (-1,+1) #1254

Open KarelVesely84 opened 10 months ago

KarelVesely84 commented 10 months ago

Hello @pzelasko, @csukuangfj, I just identified an open question related to the audio transforms.

In lhotse, there is the Resample class wrapping the torchaudio.transforms.Resample().

When resampling common_voice_cs_26209290 from 32 kHz to 16 kHz, audio.max() becomes 1.0079.

In streaming_decode.py in Icefall, there is a check that the maximum audio sample must be <= 1.0.

What would be the cleanest solution to this?
a) Stop checking for audio.abs().max()<=1.0 in Icefall.
b) Introduce an audio clipping AudioTransform to lhotse.
c) Introduce an audio Limiter AudioTransform (something like: if audio.abs().max() > 0.99: rescale_to_099(...)) to lhotse.
d) Try to add a check to torchaudio, into torchaudio.transforms.Resample().
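To make options b) and c) concrete, here is a minimal NumPy sketch of what each could do to a waveform. The function names clip_audio and limit_peak are hypothetical and are not part of lhotse, Icefall, or torchaudio; the sample values are synthetic, with the peak set to the 1.0079 reported above.

```python
import numpy as np


def clip_audio(audio: np.ndarray, limit: float = 1.0) -> np.ndarray:
    """Option b) sketch: hard-clip every sample into [-limit, +limit]."""
    return np.clip(audio, -limit, limit)


def limit_peak(audio: np.ndarray, target_peak: float = 0.99) -> np.ndarray:
    """Option c) sketch: rescale the whole signal only if its peak exceeds target_peak."""
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak > target_peak:
        return audio * (target_peak / peak)
    return audio


# Synthetic samples (not real Common Voice data); the peak mimics the reported 1.0079.
audio = np.array([0.25, -0.5, 1.0079, -0.75], dtype=np.float32)
clipped = clip_audio(audio)   # only the overshooting sample is changed
limited = limit_peak(audio)   # every sample is scaled by the same factor
```

Clipping touches only the samples that overshoot, at the price of a small distortion there, while rescaling keeps the waveform shape but changes the overall level of the utterance.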

I guess a similar issue would appear also for volume perturbation, but I did not check that specifically.

Best regards Karel

// Ps: All the best in the new "western" year !!

csukuangfj commented 10 months ago

The -1 to 1 check in icefall is to avoid the issue when users pass samples in the range -32768 to 32767 to the model, which is the default behavior in Kaldi.

I think it is safe to enlarge the range as long as we can achieve the same goal.
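One way such an enlarged check might look is sketched below. This is not the actual Icefall code: the function name looks_like_float_audio and the threshold of 10.0 are assumptions chosen only for illustration, on the reasoning that a small overshoot such as 1.0079 should pass while int16-range samples should still be rejected.

```python
import numpy as np

# Hypothetical threshold: far above a small resampling overshoot (e.g. 1.0079),
# far below typical int16-range sample values (up to 32768 in magnitude).
MAX_ABS_SAMPLE = 10.0


def looks_like_float_audio(samples: np.ndarray, max_abs: float = MAX_ABS_SAMPLE) -> bool:
    """Return True if the samples look like float audio scaled to roughly [-1, 1]."""
    if samples.size == 0:
        return True
    return float(np.max(np.abs(samples))) <= max_abs


resampled = np.array([0.1, -0.4, 1.0079], dtype=np.float32)       # slight overshoot
kaldi_style = np.array([-32768.0, 120.0, 32767.0], dtype=np.float32)  # int16-range values

ok_resampled = looks_like_float_audio(resampled)
ok_kaldi_style = looks_like_float_audio(kaldi_style)
```

A loose bound like this cannot catch a very quiet int16-range recording whose samples all stay below the threshold, so it is a heuristic rather than a guarantee.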

pzelasko commented 10 months ago

Hi Karel! We had another issue related to this somewhere. Technically we could either add conditional rescaling (if np.max(np.abs(audio)) > 1.0, then divide audio by maxabs value) or a limiter (I have one in a separate pip package https://github.com/pzelasko/cylimiter), but I'm just not sure if it's worth paying the runtime cost. If it's not a strict requirement in Icefall I think it's OK to leave it as it is.