lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Audio range out of (-1,+1) #1254

Open KarelVesely84 opened 9 months ago

KarelVesely84 commented 9 months ago

Hello @pzelasko, @csukuangfj, I just identified an open question related to the audio transforms.

In lhotse, there is the Resample class, which wraps torchaudio.transforms.Resample().

When resampling common_voice_cs_26209290 from 32 kHz to 16 kHz, audio.max() becomes 1.0079.

In Icefall, streaming_decode.py checks that the maximum audio sample satisfies s <= 1.0.

What would be the cleanest solution to this?

a) Stop checking for audio.abs().max() <= 1.0 in Icefall.
b) Introduce an audio clipping AudioTransform to lhotse.
c) Introduce an audio limiter AudioTransform to lhotse (something like: if audio.abs().max() > 0.99: rescale_to_099(...)).
d) Try to add a check to torchaudio, in torchaudio.transforms.Resample().
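For illustration, option b) could look roughly like the sketch below. It uses plain NumPy rather than the lhotse AudioTransform interface, and `clip_audio` is a hypothetical helper name, not an existing lhotse function. Hard clipping flattens the overshooting peaks, which is cheap but slightly distorts the affected samples.

```python
import numpy as np


def clip_audio(audio: np.ndarray, limit: float = 1.0) -> np.ndarray:
    """Hard-clip samples into [-limit, +limit].

    Hypothetical helper for illustration; not part of lhotse.
    """
    return np.clip(audio, -limit, limit)


# A float waveform whose peaks overshoot full scale, similar to the
# post-resampling value reported above (the other samples are made up).
audio = np.array([0.25, -0.5, 1.0079, -1.002], dtype=np.float32)
clipped = clip_audio(audio)
```

Samples already inside the range are returned unchanged, so only the overshooting peaks are affected.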

I guess a similar issue would also appear for volume perturbation, but I did not check that specifically.

Best regards, Karel

// Ps: All the best in the new "western" year !!

csukuangfj commented 9 months ago

The -1 to 1 check in icefall guards against users passing samples in the range -32768 to 32767 to the model, which is the default behavior in Kaldi.

I think it is safe to enlarge the range as long as we can achieve the same goal.
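One way to read "enlarge the range" is a looser bound that still rejects int16-scaled input. The sketch below is only an assumption about how such a check might look: the threshold of 10.0 and the name `looks_normalized` are hypothetical and are not taken from icefall. It assumes a non-empty array.

```python
import numpy as np

# Hypothetical threshold: well above any small resampling overshoot,
# well below the int16 sample range used by Kaldi-style input.
MAX_ABS_SAMPLE = 10.0


def looks_normalized(audio: np.ndarray, max_abs: float = MAX_ABS_SAMPLE) -> bool:
    """Return True if the samples look like floats scaled to roughly [-1, 1]."""
    # Cast to float64 first so that abs(-32768) does not overflow for int16 input.
    peak = np.abs(audio.astype(np.float64)).max()
    return bool(peak <= max_abs)


resampled = np.array([0.3, -0.8, 1.0079], dtype=np.float32)
kaldi_style = np.array([1200, -32768, 32767], dtype=np.int16)

ok_resampled = looks_normalized(resampled)
ok_kaldi = looks_normalized(kaldi_style)
```

With this bound, a peak of 1.0079 passes while int16-range samples are still flagged.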

pzelasko commented 9 months ago

Hi Karel! We had another issue related to this somewhere. Technically we could either add conditional rescaling (if np.max(np.abs(audio)) > 1.0, then divide audio by maxabs value) or a limiter (I have one in a separate pip package https://github.com/pzelasko/cylimiter), but I'm just not sure if it's worth paying the runtime cost. If it's not a strict requirement in Icefall I think it's OK to leave it as it is.
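The conditional rescaling described above can be sketched as follows. This is a minimal NumPy illustration, not code from lhotse or cylimiter; `rescale_if_needed` and its `peak` parameter are hypothetical names. Unlike clipping, it scales the whole waveform by one factor, so the shape is preserved but the overall level drops slightly.

```python
import numpy as np


def rescale_if_needed(audio: np.ndarray, peak: float = 1.0) -> np.ndarray:
    """Scale the whole waveform down only if its absolute peak exceeds `peak`.

    Hypothetical helper for illustration; not part of lhotse or cylimiter.
    """
    max_abs = float(np.max(np.abs(audio)))
    if max_abs > peak:
        # One gain factor for all samples keeps the waveform shape intact.
        return audio * (peak / max_abs)
    # Nothing to do: return the input unchanged (no copy).
    return audio


# Made-up samples; only the 1.0079 peak mirrors the value reported in this issue.
audio = np.array([0.5, -0.25, 1.0079], dtype=np.float32)
rescaled = rescale_if_needed(audio)
```

The runtime cost mentioned above comes from the extra pass over the samples needed to find the peak. Setting `peak` to 0.99 would approximate the rescaling variant from option c) in the original post.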