augmentation/torchaudio: add Phone effect (mulaw, lpc10 codecs)

lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.

https://lhotse.readthedocs.io/en/latest/

Apache License 2.0


augmentation/torchaudio: add Phone effect (mulaw, lpc10 codecs) #1348

Open rouseabout opened 1 month ago

rouseabout commented 1 month ago

This patch adds an audio codec transformation.

I have found that when applying K2 ASR to speech compressed with mulaw, it is advantageous to augment the training data with these codecs. The transformation resamples the input audio to 8 kHz, encodes and then decodes it using the specified codec, and then restores the original sample rate (e.g. 16 kHz).
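For readers unfamiliar with what the mulaw step does to the signal, here is a minimal pure-Python sketch of an encode/decode round trip. It uses the continuous mu-law companding formula with 8-bit quantization, which is an assumption for illustration: it is not bit-exact G.711 and not necessarily what this patch's backend does. The function names are hypothetical, and the 8 kHz resampling is left out.

```python
import math

MU = 255  # 8-bit mu-law: codes run from 0 to 255


def mulaw_encode(x: float) -> int:
    """Compress one sample in [-1, 1] to an integer code in [0, 255]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((y + 1) / 2 * MU + 0.5)


def mulaw_decode(code: int) -> float:
    """Expand an integer code in [0, 255] back to a sample in [-1, 1]."""
    y = 2 * code / MU - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def mulaw_roundtrip(samples):
    """Encode then decode each sample; the result carries quantization error."""
    return [mulaw_decode(mulaw_encode(s)) for s in samples]
```

The round trip is lossy by design, and the error is larger for loud samples than for quiet ones, which is the kind of distortion the augmentation is meant to expose the model to.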

Open issues:

[ ] The transformation is called phone(). But maybe a better name is needed?
[ ] Since it significantly alters the audio, depending on codec, I am wondering how best to test the transformation?

Example use:

cs2 = CutSet.from_manifests(...).phone(codec="mulaw")
cs3 = CutSet.from_manifests(...).phone(codec="lpc10")

libspandsp is required to use the lpc10 codec. Use apt-get install libspandsp-dev on Debian/Ubuntu.

rouseabout commented 4 weeks ago

I have addressed everything except for restore_orig_sr=True. I am not sure how to achieve that!

pzelasko commented 3 weeks ago

> I have addressed everything except for restore_orig_sr=True. I am not sure how to achieve that!

You are very close! Add a parameter restore_orig_sr=True in def narrowband(self, ...) for cut and recording, and pass the provided argument to the Narrowband constructor. Then you can extend the condition for the second resampling to if self.restore_orig_sr and sampling_rate != 8000.
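A rough sketch of the control flow described above, not the actual Lhotse code: NarrowbandSketch and naive_resample are hypothetical stand-ins, the codec step is omitted, and the resampler is a crude nearest-neighbour placeholder. Only the restore_orig_sr flag and the extended condition come from the suggestion.

```python
from dataclasses import dataclass


def naive_resample(samples, orig_sr, new_sr):
    """Hypothetical placeholder: nearest-neighbour resampling, no filtering."""
    n_out = round(len(samples) * new_sr / orig_sr)
    last = len(samples) - 1
    return [samples[min(last, int(i * orig_sr / new_sr))] for i in range(n_out)]


@dataclass
class NarrowbandSketch:
    """Hypothetical stand-in for the transform; not the real Narrowband class."""

    codec: str = "mulaw"
    restore_orig_sr: bool = True

    def __call__(self, samples, sampling_rate):
        # First resampling: bring the audio down to 8 kHz if needed.
        if sampling_rate != 8000:
            samples = naive_resample(samples, sampling_rate, 8000)

        # The codec encode/decode step would run here (omitted in this sketch).

        # Second resampling: only when the caller asked for the original rate.
        if self.restore_orig_sr and sampling_rate != 8000:
            return naive_resample(samples, 8000, sampling_rate), sampling_rate
        return samples, 8000
```

With restore_orig_sr=False the output stays at 8 kHz, so anything downstream that assumes the original rate and sample count has to be told about the change.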

rouseabout commented 1 week ago

Done, but something extra is needed, because when I apply the transformation with restore_orig_sr=False, the following exception occurs:

AudioLoadingError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0, duration=19.22419501133787). This could be internal Lhotse's error or a faulty transform implementation. Please report this issue in Lhotse and show the following: diff=693887, audio.shape=(1, 153900), recording=Recording(id='0_nb_lpc10', sources=[AudioSource(type='file', channels=[0], source='/home/user/workspace/rtvalid/0.wav')], sampling_rate=44100, num_samples=847787, duration=19.22419501133787, channel_ids=[0], transforms=[{'name': 'Narrowband', 'kwargs': {'codec': 'lpc10', 'restore_orig_sr': False}}])

pzelasko commented 1 week ago

If you don't restore the original sampling rate, you'll have to update both the sampling_rate and num_samples properties on the Recording object.
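To make the mismatch in the traceback concrete, here is a small sketch of the bookkeeping. RecordingInfo and narrowband_info are hypothetical stand-ins, not Lhotse's Recording API. The frame argument assumes the lpc10 codec pads its output to whole 180-sample frames (22.5 ms at 8 kHz); that assumption happens to reproduce the audio.shape reported above, but it has not been checked against the patch.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordingInfo:
    """Hypothetical stand-in holding only the fields relevant here."""

    sampling_rate: int
    num_samples: int

    @property
    def duration(self) -> float:
        return self.num_samples / self.sampling_rate


def narrowband_info(rec, target_sr=8000, frame=None):
    """Return metadata describing the audio after it is left at target_sr."""
    n = round(rec.num_samples * target_sr / rec.sampling_rate)
    if frame:
        # Assumption: the codec emits whole frames, so round up to a multiple.
        n = math.ceil(n / frame) * frame
    return replace(rec, sampling_rate=target_sr, num_samples=n)


declared = RecordingInfo(sampling_rate=44100, num_samples=847787)
updated = narrowband_info(declared, frame=180)
print(updated.num_samples, declared.num_samples - updated.num_samples)
```

The gap between the declared and updated sample counts equals the diff value in the exception, which suggests the error comes purely from stale metadata rather than from the audio itself. In the real Recording, the duration field would presumably need to be kept consistent as well.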