mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Imported 8khz training audio compromised by unfiltered upsampling #1726

khsinclair closed 5 years ago

khsinclair commented 5 years ago

The audio for the Fisher training corpus, and possibly Switchboard as well, is originally μ-law encoded at an 8 kHz sample rate. The import_fisher.py script converts it to 16 kHz PCM for training. (I'm not sure about import_swb.)

The problem is that the upsampling is done with the Python audioop.ratecv() primitive, which does no filtering at all. That leaves the high band from 4 kHz to 8 kHz filled with a mirror image of the voice band. I'll attach spectrograms to illustrate.
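To make the imaging effect concrete, here is a minimal pure-Python sketch. It does not call audioop.ratecv(); it assumes that doubling the rate by plain linear interpolation is a fair stand-in for an upsampler with no anti-imaging filter. The helper names (`tone`, `upsample2x_linear`, `magnitude_at`) and the 3 kHz test tone are illustrative choices, not taken from import_fisher.py.

```python
import math

def tone(freq_hz, rate_hz, n):
    """n samples of a unit-amplitude sine at freq_hz, sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def upsample2x_linear(x):
    """Double the sample rate by inserting the midpoint between neighbours.

    Stand-in for an unfiltered upsampler (an assumption for illustration).
    """
    out = []
    for a, b in zip(x, x[1:]):
        out.append(a)
        out.append((a + b) / 2)
    out.append(x[-1])
    return out

def magnitude_at(x, freq_hz, rate_hz):
    """Magnitude of one DFT bin, scaled so a unit sine reads about 1."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

src = tone(3000, 8000, 800)              # 3 kHz tone at the 8 kHz source rate
up = upsample2x_linear(src)              # now nominally 16 kHz

wanted = magnitude_at(up, 3000, 16000)   # the real tone
image = magnitude_at(up, 5000, 16000)    # its mirror around 4 kHz (8 kHz - 3 kHz)
image_db = 20 * math.log10(image / wanted)
print(f"image at 5 kHz is {image_db:.1f} dB relative to the 3 kHz tone")
```

The only content in the source is a 3 kHz tone, so any energy measured at 5 kHz is the image folded around the old Nyquist frequency of 4 kHz. Linear interpolation attenuates it only mildly, which is why it shows up clearly in a spectrogram.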

Properly bandlimited upsampling is not as simple as it sounds. Julius Smith has a good explanation and lists a number of good implementations here: https://ccrma.stanford.edu/~jos/resample/resample.html

The function _split_and_resample_wav in import_fisher.py should use a better upsampler. I think the sox utility, as used in the deepspeech python client, does it right by default.
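For comparison, a band-limited 2x upsampler can be sketched in pure Python as zero-stuffing followed by a windowed-sinc low-pass filter. This is only an illustration of the idea, not what sox does internally and not a proposed patch for _split_and_resample_wav. The names `halfband_taps` and `upsample2x_bandlimited`, the 65-tap length, and the Hamming window are all assumptions chosen for brevity.

```python
import math

def tone(freq_hz, rate_hz, n):
    """n samples of a unit-amplitude sine at freq_hz, sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def magnitude_at(x, freq_hz, rate_hz):
    """Magnitude of one DFT bin, scaled so a unit sine reads about 1."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

def halfband_taps(half_width=32):
    """Hamming-windowed sinc low-pass, cutoff at a quarter of the output rate."""
    taps = []
    for k in range(-half_width, half_width + 1):
        sinc = 1.0 if k == 0 else math.sin(math.pi * k / 2) / (math.pi * k / 2)
        window = 0.54 + 0.46 * math.cos(math.pi * k / half_width)
        taps.append(sinc * window)
    return taps

def upsample2x_bandlimited(x, half_width=32):
    """Zero-stuff to twice the rate, then low-pass away the spectral image."""
    taps = halfband_taps(half_width)
    stuffed = [0.0] * (2 * len(x))
    stuffed[::2] = x                      # original samples on even indices
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for j, h in enumerate(taps):
            m = n + half_width - j        # input index for this tap
            if 0 <= m < len(stuffed):
                acc += h * stuffed[m]
        out.append(acc)
    return out

src = tone(3000, 8000, 800)               # 3 kHz tone at 8 kHz
up = upsample2x_bandlimited(src)          # 1600 samples at 16 kHz
centre = up[400:1200]                     # skip the filter's edge transients

wanted = magnitude_at(centre, 3000, 16000)
image = magnitude_at(centre, 5000, 16000)
image_db = 20 * math.log10(image / wanted)
print(f"image at 5 kHz is {image_db:.1f} dB relative to the 3 kHz tone")
```

In practice a tested resampler is a better choice than hand-rolled filter code. With sox, a plain rate conversion such as `sox in.wav -r 16000 out.wav` (file names hypothetical) applies its band-limited rate conversion by default, as far as I know.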

khsinclair commented 5 years ago

Attached is an Audacity screenshot with two spectrograms of the same phrase, originally recorded at an 8 kHz sample rate. The top spectrogram is upsampled to 16 kHz using audioop.ratecv() exactly as import_fisher.py uses it. You can see that the frequency content from 4 kHz to 8 kHz is a reflected image of the lower band. The bottom spectrogram was upsampled to 16 kHz using the sox command line, and the high band has been properly suppressed.

[Screenshot: 2018-11-15_06h42_26]

khsinclair commented 5 years ago

And here's the spectrum of the first vowel in the phrase, the shaded region in the top spectrogram. There are probably only a few mel filters in the high band, but that's still a significant source of noise in the training set.

[Screenshot: 2018-11-15_06h41_11]
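A rough way to put a number on "a few" is to count how many mel filter centres land above 4 kHz. The sketch below assumes 26 triangular filters spaced evenly on the mel scale from 0 Hz to 8 kHz and the common 2595·log10(1 + f/700) mel formula. Both are assumptions for illustration; the thread does not state the actual feature configuration.

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale formula (O'Shaughnessy variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

NUM_FILTERS = 26        # assumed filter count, not confirmed by the thread
LOW_HZ = 0.0
HIGH_HZ = 8000.0        # Nyquist frequency for 16 kHz audio
OLD_NYQUIST_HZ = 4000.0 # Nyquist frequency of the 8 kHz source

# NUM_FILTERS triangular filters need NUM_FILTERS + 2 evenly spaced mel points;
# the interior points are the filter centres.
low_mel, high_mel = hz_to_mel(LOW_HZ), hz_to_mel(HIGH_HZ)
step = (high_mel - low_mel) / (NUM_FILTERS + 1)
centres_hz = [mel_to_hz(low_mel + i * step) for i in range(1, NUM_FILTERS + 1)]

high_band = [c for c in centres_hz if c > OLD_NYQUIST_HZ]
print(f"{len(high_band)} of {NUM_FILTERS} filter centres lie above 4 kHz")
```

Under these assumptions, 6 of the 26 filter centres sit in the band that should be empty for 8 kHz source audio, so close to a quarter of the filterbank outputs would be driven by the image rather than by real speech content.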

kdavis-mozilla commented 5 years ago

@khsinclair I spot checked the audio imported from Switchboard and it seems fine

[Image: spectrum]

khsinclair commented 5 years ago

That Switchboard spectrogram is for the 8khz source audio. Where in the import/training pipeline does it get upconverted to 16khz, and is that properly done?

kdavis-mozilla commented 5 years ago

@khsinclair Agreed, and interesting. Let me see if I pulled that from the right Switchboard export on our server.

kdavis-mozilla commented 5 years ago

@khsinclair It's from the correct Switchboard export, but it's 16-bit signed PCM at 8 kHz. Wow!

kdavis-mozilla commented 5 years ago

@khsinclair Another great catch! And it's nice on our side, as we'll get added performance simply by fixing the importer.

kdavis-mozilla commented 5 years ago

Just opened issue #1740 (Switchboard Importer Creates WAV's at 8KHz not 16KHz)

kdavis-mozilla commented 5 years ago

@khsinclair Upconversion happens in the MFCC feature computation. But I'm going to fix the importer to make sure it's done correctly.
