Closed khsinclair closed 5 years ago
Attached is an Audacity screenshot with two spectrograms of the same phrase, originally recorded at an 8 kHz sample rate. The top spectrogram was upsampled to 16 kHz using audioop.ratecv() exactly as import_fisher.py uses it. You can see that the frequency content from 4 kHz to 8 kHz is a reflected image of the lower band. The bottom spectrogram was upsampled to 16 kHz using the sox command line, and the high band has been properly suppressed.
And here's the spectrum of the first vowel in the phrase, the shaded region in the top spectrogram. There are probably only a few mel filterbank channels in the high band, but that's still a significant source of noise in the training set.
@khsinclair I spot-checked the audio imported from Switchboard and it seems fine.
That Switchboard spectrogram is for the 8 kHz source audio. Where in the import/training pipeline does it get upconverted to 16 kHz, and is that done properly?
@khsinclair Agreed, and interesting. Let me see if I pulled that from the right Switchboard export on our server.
@khsinclair It's from the correct Switchboard export, but it's 16-bit signed PCM at 8 kHz. Wow!
@khsinclair Another great catch! And nice on our side, as we'll get added performance simply by fixing the importer.
Just opened issue #1740 (Switchboard Importer Creates WAV's at 8KHz not 16KHz)
@khsinclair Upconversion happens in the MFCC feature computation. But I'm going to fix the importer to make sure it's done correctly.
The audio for the Fisher training corpus, and possibly Switchboard as well, is originally at an 8 kHz sample rate with u-law encoding. The import_fisher.py script converts it to 16 kHz PCM for training. (I'm not sure about import_swb.)
The problem is that upsampling is done with a Python audioop.ratecv() primitive that does no filtering at all, leaving the high band from 4 kHz to 8 kHz with an image of the voice band. I'll attach spectrograms to illustrate.
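To make the imaging concrete, here is a minimal Python sketch. It assumes plain linear interpolation as a stand-in for a rate converter with no anti-imaging filter; it does not call audioop.ratecv(), and the helper names (tone, upsample_2x_linear, magnitude_at, image_level_db) are made up for illustration. A 3 kHz tone sampled at 8 kHz is doubled to 16 kHz, and the level of its mirror image at 5 kHz is compared with the tone itself.

```python
import math


def tone(freq_hz, rate_hz, n):
    """Return n samples of a unit-amplitude sine at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]


def upsample_2x_linear(x):
    """Double the rate by inserting the midpoint between neighbouring samples."""
    out = []
    for i in range(len(x) - 1):
        out.append(x[i])
        out.append(0.5 * (x[i] + x[i + 1]))
    out.append(x[-1])
    return out


def magnitude_at(x, freq_hz, rate_hz):
    """Magnitude of a single spectral bin at freq_hz (direct correlation)."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    return math.hypot(re, im) / len(x)


def image_level_db(tone_hz=3000.0, in_rate=8000, n=800):
    """Level of the mirror image (in_rate - tone_hz) relative to the tone, in dB."""
    y = upsample_2x_linear(tone(tone_hz, in_rate, n))
    out_rate = 2 * in_rate
    wanted = magnitude_at(y, tone_hz, out_rate)
    image = magnitude_at(y, in_rate - tone_hz, out_rate)
    return 20 * math.log10(image / wanted)


if __name__ == "__main__":
    print("image level relative to tone: %.1f dB" % image_level_db())
```

Under these assumptions the 5 kHz image comes out only about 7 dB below the 3 kHz tone, which is far from suppressed.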
Properly bandlimited upsampling is not as simple as it sounds. Julius Smith has a good explanation and lists a number of good implementations here: https://ccrma.stanford.edu/~jos/resample/resample.html
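As a rough illustration of what bandlimited upsampling involves, the sketch below doubles the rate by zero-stuffing and then applying a Hann-windowed sinc low-pass with its cutoff at the original Nyquist frequency. This is a textbook windowed-sinc design, not the algorithm used by sox or by the implementations listed on the linked page; the function names and the 65-tap length are arbitrary choices for illustration.

```python
import math


def halfband_kernel(half_width=32):
    """Hann-windowed sinc with cutoff at a quarter of the output rate."""
    taps = []
    for k in range(-half_width, half_width + 1):
        arg = math.pi * k / 2
        sinc = 1.0 if k == 0 else math.sin(arg) / arg
        window = 0.5 * (1.0 + math.cos(math.pi * k / (half_width + 1)))
        taps.append(sinc * window)
    return taps


def upsample_2x_bandlimited(x, half_width=32):
    """Zero-stuff to twice the rate, then low-pass (full convolution)."""
    taps = halfband_kernel(half_width)
    stuffed = []
    for v in x:
        stuffed.extend((v, 0.0))
    out = [0.0] * (len(stuffed) + len(taps) - 1)
    for i, v in enumerate(stuffed):
        if v != 0.0:  # stuffed zeros contribute nothing
            for j, t in enumerate(taps):
                out[i + j] += v * t
    return out


def magnitude_at(x, freq_hz, rate_hz):
    """Magnitude of a single spectral bin at freq_hz (direct correlation)."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    return math.hypot(re, im) / len(x)


def bandlimited_image_db(tone_hz=3000.0, in_rate=8000, n=800):
    """Level of the image at (in_rate - tone_hz) relative to the tone, in dB."""
    x = [math.sin(2 * math.pi * tone_hz * i / in_rate) for i in range(n)]
    y = upsample_2x_bandlimited(x)
    wanted = magnitude_at(y, tone_hz, 2 * in_rate)
    image = magnitude_at(y, in_rate - tone_hz, 2 * in_rate)
    return 20 * math.log10(image / wanted)
```

The output is delayed by half_width samples and is longer than twice the input because the full convolution tail is kept. In this setup the 5 kHz image of a 3 kHz tone should land more than 40 dB below the tone.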
The function _split_and_resample_wav in import_fisher.py should use a better upsampler. I think the sox utility, as used in the deepspeech python client, does it right by default.
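One way this could look, assuming the sox binary is on PATH and the input is a file sox can read: the helper below builds a sox command that writes 16 kHz, 16-bit signed PCM and leaves the rate conversion to sox. The function names and file paths are hypothetical, and this is not the code in import_fisher.py.

```python
import shutil
import subprocess


def build_sox_resample_command(src_path, dst_path, rate_hz=16000):
    """Return a sox argv; output-format options go just before the output file."""
    return [
        "sox", src_path,
        "-r", str(rate_hz),       # output sample rate
        "-b", "16",               # 16-bit samples
        "-e", "signed-integer",   # signed PCM encoding
        dst_path,
    ]


def resample_with_sox(src_path, dst_path, rate_hz=16000):
    """Run sox to resample src_path into dst_path; raises if sox is missing or fails."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox not found on PATH")
    subprocess.run(build_sox_resample_command(src_path, dst_path, rate_hz), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to check the arguments without having sox installed.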