unclear code in speech_to_text, needs a comment explaining "why"

watson-developer-cloud / speech-javascript-sdk

Library for using the IBM Watson Speech to Text and Text to Speech services in web browsers.

https://watson-speech.mybluemix.net/


unclear code in speech_to_text, needs a comment explaining "why" #30

Closed mreinstein closed 7 years ago

mreinstein commented 7 years ago

There is some filtering logic happening here: https://github.com/watson-developer-cloud/speech-javascript-sdk/blob/master/speech-to-text/webaudio-l16-stream.js#L84

@nfriedly mentioned this had something to do with anti-aliasing during downsampling, but neither of us could come up with clear wording on why that code is present or what exactly it's doing.

This is one of the few areas in the code that is very hard to understand by simply reading it; the rest of the module is pretty self-documenting and easy to follow. :)

nfriedly commented 7 years ago

I posted my own question on the DSP stack exchange to help get a better handle on this. It includes a nice graph of the filter array :)

http://dsp.stackexchange.com/questions/37450/please-help-me-understand-this-audio-downsampling-code
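
For readers following along, here is a minimal sketch of what anti-aliased downsampling generally looks like: low-pass filter the input so nothing above the new Nyquist limit survives, then keep one sample out of every few. This is not the SDK's actual code. The function names `makeLowPass` and `downsample`, the 31-tap length, and the 7kHz cutoff are all assumptions chosen for illustration.

```javascript
// Hypothetical helpers for illustration only; not taken from the SDK.

// Build a Hamming-windowed sinc low-pass filter.
// `cutoff` is a fraction of the INPUT sample rate (0 < cutoff < 0.5).
function makeLowPass(numTaps, cutoff) {
  const taps = new Float32Array(numTaps);
  const mid = (numTaps - 1) / 2;
  let sum = 0;
  for (let i = 0; i < numTaps; i++) {
    const x = i - mid;
    const sinc = x === 0
      ? 2 * cutoff
      : Math.sin(2 * Math.PI * cutoff * x) / (Math.PI * x);
    const hamming = 0.54 - 0.46 * Math.cos((2 * Math.PI * i) / (numTaps - 1));
    taps[i] = sinc * hamming;
    sum += taps[i];
  }
  // Normalize so a constant (DC) input passes through with gain 1.
  for (let i = 0; i < numTaps; i++) taps[i] /= sum;
  return taps;
}

// Apply the filter and keep one output sample per `factor` input samples.
function downsample(input, factor, taps) {
  const outLength = Math.floor((input.length - taps.length) / factor) + 1;
  const output = new Float32Array(Math.max(outLength, 0));
  for (let n = 0; n < output.length; n++) {
    const start = n * factor;
    let acc = 0;
    for (let k = 0; k < taps.length; k++) {
      acc += input[start + k] * taps[k];
    }
    output[n] = acc;
  }
  return output;
}

// Usage: 48kHz input, decimated by 3 to reach 16kHz.
const inputRate = 48000;
const lowPass = makeLowPass(31, 7000 / inputRate); // assumed cutoff, a little below 8kHz
const tone = Float32Array.from({ length: 4800 }, (_, i) =>
  Math.sin((2 * Math.PI * 1000 * i) / inputRate)
);
const decimated = downsample(tone, 3, lowPass);
console.log(decimated.length); // roughly one third of the input length
```

Without the filter step, any energy above 8kHz in the 48kHz input would fold back into the audible band of the 16kHz output, which is the aliasing the filter array is there to prevent.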

mreinstein commented 7 years ago

Nice! So you're planning to support downsampling to 8kHz too? Is this to minimize the bandwidth sent to Watson? I'm wondering whether going below 16kHz actually affects recognition quality.

nfriedly commented 7 years ago

Well, the service supports 8kHz "narrowband" models and 16kHz "broadband" models. Right now, when you select a narrowband model, the audio is downsampled to 16kHz here, and then I think it gets downsampled again to 8kHz at the service. So, doing it once here should both save bandwidth and increase quality.
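
To put a rough number on the bandwidth side, here is a back-of-the-envelope calculation. It assumes mono 16-bit linear PCM (2 bytes per sample) with no container or protocol overhead; `bytesPerSecond` is a hypothetical helper, not part of the SDK.

```javascript
// Assumption: mono, 16-bit linear PCM, so 2 bytes per sample.
const BYTES_PER_SAMPLE = 2;

function bytesPerSecond(sampleRateHz) {
  return sampleRateHz * BYTES_PER_SAMPLE;
}

const broadbandRate = bytesPerSecond(16000);  // 32000 bytes/s
const narrowbandRate = bytesPerSecond(8000);  // 16000 bytes/s

console.log(`16kHz: ${broadbandRate} B/s, 8kHz: ${narrowbandRate} B/s`);
console.log(`saved: ${100 * (1 - narrowbandRate / broadbandRate)}%`); // saved: 50%
```

Under those assumptions, sending 8kHz audio for a narrowband model would halve the upload compared with sending 16kHz.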

mreinstein commented 7 years ago

Ah, right, I forgot about the narrowband model. So really this is about bandwidth savings when opting into the non-broadband model. Thanks for the clarification.

mreinstein commented 7 years ago

@nfriedly very nice!

Question about this line:

Input audio is typically 48kHz, this downsamples it to 16kHz.

I've seen several browsers provide different sample rates. For example, Chrome currently produces 44.1kHz samples. I'm wondering whether this filtering logic is going to slightly distort that input.

nfriedly commented 7 years ago

Oh, let me adjust that. With 44.1kHz input, the cutoff point of the low-pass filter drops slightly, but not enough to affect human speech.
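
A quick way to see why the cutoff moves: a fixed set of filter coefficients has its cutoff at a fixed fraction of the input sample rate, so the cutoff in Hz scales with whatever rate the browser delivers. The sketch below assumes, purely for illustration, a filter designed for an 8kHz cutoff at 48kHz; the SDK's real design cutoff is not stated in this thread, and `effectiveCutoffHz` is a hypothetical helper.

```javascript
// Assumption: the filter was designed for `designRateHz` input with a
// cutoff of `designCutoffHz`. Both values below are illustrative only.
function effectiveCutoffHz(designCutoffHz, designRateHz, actualRateHz) {
  return (designCutoffHz * actualRateHz) / designRateHz;
}

console.log(effectiveCutoffHz(8000, 48000, 48000)); // 8000
console.log(effectiveCutoffHz(8000, 48000, 44100)); // 7350
```

Under that assumption the cutoff drops by about 8%, which still sits well above the frequency range that carries most of the information in speech.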