Detecting Speech in Audio + Corresponding Frame + Single Speaker

mvcisback commented 9 years ago

Given:

Speech is likely to occur in a certain frequency range, R
Assume video and audio are synced
- Time warping could be done to ensure this
Only 1 speaker is speaking

Then the speech corresponds to activation in the spectrogram in the range R.

Finding the corresponding frame is then a matter of mapping each audio sample to its associated video frame. Because the video frame rate is much lower than the audio sample rate, many audio samples fall within a single frame, so some simple binning will be required.
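A minimal sketch of that binning step, assuming the audio and video start at the same instant and both rates are constant. The function names `sample_to_frame` and `bin_samples_by_frame` are made up for illustration and are not part of the project:

```python
def sample_to_frame(sample_index, sample_rate, fps):
    """Index of the video frame on screen while this audio sample plays.

    Assumes audio and video start together and both rates are constant.
    """
    # time in seconds = sample_index / sample_rate; frame = floor(time * fps)
    return int(sample_index * fps // sample_rate)


def bin_samples_by_frame(samples, sample_rate, fps):
    """Group audio samples into one list per video frame."""
    bins = []
    for i, value in enumerate(samples):
        frame = sample_to_frame(i, sample_rate, fps)
        # Grow the list of bins until the target frame exists.
        while len(bins) <= frame:
            bins.append([])
        bins[frame].append(value)
    return bins
```

At 44100 Hz audio and 30 fps video, for example, each frame collects 1470 samples: sample 1469 falls in frame 0 and sample 1470 in frame 1.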

ghost commented 9 years ago

Since there's only one speaker, we're just looking at the log energy at each frame. Look up papers on speech activity detection; there are fancier models out there to distinguish speech activity from noise activity.
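For illustration, a rough sketch of per-frame log energy, assuming the audio is a plain list of float samples and that "frame" means a video frame. The names `log_energy` and `frame_log_energies` and the `floor` parameter are hypothetical choices, not anything from the project:

```python
import math


def log_energy(samples, floor=1e-12):
    """Mean energy of a block of samples, in dB.

    `floor` is an assumed guard value that avoids log(0) on silent blocks.
    """
    energy = sum(s * s for s in samples) / max(len(samples), 1)
    return 10.0 * math.log10(max(energy, floor))


def frame_log_energies(samples, sample_rate, fps):
    """Log energy of the audio falling inside each video frame."""
    samples_per_frame = sample_rate / fps
    # Drop any trailing partial frame.
    n_frames = int(len(samples) / samples_per_frame)
    energies = []
    for k in range(n_frames):
        start = int(round(k * samples_per_frame))
        end = int(round((k + 1) * samples_per_frame))
        energies.append(log_energy(samples[start:end]))
    return energies
```

Frames with speech should then show up as clearly higher values than the silent ones, at least when the background noise is low.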

In reply to Marcell Vazquez-Chanlatte's post of Saturday, October 25, 2014.

Thanks, Best Regards, Ramin

ghost commented 9 years ago

Comparison of Voice Activity Detection Algorithms for VoIP http://homepage.tudelft.nl/w5p50/pdffiles/Comparison%20of%20Voice%20Activity%20Detection%20Algorithms%20for%20VoIP.pdf

mvcisback commented 9 years ago

Nice! For small windows I've also seen a lot of papers discussing Teager-Kaiser energy. I don't know exactly what it is, but it seems to be an alternative measure of activity that outperforms plain intensity measurement for small windows.
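For reference, the discrete Teager-Kaiser energy operator is usually written as psi[n] = x[n]^2 - x[n-1] * x[n+1]. A minimal sketch follows; the function name `teager_kaiser` is just an illustrative choice:

```python
def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator.

    psi[n] = x[n]^2 - x[n-1] * x[n+1], computed for interior samples only,
    so the result is two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
```

For a pure sinusoid A*cos(w*n) the output is the constant A^2 * sin(w)^2, so the operator responds to both amplitude and frequency while needing only three samples per output value, which is presumably why it keeps coming up for small windows.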

ghost commented 9 years ago

Yeah, there are decades of papers on the subject! I think for our sake, we can do a simple energy detection. Worst-case scenario, if there is noise, we do spectral subtraction first to get rid of it, then look for high energies in the signal.
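One possible shape for the simple energy detector, sketched under a few assumptions: the input is a list of per-frame log energies in dB, the recording contains at least some non-speech frames, and the noise floor can be estimated from the quietest frames. The name `detect_speech` and the defaults for `margin_db` and `noise_fraction` are hypothetical and would need tuning. Spectral subtraction is not shown here.

```python
def detect_speech(log_energies_db, margin_db=10.0, noise_fraction=0.1):
    """Flag frames whose log energy is well above an estimated noise floor.

    The noise floor is taken as the mean of the quietest `noise_fraction`
    of frames (an assumption: some frames contain no speech).
    """
    if not log_energies_db:
        return []
    ordered = sorted(log_energies_db)
    n_noise = max(1, int(len(ordered) * noise_fraction))
    noise_floor = sum(ordered[:n_noise]) / n_noise
    threshold = noise_floor + margin_db
    return [e > threshold for e in log_energies_db]
```

A fixed margin above the noise floor is about the simplest decision rule there is; the VAD comparison paper linked above covers more robust alternatives.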

mvcisback commented 9 years ago

I think this ticket may be too general, so I'm going to close it for now as we split things into smaller issues.
