mvcisback closed this issue 9 years ago
Since there's only one speaker, we're just looking at the log energy of each frame. Look up papers on speech activity detection; there are fancier models out there for distinguishing speech activity from noise activity.
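For reference, a minimal sketch of the per-frame log energy idea. "Frame" here means a short audio analysis window, not a video frame. The function name `frame_log_energy`, the window and hop sizes, and the toy signal are all made up for illustration and are not taken from the project code.

```python
import math

def frame_log_energy(samples, frame_len, hop, eps=1e-12):
    """Return the natural-log energy of each audio analysis frame.

    samples   : list of float audio samples
    frame_len : samples per analysis frame (assumed value, tune as needed)
    hop       : step between frame starts, in samples
    eps       : small constant so silent frames do not hit log(0)
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        energies.append(math.log(energy + eps))
    return energies

# Toy signal: 100 near-silent samples followed by 100 samples of a louder tone.
quiet = [0.001 * math.sin(0.3 * n) for n in range(100)]
loud = [0.5 * math.sin(0.3 * n) for n in range(100)]
log_e = frame_log_energy(quiet + loud, frame_len=50, hop=50)
```

The two frames covering the louder tone should come out with a clearly higher log energy than the two quiet ones.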
On Saturday, October 25, 2014, Marcell Vazquez-Chanlatte <notifications@github.com> wrote:
Given:
- Speech is likely to occur in a certain frequency range, R
- Assume video and audio are synced
- Time-warping could be done to ensure this
- Only 1 speaker is speaking
Then the speech corresponds to activation in the spectrogram in the range R.
Finding the corresponding frame is then a matter of mapping the audio sample to the associated frame. Because the frame rate is much slower than the audio sample rate, some simple binning will be required.
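A rough sketch of that binning step, assuming audio and video start at the same instant and both run at constant rates. The helper names `sample_to_frame` and `energy_per_video_frame`, and the 8000 Hz / 25 fps figures, are hypothetical and only there for illustration.

```python
import math

def sample_to_frame(sample_index, sample_rate, fps):
    """Index of the video frame on screen when the given audio sample occurs."""
    return math.floor(sample_index * fps / sample_rate)

def energy_per_video_frame(samples, sample_rate, fps):
    """Sum the squared audio samples that fall inside each video frame."""
    n_frames = sample_to_frame(len(samples) - 1, sample_rate, fps) + 1
    bins = [0.0] * n_frames
    for n, x in enumerate(samples):
        bins[sample_to_frame(n, sample_rate, fps)] += x * x
    return bins

# 8000 Hz audio against 25 fps video gives 320 audio samples per video frame.
samples = [0.0] * 320 + [1.0] * 320
bins = energy_per_video_frame(samples, sample_rate=8000, fps=25)
```

Here the first video frame collects only silence and the second collects all of the energy, so a video frame can be flagged as "speaking" based on its bin.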
— Reply to this email directly or view it on GitHub https://github.com/mvcisback/CS598ps_project/issues/5.
Thanks, Best Regards, Ramin
Comparison of Voice Activity Detection Algorithms for VoIP http://homepage.tudelft.nl/w5p50/pdffiles/Comparison%20of%20Voice%20Activity%20Detection%20Algorithms%20for%20VoIP.pdf
Nice! For small windows I've also seen a lot of papers discussing Teager-Kaiser energy. I don't know what it is exactly, but it seems to be an alternative measure of activity that outperforms plain intensity for small windows.
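For what it's worth, the discrete Teager-Kaiser energy operator is usually written as psi[n] = x[n]^2 - x[n-1] * x[n+1], so it only needs three neighbouring samples. A small sketch follows; the function name and the test tone are made up here.

```python
import math

def teager_kaiser_energy(samples):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input because the first and
    last samples have no neighbour on one side.
    """
    return [
        samples[n] ** 2 - samples[n - 1] * samples[n + 1]
        for n in range(1, len(samples) - 1)
    ]

# For a pure tone A*sin(w*n) the operator is constant: A^2 * sin(w)^2.
amplitude, omega = 0.5, 0.3
tone = [amplitude * math.sin(omega * n) for n in range(50)]
tke = teager_kaiser_energy(tone)
expected = (amplitude ** 2) * math.sin(omega) ** 2
```

Because the result depends on both amplitude and frequency, it responds to more than raw intensity, which may be why it shows up in the short-window papers.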
Yeah, there are decades of papers on the subject! I think for our sake we can do simple energy detection. Worst-case scenario, if there is noise, we do spectral subtraction first to get rid of it and then look for high energies in the signal.
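A minimal sketch of what simple energy detection could look like, assuming the first few frames of the recording contain only background noise (that assumption, the name `detect_activity`, and the margin value are all illustrative). It does not do the spectral subtraction step.

```python
def detect_activity(log_energies, noise_frames=5, margin=3.0):
    """Flag frames whose log energy exceeds the estimated noise floor.

    log_energies : per-frame log energies (natural log)
    noise_frames : number of leading frames assumed to be noise only
    margin       : how far above the noise floor counts as speech
    """
    head = log_energies[:noise_frames]
    if not head:
        return []
    noise_floor = sum(head) / len(head)
    return [e > noise_floor + margin for e in log_energies]

# Five noise-only frames, two loud frames, then noise again.
log_e = [-10.0, -10.2, -9.8, -10.1, -9.9, -2.0, -1.5, -9.7]
active = detect_activity(log_e)
```

Only the two loud frames should be flagged; the margin would need tuning on real recordings.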
I think this ticket may be too general, so I'm going to close it for now as we split things into smaller issues.