otalk / hark

Converts an audio stream to speech events in the browser

investigate using RFC 6465 algorithm for audio level calculation / speaking events #6

Open fippo opened 10 years ago

fippo commented 10 years ago

I suspect the current max(freq) strategy is somewhat unstable, since it just takes the maximum and ignores the frequency.

getByteTimeDomainData may enable us to calculate the root mean square according to http://tools.ietf.org/html/rfc6465#appendix-A.1. If that doesn't work... we'll have to figure out something with the FFT data.
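To make the RMS idea concrete, here is a minimal sketch of a level calculation over byte time-domain samples, loosely in the spirit of the RFC 6465 appendix (RMS converted to decibels and clamped to a fixed range). The function name `rmsLevelDb` and the constant names are illustrative assumptions, not part of hark or of the RFC text.

```javascript
// Sketch only: RMS level in dB relative to full scale, computed from the
// unsigned bytes that AnalyserNode.getByteTimeDomainData() writes.
// In that format 128 represents zero amplitude.
const MIN_AUDIO_LEVEL = -127; // assumed floor for silence
const MAX_AUDIO_LEVEL = 0;    // full scale

// `samples` is a Uint8Array (hypothetical helper, not an existing hark API).
function rmsLevelDb(samples) {
  if (samples.length === 0) return MIN_AUDIO_LEVEL;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const normalized = (samples[i] - 128) / 128; // map byte to [-1, 1)
    sumSquares += normalized * normalized;
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  if (rms === 0) return MIN_AUDIO_LEVEL; // avoid log10(0)
  const db = 20 * Math.log10(rms);
  // Clamp to the assumed [-127, 0] range.
  return Math.min(MAX_AUDIO_LEVEL, Math.max(MIN_AUDIO_LEVEL, db));
}
```

In the browser, the input would presumably be a `Uint8Array` of length `analyser.fftSize` filled by `analyser.getByteTimeDomainData(buffer)` on each polling tick. Whether a threshold on this value is more stable than the max(freq) approach would still need to be measured.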

jokesterfr commented 10 years ago

I agree with this. Do you have anything new regarding a more efficient VAD technique involving frequency ranges? I read that the human voice is often between 100 Hz and 1000 Hz.
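As a rough illustration of restricting the analysis to a frequency range, the sketch below averages only the FFT bins that fall inside a band. It assumes the data comes from `AnalyserNode.getByteFrequencyData()`, where bin `i` corresponds to roughly `i * sampleRate / fftSize` Hz. The function `bandLevel` and its parameters are hypothetical, and the 100 Hz to 1000 Hz band is simply the range mentioned above, not a tuned value.

```javascript
// Sketch only: mean magnitude of the FFT bins inside [lowHz, highHz].
// `freqData` stands in for the Uint8Array filled by
// AnalyserNode.getByteFrequencyData(); its length is fftSize / 2.
function bandLevel(freqData, sampleRate, fftSize, lowHz, highHz) {
  const binWidth = sampleRate / fftSize; // Hz covered by each bin
  const first = Math.max(0, Math.ceil(lowHz / binWidth));
  const last = Math.min(freqData.length - 1, Math.floor(highHz / binWidth));
  if (last < first) return 0; // band lies outside the available bins
  let sum = 0;
  for (let i = first; i <= last; i++) sum += freqData[i];
  return sum / (last - first + 1);
}
```

For example, with a 48000 Hz sample rate and an `fftSize` of 512, each bin is 93.75 Hz wide, so the 100 Hz to 1000 Hz band covers bins 2 through 10. A larger `fftSize` would give finer resolution at the low end.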

fippo commented 10 years ago

http://webee.technion.ac.il/Sites/People/IsraelCohen/Publications/CSL_June2013.pdf is what I would currently prefer (not the dominating speaker aspect, but the others). But time...

jokesterfr commented 10 years ago

That seems too complicated for me to help with. I mean, I'm a developer, not a PhD researcher in voice recognition. However, if you already know of any implementation of a good VAD script, in whatever language it is, I can give it a try.