
ODAS: Open embeddeD Audition System
MIT License

How to improve voice recognition using odas? #54

Open JaySpeech opened 6 years ago

JaySpeech commented 6 years ago

When two people speak at the same time, there are four channels in the postfilter output. Which channel should I send to the voice recognition engine?

FrancoisGrondin commented 6 years ago

The channel with speech will match the position in the tracking results, e.g. if I have a tracked source in position 2, then the 2nd channel will contain the corresponding separated stream.

However, be careful: feeding the postfiltered output directly to an ASR system that was not trained on a dataset filtered the same way will give poor results, because of the domain mismatch. Postfiltering makes the audio easier on the human ear, but it also introduces artifacts that are picked up by the ASR. I suggest you use the separated stream instead, which does not include those artifacts.

Cheers!

JaySpeech commented 6 years ago

Thank you for your reply. If two people speak at the same time at different positions, two sources will be tracked. Which source is the best choice?

FrancoisGrondin commented 6 years ago

It really depends which source you want to use for voice recognition. However, let me warn you: you'll probably get quite bad ASR results if two people are speaking at the same time, since each source will still corrupt the other (even after separation, interference is reduced but not removed). Unless you use a language model with a very limited dictionary, trained in similar conditions, the WER will be quite high. The cocktail party problem is still a very hot research topic; deep learning makes things better, but we still have to work hard to reach human-like recognition performance.
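If "which source" means "whichever one is currently speaking", one practical heuristic is to read the tracker's JSON messages and pick the slot with the highest activity. A minimal sketch, assuming the tracking sink's JSON layout with a `src` array whose entries carry `id` and `activity` fields (an `id` of 0 marks an empty slot):

```python
import json

def most_active_source(tracking_json: str):
    """Parse one ODAS tracking message and return (slot_index, source_dict)
    for the most active tracked source, or None if every slot is empty.
    The slot index is also the channel index in the separated stream."""
    msg = json.loads(tracking_json)
    best = None
    for idx, src in enumerate(msg.get("src", [])):
        if src.get("id", 0) == 0:  # id 0 = no source tracked in this slot
            continue
        if best is None or src["activity"] > best[1]["activity"]:
            best = (idx, src)
    return best
```

The returned slot index tells you which channel of the separated output to forward; but as noted above, during overlapped speech even the best channel will carry residual interference.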