hcmlab / vadnet

Real-time Voice Activity Detection in Noisy Environments using Deep Neural Networks
http://openssi.net
GNU Lesser General Public License v3.0

Comparison against other approaches? #9

Closed by AdolfVonKleist 5 years ago

AdolfVonKleist commented 5 years ago

Hi, I have just been trying this library out and it seems to work very well, much better than WebRTC or the other available frameworks I have tried, and it is extremely fast (0.05xRT on average), but my tests have so far been largely ad hoc. I went through your Interspeech paper:

but I could not find any direct comparison against other methods. Do you perhaps have any other publications or results that address this topic?
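For reference, a minimal sketch of the kind of ad hoc real-time-factor check described above, using the WebRTC VAD as the baseline. The file name, sample rate, frame size, and aggressiveness setting are placeholders, not details taken from this thread:

```python
# Rough timing check: run WebRTC VAD over a 16 kHz mono 16-bit WAV file and
# report the real-time factor (processing time / audio duration).
import time
import wave

import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000          # webrtcvad supports 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30                # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 2 bytes per sample

vad = webrtcvad.Vad(3)       # aggressiveness 0 (least) .. 3 (most)

with wave.open("test.wav", "rb") as wav:   # assumed 16 kHz, mono, 16-bit
    pcm = wav.readframes(wav.getnframes())

# split into full frames only; webrtcvad rejects partial frames
frames = [pcm[i:i + FRAME_BYTES]
          for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]

start = time.perf_counter()
decisions = [vad.is_speech(frame, SAMPLE_RATE) for frame in frames]
elapsed = time.perf_counter() - start

audio_seconds = len(frames) * FRAME_MS / 1000
print(f"speech frames: {sum(decisions)}/{len(decisions)}")
print(f"real-time factor: {elapsed / audio_seconds:.4f}")
```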

shelm commented 5 years ago

Hi,

thank you very much for your comment. To the best of my knowledge, vadnet has not (yet) been evaluated outside of the INTERSPEECH challenge last year, where we compared our learned features against hand-crafted ones with respect to their performance in recognizing emotions. I will leave this issue open anyway in case @frankenjoe would like to add something.

frankenjoe commented 5 years ago

In the INTERSPEECH paper you mentioned, the proposed architecture was compared with other methods, mainly conventional ones based on hand-crafted features, but also two deep learning approaches (end2you and auDeep). This was in a different context (emotions and crying), yet it showed that the architecture gives comparable results. We see vadnet just as another use case of the same architecture for a different task, but on a much larger data set, which obviously suits an end-to-end approach. The main idea was also to create a system that runs in real time, which is much harder to evaluate anyway. To cut a long story short, there is no evaluation other than people like you saying they find it useful, which I think is also a sort of evaluation :)

AdolfVonKleist commented 5 years ago

@frankenjoe thanks for your reply. I am also interested in the performance against 'traditional' methods, such as the WebRTC VAD, or even the 'classic' approach, as described in:

My impression so far is that it compares very favorably against these non-DNN alternatives (and, as you mention, the referenced literature on auDeep and the IDIAP work suggests the same) while still being very fast, but I wondered whether there was some additional formal literature with direct comparisons that I might have missed. In any case, it looks like it is worth making that effort directly myself. Thanks again for sharing the project and for your additional thoughts.
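A minimal sketch of what such a direct, frame-level comparison could look like, assuming per-frame reference labels and per-frame speech scores from each system are already available and time-aligned. The arrays, system names, and the 0.5 threshold below are placeholders for illustration only:

```python
# Frame-level comparison of two VAD systems against a common reference.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# 0 = non-speech, 1 = speech; one entry per analysis frame (placeholder data)
reference = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 0])

# speech scores from two systems, aligned to the same frames (placeholder data)
scores = {
    "vadnet": np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.6, 0.9, 0.2, 0.1]),
    "webrtc": np.array([0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]),
}

for name, probs in scores.items():
    preds = (probs >= 0.5).astype(int)   # binarize at an arbitrary threshold
    prec, rec, f1, _ = precision_recall_fscore_support(
        reference, preds, average="binary", zero_division=0
    )
    auc = roc_auc_score(reference, probs)
    print(f"{name}: precision={prec:.2f} recall={rec:.2f} f1={f1:.2f} auc={auc:.2f}")
```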