hcmlab / vadnet

Real-time Voice Activity Detection in Noisy Environments using Deep Neural Networks
http://openssi.net
GNU Lesser General Public License v3.0
426 stars 77 forks

I have some problems with this project #13

Open JunGenius opened 5 years ago

JunGenius commented 5 years ago

Hello, author! First of all, thank you very much for this project; it gave me the ideas I needed for my own implementation. I have some questions:

  1. I noticed that the neural network makes one classification decision per second of audio. But a single second can contain both speech and noise, for example 30% noise and 70% voice. How can they be distinguished?
  2. If a voice lasts for 1.2 seconds, the remaining 0.2 seconds of speech may be classified as noise, which leaves the speech segment incomplete. How can this be solved?
  3. I want to shorten the classification interval, for example to 500 ms or 250 ms. Should I split the training speech and noise into 500 ms or 250 ms files and retrain a new model? Would that reduce the recognition rate?

I am looking forward to your answer. Thank you again.

frankenjoe commented 5 years ago
  1. No, a decision is made per frame (e.g. one second). But you can do two things: train your network on a shorter window size (see e.g. #7), and increase the overlap between frames, e.g. make a prediction every 0.1 s and apply some post-processing to the resulting sequence of decisions.
  2. Again, I suggest increasing the overlap between frames.
  3. See #7
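
The overlap-plus-post-processing idea from points 1 and 2 could look roughly like the sketch below. This is not vadnet's code: `classify` is a hypothetical stand-in for whatever model returns 1 (voice) or 0 (noise) for one window, and the majority-vote smoothing and the "hangover" step are just two common post-processing choices, assumed here for illustration. The window size, hop size, and filter widths are placeholder values.

```python
from typing import Callable, List, Sequence, Tuple


def sliding_windows(num_samples: int, sample_rate: int,
                    window_s: float, hop_s: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping analysis windows."""
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, hop)]


def majority_smooth(decisions: Sequence[int], width: int = 5) -> List[int]:
    """Majority vote over a centered neighbourhood; removes isolated flips."""
    half = width // 2
    smoothed = []
    for i in range(len(decisions)):
        neighbourhood = decisions[max(0, i - half): i + half + 1]
        smoothed.append(1 if 2 * sum(neighbourhood) > len(neighbourhood) else 0)
    return smoothed


def add_hangover(decisions: Sequence[int], hang: int = 2) -> List[int]:
    """Keep the 'voice' label for `hang` extra hops after speech ends,
    so that the tail of an utterance is not cut off."""
    out = list(decisions)
    remaining = 0
    for i, d in enumerate(decisions):
        if d == 1:
            remaining = hang
        elif remaining > 0:
            out[i] = 1
            remaining -= 1
    return out


def detect(samples: Sequence[float], sample_rate: int,
           classify: Callable[[Sequence[float]], int],
           window_s: float = 1.0, hop_s: float = 0.1) -> List[int]:
    """One smoothed voice/noise decision per hop (every `hop_s` seconds).

    `classify` is a hypothetical per-window model, not part of vadnet.
    """
    windows = sliding_windows(len(samples), sample_rate, window_s, hop_s)
    raw = [classify(samples[a:b]) for a, b in windows]
    return add_hangover(majority_smooth(raw))
```

With a 1 s window and a 0.1 s hop, each instant of audio is covered by up to ten windows, so a 1.2 s utterance is no longer split at a hard one-second boundary. Lengthening `hang` trades more trailing noise for fewer clipped word endings.
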
JunGenius commented 5 years ago

OK, thank you very much.