jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

Training Datamake #12

Open shezanmirzan opened 6 years ago

shezanmirzan commented 6 years ago

As per your comment in one of the closed issues, you mentioned that you concatenate different sound effects into one long noise waveform, then repeatedly pick a random speech utterance and add it to the noise at various SNRs until the end of the long noise waveform is reached.

But the datamake script you uploaded in the speech enhancement toolkit does something different: it picks random intervals from the long concatenated noise waveform and mixes them with different speech files.

So in the second case, one speech utterance is not added across the whole of the long concatenated noise. Instead, a random interval of the long concatenated noise is mixed with each speech file.

Can you explain why you took the first approach to create the dataset for training the VAD model? And second, how can I do the same thing you are doing? Should I use FaNT, or does your make_train_noisy.m have options for this?

jtkim-kaist commented 6 years ago
  1. The VAD and speech enhancement (SE) datasets must be different. The reason is that SE assumes the incoming signal is always noisy speech, never noise only. We also tried using the VAD dataset (the first approach, as you mentioned) to train the SE model but, as expected, training failed (the problem becomes harder because of the noise-only segments). In contrast, for the VAD dataset, the ratio between noise-only and noisy-speech segments should be roughly equal in order to prevent the class imbalance problem. These are the reasons why we follow different methods to make the VAD and SE datasets. To make the VAD dataset, just follow the way you described: concatenate different sound effects into one long noise waveform, then repeatedly pick a random speech utterance and add it to the noise at various SNRs until the end of the long noise waveform is reached. Additionally, YOU MUST VERIFY THE RATIO BETWEEN NOISE SEGMENTS (labeled 0) AND SPEECH SEGMENTS (labeled 1); ideally, 1 : 1 is best.
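A minimal sketch of the dataset construction described above, assuming NumPy arrays at a common sample rate. The function name, the RMS-based SNR scaling (instead of the ITU active-speech-level measure that voicebox/FaNT use), and the equal-length noise-only gap after each utterance (to keep the 0/1 label ratio near 1:1) are all my assumptions, not the author's actual script:

```python
import numpy as np

def build_vad_dataset(noise_long, utterances, snrs_db, seed=0):
    """Mix speech utterances into one long noise signal, front to back.

    Each randomly chosen utterance is scaled to a random SNR from snrs_db
    and added to the next unused stretch of noise; an equal-length
    noise-only gap is left after it so speech/non-speech labels stay ~1:1.
    Returns the noisy waveform and per-sample labels (1 = speech, 0 = noise).
    """
    rng = np.random.default_rng(seed)
    out = noise_long.astype(float).copy()
    labels = np.zeros(len(noise_long), dtype=np.int8)
    pos = 0
    while pos < len(noise_long) - 1:
        s = utterances[rng.integers(len(utterances))].astype(float)
        n = min(len(s), len(noise_long) - pos)   # last segment may be shorter
        seg = out[pos:pos + n]                   # still pure noise at this point
        snr = float(rng.choice(snrs_db))
        # Scale speech so 10*log10(P_speech / P_noise) == snr on this segment
        # (simplified RMS power, not the ITU-T P.56 active level).
        gain = np.sqrt(np.mean(seg**2) * 10**(snr / 10)
                       / (np.mean(s[:n]**2) + 1e-12))
        out[pos:pos + n] += gain * s[:n]
        labels[pos:pos + n] = 1
        pos += 2 * n   # skip a noise-only stretch of the same length
    return out, labels
```

The noise-only gap is one simple way to satisfy the 1:1 ratio check; in practice you would also want to account for silent frames inside the utterances themselves when verifying the final label balance.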

  2. Both the FaNT tool and v_addnoise.m in voicebox (implemented in MATLAB) follow the ITU standard. In my experiments they didn't show a significant difference, so I prefer voicebox because it is easier to use. Use whichever you want. The make_train_noisy.m script is ONLY for the speech enhancement toolkit.

shezanmirzan commented 6 years ago

I have one more doubt; I would be grateful if you could help me out with this. Suppose I create a long noise file of 40 minutes and the speech utterance is 1 minute long. Using v_addnoise, I just get a 1-minute noisy speech output in which a noise interval is randomly picked from the long noise file and added to the speech.

According to you, however, we need to add 1-minute speech files over the full 40 minutes of noise at different SNRs. How do we do that using v_addnoise? Is it even possible with v_addnoise, or should I try the FaNT tool?

In any case, a big thanks for clearing up my earlier doubt.

jtkim-kaist commented 6 years ago

The inputsardor for v_addnoise.m should be:

step 1: noise(1 : length(speech1)), speech1
step 2: noise(length(speech1)+1 : length(speech1)+length(speech2)), speech2
...