jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

Training data #4

Closed by haha912 6 years ago

haha912 commented 6 years ago

Hello, I am trying to use bDNN to distinguish human voice from natural sounds. Could you tell me how much data you used to train the network, and what kinds of data should be included in the training set besides the example data you have provided, such as noise data (dog barks, knocking, etc.)?

jtkim-kaist commented 6 years ago

The paper J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1, describes the training data as follows:

For training, we used the training dataset in the TIMIT corpus [35] as the speech dataset. Given that TIMIT utterances have considerably shorter silences than speech, class imbalance problems may occur. To address these problems, 2-s long silence segments were added before and after each utterance. We used the ground truth label in TIMIT. For the noise dataset, a sound effect library [36] containing approximately 20 000 sound effects was used. Initially, we randomly selected 5000 sound effects and concatenated them into a long sound wave. We then randomly selected an utterance from the silence-added TIMIT train dataset and added it to the long sound wave with a randomly selected SNR between −10 and 12 dB; this was repeated until the end of the long sound wave. We used FaNT [37] for adding the noise to the utterance; 95% of the training dataset was used to train the model, and the remaining 5% in model validation.
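The mixing step in the quoted passage can be sketched roughly as follows. This is only a minimal illustration, not the authors' pipeline: the paper used FaNT, whereas this sketch uses a plain mean-power SNR definition, and the function names `pad_with_silence` and `noise_gain_for_snr`, the 16 kHz sample rate, and the random arrays standing in for a TIMIT utterance and a slice of the concatenated sound effects are all assumptions made for illustration.

```python
import numpy as np


def pad_with_silence(utterance, sample_rate, seconds=2.0):
    """Add `seconds` of digital silence before and after the utterance."""
    pad = np.zeros(int(seconds * sample_rate), dtype=utterance.dtype)
    return np.concatenate([pad, utterance, pad])


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain to apply to `noise` so that mean speech power divided by
    mean scaled-noise power equals `snr_db` (simple mean-power SNR,
    an assumption; FaNT measures levels differently)."""
    speech_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    return np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))


rng = np.random.default_rng(0)
sample_rate = 16000  # assumed sample rate

# Synthetic stand-ins: 1 s of "speech" and a matching slice of "noise".
utterance = rng.standard_normal(sample_rate)
padded = pad_with_silence(utterance, sample_rate)
noise = rng.standard_normal(padded.size)

# Randomly chosen SNR in the range given in the paper.
snr_db = rng.uniform(-10.0, 12.0)

# The gain is computed from the unpadded utterance so that the added
# silence does not lower the measured speech power.
gain = noise_gain_for_snr(utterance, noise, snr_db)
noisy = padded + gain * noise
```

In a real setup, `utterance` and `noise` would be loaded from audio files, and the step would be repeated with a new utterance and a new SNR until the end of the long noise wave is reached.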

haha912 commented 6 years ago

Thank you so much! I have another question about the bDNN training code. As shown below, it gets a batch of training data with the 'next_batch' function in every iteration:

    for itr in range(max_epoch):
        train_inputs, train_labels = train_data_set.next_batch(batch_size)

However, 'next_batch' returns only part of an utterance's features (when batch_size < num_samples) or the whole feature matrix of an utterance (when batch_size > num_samples, padded with zeros). In other words, if there are 2000 utterances in the training set and I set max_epoch to 1000, then at most 1000 utterances are used for training (when batch_size > num_samples). Is that right? Sorry, I have only recently started learning DL. Thanks again.

jtkim-kaist commented 6 years ago

For example, assume that you have 5 utterances for training. If we extract acoustic features from these utterances at the frame level, the 5 utterances become a matrix of shape (# samples, feature dimension). Here, # samples depends on the utterance lengths and the frame shift size. In general, neural network models are trained with the mini-batch method, and in the code the next_batch function generates a mini-batch from the whole dataset.

As you said, if max_epoch is not large enough to cover the whole dataset, only part of the dataset will be used to train the model. For example, if the whole dataset has the shape (5000, 50), the mini-batch size is 50, and max_epoch is just 50, only 50 × 50 = 2500 samples are used for training. In general, however, max_epoch is set large enough. The padding is there to handle the last mini-batch. Also, strictly speaking, an epoch is one pass of updates over the whole dataset, not a single mini-batch update, so the name max_epoch is a little inaccurate.
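To make the arithmetic concrete, here is a small sketch of a sequential mini-batch reader. `SequentialBatcher` is a hypothetical stand-in written for this illustration, not the repository's actual data reader, and its zero-padding and wrap-around behavior are assumptions based on the description above.

```python
import math

import numpy as np


class SequentialBatcher:
    """Hypothetical reader that walks through a (# samples, feature dim) matrix."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
        self.pos = 0
        self.samples_seen = 0  # number of real (unpadded) samples served

    def next_batch(self, batch_size):
        end = self.pos + batch_size
        x = self.features[self.pos:end]
        y = self.labels[self.pos:end]
        n_real = x.shape[0]
        self.samples_seen += n_real
        if n_real < batch_size:
            # Last mini-batch: pad with zeros up to batch_size, then restart.
            pad = batch_size - n_real
            x = np.vstack([x, np.zeros((pad, x.shape[1]), dtype=x.dtype)])
            y = np.concatenate([y, np.zeros(pad, dtype=y.dtype)])
            self.pos = 0
        else:
            # Wrap to the start when the end of the dataset is reached exactly.
            self.pos = end % len(self.features)
        return x, y


# Dataset of shape (5000, 50), as in the example above.
features = np.zeros((5000, 50), dtype=np.float32)
labels = np.zeros(5000, dtype=np.int64)
data = SequentialBatcher(features, labels)

batch_size, max_epoch = 50, 50
for itr in range(max_epoch):
    train_inputs, train_labels = data.next_batch(batch_size)

print(data.samples_seen)  # 2500: only half of the 5000 samples were used

# Iterations needed for one full pass over the dataset.
iterations_per_pass = math.ceil(len(features) / batch_size)
print(iterations_per_pass)  # 100
```

With these numbers, max_epoch would need to be at least 100 for every sample to be seen once, and a multiple of that for several true epochs.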

haha912 commented 6 years ago

Thank you! I have tried different values of batch_size and 'epoch' on a dataset that combines TIMIT with NOISEX-92 (about 1800 files), but I am still not able to train an effective '.pb' model like the one you provided in the backup folder (T^T). Maybe I am doing something wrong somewhere.