The paper (J. Kim and M. Hahn, "Voice Activity Detection Using an Adaptive Context Attention Model," IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1-1) describes the training setup as follows:
> For training, we used the training dataset in the TIMIT corpus [35] as the speech dataset. Given that TIMIT utterances have considerably shorter silences than speech, class imbalance problems may occur. To address these problems, 2-s long silence segments were added before and after each utterance. We used the ground truth label in TIMIT. For the noise dataset, a sound effect library [36] containing approximately 20 000 sound effects was used. Initially, we randomly selected 5000 sound effects and concatenated them into a long sound wave. We then randomly selected an utterance from the silence-added TIMIT train dataset and added it to the long sound wave with a randomly selected SNR between −10 and 12 dB; this was repeated until the end of the long sound wave. We used FaNT [37] for adding the noise to the utterance; 95% of the training dataset was used to train the model, and the remaining 5% in model validation.
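For anyone trying to reproduce that recipe, a rough numpy sketch of the two steps (padding 2-s silences and mixing noise at a random SNR) might look like the following. This is not FaNT and not the paper's actual code; `add_silence` and `mix_at_snr` are just illustrative names, and the waveforms below are random stand-ins for real TIMIT/sound-effect audio.

```python
import numpy as np

SAMPLE_RATE = 16000  # TIMIT sampling rate

def add_silence(speech, silence_sec=2.0, sr=SAMPLE_RATE):
    """Pad a silence segment before and after an utterance (2 s in the paper)."""
    pad = np.zeros(int(silence_sec * sr), dtype=speech.dtype)
    return np.concatenate([pad, speech, pad])

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR, then add it.
    Assumes the noise segment is at least as long as the speech."""
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: one silence-padded utterance mixed at a random SNR in [-10, 12] dB
rng = np.random.default_rng(0)
utterance = rng.standard_normal(3 * SAMPLE_RATE).astype(np.float32)   # stand-in for a TIMIT wav
noise_wave = rng.standard_normal(10 * SAMPLE_RATE).astype(np.float32)  # stand-in for concatenated sound effects
noisy = mix_at_snr(add_silence(utterance), noise_wave, rng.uniform(-10, 12))
```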
Thank you so much! I have another question about the training code of bDNN. As shown in the code, it gets a batch of training data with the `next_batch` function in every iteration:
```python
for itr in range(max_epoch):
    train_inputs, train_labels = train_data_set.next_batch(batch_size)
```
but `next_batch` only returns part of one utterance's features (when batch_size < num_samples) or the whole feature matrix of one utterance (when batch_size > num_samples, padded with zeros). That is to say, if there are 2000 utterances in the training set and I set max_epoch to 1000, then at most 1000 utterances are used for training (when batch_size > num_samples), is that right? Sorry, I only started learning DL recently. Thanks again.
For example, assume that you have 5 utterances for training. If we extract acoustic features from these utterances at the frame level, the 5 utterances become a matrix of shape (# samples, feature dimension). Here, # samples depends on the utterance lengths and the frame shift size. In general, neural network models are trained with the mini-batch method, and in the code the `next_batch` function generates a mini-batch from the whole dataset. As you said, if max_epoch is not large enough to cover the whole dataset, only part of the dataset will be used to train the model. For example, if the whole dataset has shape (5000, 50), the mini-batch size is 50, and max_epoch is just 50, only 2500 samples are used for training. In general, though, max_epoch is set large enough. The padding is there to handle the last mini-batch. Also, strictly speaking, an epoch is one update pass over the whole dataset, not a single mini-batch update, so the name "epoch" here is a bit inaccurate.
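To make that mini-batch and padding behavior concrete, here is a minimal, self-contained sketch. It is not the actual DataSet class in this repo; it only illustrates the semantics described above (sequential batches, zero-padding of the last partial batch, and the fact that the loop touches max_epoch * batch_size samples).

```python
import numpy as np

class DataSet:
    """Toy dataset: each next_batch call returns one mini-batch taken in order,
    and the final partial batch is zero-padded up to batch_size."""
    def __init__(self, inputs, labels):
        self.inputs, self.labels = inputs, labels
        self.pos = 0

    def next_batch(self, batch_size):
        end = self.pos + batch_size
        x = self.inputs[self.pos:end]
        y = self.labels[self.pos:end]
        if len(x) < batch_size:  # last chunk: pad with zeros instead of wrapping
            pad = batch_size - len(x)
            x = np.vstack([x, np.zeros((pad, x.shape[1]), x.dtype)])
            y = np.concatenate([y, np.zeros(pad, y.dtype)])
            self.pos = 0  # start over on the next call
        else:
            self.pos = end
        return x, y

# With 5000 samples, batch_size = 50 and only 50 iterations,
# the loop sees 50 * 50 = 2500 samples, i.e. half of the dataset.
data = DataSet(np.random.randn(5000, 50).astype(np.float32),
               np.random.randint(0, 2, 5000))
for itr in range(50):
    train_inputs, train_labels = data.next_batch(50)
```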
Thank you! I have tried different batch_size values and different "epoch" settings on a dataset that combines TIMIT with NOISEX-92 (about 1800 files), but I am still not able to train an effective `.pb` like the one you provided in the backup folder (T^T); maybe I am wrong somewhere.
Hello, I am trying to use bDNN to distinguish human voice from natural sounds. Could you tell me how much data you used for training the NN, and what kinds of data should be included in the training set besides the example data you have given, such as noise data (dog barks, knocks, etc.)?