meixitu opened 6 years ago
@jtkim-kaist, sorry to disturb you. Could you help me?
Thanks, Jinhong
Thank you for the detailed question. Since there are several questions, let's solve them step by step.
Your claim may be correct: the TIMIT labels are not perfectly accurate from that point of view. In general, though, VAD labels are made at the utterance level. I think this is because most VAD-related applications, such as speech recognition, operate on utterances. Ideally, perfect frame-level labels would be best, but they are almost impossible to make (it takes too much time).

In summary, VAD is a frame-wise classifier, so your claim is correct in the ideal sense; conventionally, however, the short silence between vowel sounds (the region you pointed out in your figure 1.1) is considered speech, so the high probability is the correct result. If you want to remove such short silences, an unvoiced/voiced sound classifier is more appropriate.

Note that VAD is generally used to detect speech at the utterance level after decision smoothing (VAD + post-processing, called end-point detection). Also, since the neural network is trained on averaged gradients computed from mini-batches, a small number of noisy samples doesn't affect model performance.
I used the .phn files. I labeled the phoneme regions as 1.
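Roughly, the labeling looks like this sketch (a minimal illustration; the silence phone set and the 25 ms / 10 ms framing at 16 kHz are assumptions here, not necessarily the exact values in my scripts):

```python
# Sketch: deriving frame-level VAD labels from a TIMIT .phn file.
# Assumed: 16 kHz audio, 400-sample (25 ms) window, 160-sample (10 ms)
# shift, and silence marked by the TIMIT symbols 'h#', 'pau', 'epi'.

def phn_to_frame_labels(phn_path, n_frames, win_len=400, win_shift=160):
    """Return one 0/1 label per analysis frame."""
    silence_phones = {'h#', 'pau', 'epi'}
    speech = []  # (start_sample, end_sample) of non-silence phones
    with open(phn_path) as f:
        for line in f:
            start, end, phone = line.split()
            if phone not in silence_phones:
                speech.append((int(start), int(end)))

    labels = []
    for i in range(n_frames):
        f_start = i * win_shift
        f_end = f_start + win_len
        # Label 1 if the frame overlaps any non-silence phone region.
        overlaps = any(s < f_end and e > f_start for s, e in speech)
        labels.append(1 if overlaps else 0)
    return labels
```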
If your questions 1 and 2 are solved, I will answer the remaining ones. Thx
Hi @jtkim-kaist, thank you very much for your reply.
Thanks, Jinhong
Hi @jtkim-kaist, I read the code in detail. The ACAM, DNN, and bDNN models sample 7 frames from a 39-frame window as the NN input, so I think it is reasonable to label those 7 frames as speech as long as any silence inside the speech is shorter than 390 ms (which should hold for most speech). The LSTM model stacks 25 consecutive frames as the NN input, so if the silence inside the speech is shorter than 250 ms, it is still reasonable to label it as speech. So I think the labeling is correct and will not degrade performance.
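For example, this sketch is what I mean by sampling context frames (the even spacing is my assumption, not necessarily your exact offsets):

```python
import numpy as np

# Sketch: pick 7 context frames from a 39-frame window centered on frame t.
# Evenly spaced offsets are an assumption; the repo's pattern may differ.

def stack_context(features, t, window=39, n_picked=7):
    """features: (n_frames, feat_dim) array; returns a flat NN input."""
    half = window // 2                                        # 19 frames per side
    offsets = np.linspace(-half, half, n_picked).astype(int)  # [-19,-12,-6,0,6,12,19]
    idx = np.clip(t + offsets, 0, len(features) - 1)          # edge replication
    return features[idx].reshape(-1)
```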
Could you please explain the other questions?
Actually, I found that the performance of ACAM is still not perfect. For example, in this audio, 0.5~0.7 s is the speech 'seven', but the probability is less than 0.4, which is the threshold in your MATLAB code.
Thanks, Jinhong
I'm really sorry for the late answer; these days I'm so busy :(
For the above questions:
Maybe your test speech is from Aurora, which contains short utterances sampled at 8,000 Hz.
In my experience, the VADs in this project perform rather poorly on an 8 kHz dataset, even when it is upsampled to 16 kHz. (In contrast, the VADs work well on datasets with sampling rates higher than 16 kHz.) The reason might be that the VADs are trained on a 16 kHz dataset only.
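If your data really is 8 kHz, upsampling is still worth trying, e.g. with scipy (a sketch; the file names are placeholders). Note that interpolation leaves the 4-8 kHz band empty, which is presumably why the mismatch remains:

```python
from scipy.io import wavfile
from scipy.signal import resample_poly

# Sketch: upsample 8 kHz audio to 16 kHz before feeding the VAD.
# This only interpolates; no 4-8 kHz content is recovered.
rate, x = wavfile.read('speech_8k.wav')   # placeholder file name
assert rate == 8000
x_16k = resample_poly(x, up=2, down=1)
wavfile.write('speech_16k.wav', 16000, x_16k.astype(x.dtype))
```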
A lower probability means the VAD makes its decision with low confidence.
To solve this problem, apply post-processing: https://github.com/jtkim-kaist/end-point-detection
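For illustration only (this is not the algorithm in the linked repo), a minimal "hangover" smoother that keeps the speech decision on for a few extra frames looks like:

```python
# Sketch: keep speech active for `hangover` frames after the last
# detected speech frame, bridging short gaps inside an utterance.

def apply_hangover(raw, hangover=8):
    """raw: iterable of 0/1 frame decisions; returns a smoothed list."""
    out, counter = [], 0
    for d in raw:
        if d == 1:
            counter = hangover   # re-arm the hangover timer
            out.append(1)
        elif counter > 0:
            counter -= 1         # still inside the hangover window
            out.append(1)
        else:
            out.append(0)
    return out

raw = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(apply_hangover(raw, hangover=2))  # [0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```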
For the remaining questions:
Right
That m-file may be from a legacy project of mine. (In that project, the label values are 0.1, not 1.) Skip that file.
Both bDNN and ACAM use the boosting concept, which means they output multiple frames, not just one. The cross-entropy you mentioned is used to train an ordinary classification network with a softmax output layer. For details, refer to https://ieeexplore.ieee.org/document/7347379/ and compare the DNN- and bDNN-based VADs in this project.
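As a rough sketch of the boosting idea (the offsets are placeholders, not the project's exact configuration): each window predicts labels for several frames at once, and every frame's final score averages all predictions that cover it:

```python
import numpy as np

# Sketch: aggregate overlapping multi-frame predictions.
# preds[t, k] is the window centered at frame t predicting frame t + offsets[k].

def aggregate_boosted(preds, offsets, n_frames):
    """Return one averaged score per frame from overlapping window outputs."""
    score = np.zeros(n_frames)
    count = np.zeros(n_frames)
    for t in range(n_frames):
        for k, off in enumerate(offsets):
            j = t + off
            if 0 <= j < n_frames:
                score[j] += preds[t, k]
                count[j] += 1
    return score / np.maximum(count, 1)   # avoid division by zero at edges
```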
Hi @jtkim-kaist,
Sorry to disturb you so many times. You have really helped me a lot, thank you!
I got the data from the internet, and it is 16 kHz. I will do more investigation.
I guess maybe the problem is: 1) How can I make sure the normalization is right for a totally different voice recorder (such as PDM)? I found this speech doesn't have the same mean and variance as your pre-saved ones.
2) Is it OK to normalize the feature for each NN input (the stack of several frames)? Since only 7 frames are stacked as input, computing the mean and variance from only 7 data points shouldn't be that reliable. (See the sketch below.)
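To make question 2 concrete, here is a sketch of the two options I mean (the names are mine, not from your code):

```python
import numpy as np

# Sketch: two normalization strategies for a stacked NN input.
# Statistics from only 7 frames are noisy; pre-saved corpus statistics
# are stabler but assume the training and test recorders match.

def normalize_global(x, mean, std):
    """x: (7, feat_dim) stacked frames; mean/std: pre-saved corpus stats."""
    return (x - mean) / (std + 1e-8)

def normalize_per_input(x):
    """Mean/variance computed from the 7 stacked frames themselves."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
```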
Thanks, Jinhong
To solve this kind of problem, we have to use a situation-robust feature. For example, the MRCG feature in this project normalizes the power of the speech when calculating the feature values, so it is robust to energy variation; that means the VAD performs well regardless of distance, according to my investigation.
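A toy check of that idea (not the actual MRCG code): if the waveform power is normalized before feature extraction, a change in recording gain leaves the feature unchanged:

```python
import numpy as np

# Toy demonstration: unit-power normalization before feature extraction
# makes a simple log-energy feature invariant to volume changes.

def gain_invariant_feature(x, eps=1e-8):
    x = x / (np.sqrt(np.mean(x ** 2)) + eps)            # unit-power waveform
    frames = x[: len(x) // 160 * 160].reshape(-1, 160)  # 10 ms frames @ 16 kHz
    return np.log(np.mean(frames ** 2, axis=1) + eps)   # log frame energy

x = np.random.randn(16000)
f1 = gain_invariant_feature(x)
f2 = gain_invariant_feature(0.25 * x)   # same speech, lower volume
print(np.allclose(f1, f2, atol=1e-4))   # True
```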
Hi @jtkim-kaist,
Yes, the MRCG feature stays the same if we change the volume.
Let me study this more.
Thanks for your help.
Thanks, Jinhong
Hi @jtkim-kaist,
Sorry, I have another question.
I am trying to test the performance of the different models.
I found that in test.m, vad_func.m, and graph_test.py, for the DNN and LSTM models, you use the softmax layer input to compare with the threshold (0.4) and make the decision in MATLAB.
Is that right?
I think the softmax layer output would be a better choice.
But I can't get the softmax layer output, because the softmax output is not a node in your models.
Thanks, Jinhong
No, the threshold is used only in bDNN and ACAM. Please refer to the model definitions of DNN and LSTM: their prediction is made by an argmax across the softmax dimension.
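For illustration, the two decision rules differ like this (a numpy sketch with made-up logits; since softmax is monotonic, the argmax can be taken over the logits directly):

```python
import numpy as np

# `logits` stands in for the pre-softmax output of the DNN/LSTM,
# shape (n_frames, 2): column 0 = non-speech, column 1 = speech.
logits = np.array([[2.0, 1.0], [0.2, 1.5], [3.0, 0.1]])

# DNN/LSTM in this project: argmax across the softmax dimension.
decision_argmax = np.argmax(logits, axis=1)            # [0, 1, 0]

# bDNN/ACAM-style thresholding, shown here on the softmax speech
# probability for illustration (those models threshold their own
# averaged soft outputs, not a softmax).
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
decision_thresh = (probs[:, 1] > 0.4).astype(int)      # 0.4 as in the .m code
print(decision_argmax, decision_thresh)
```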
Hi @jtkim-kaist, now I can train with my dataset. Thanks for your help. I also have some optimization questions; I hope you have time to help me, and that it won't take much of your time.
I found in your paper that the optimal training parameters of each model were found by random search.
Could you share the optimal parameters with me?
It really takes a lot of time to search for the optimal parameters.
1. The learning rate of each model.
2. The learning rate decay rate and decay frequency of each model.
3. Could you advise me on how to design early stopping in training? (See the sketch after this list.)
In the ACAM model, the current code (commented out) stops when mean_accuracy >= 0.991 (0.968 for LSTM); is this the same as what you proposed?
The other two models don't have this function.
4. How many batches (N × 4096, where 4096 is the batch size) are used before you stop training?
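For reference, what I have in mind for question 3 is a validation-based loop like this sketch (the callables and the patience value are hypothetical, not from your code):

```python
# Sketch: validation-based early stopping, as an alternative to the
# commented-out accuracy threshold (mean_accuracy >= 0.991).

def train_with_early_stopping(train_step, evaluate, save,
                              max_epochs=100, patience=5):
    """train_step/evaluate/save are caller-supplied callables."""
    best_acc, bad = 0.0, 0
    for epoch in range(max_epochs):
        train_step()                 # one epoch of training
        acc = evaluate()             # accuracy on a held-out set
        if acc > best_acc:
            best_acc, bad = acc, 0
            save()                   # keep the best model so far
        else:
            bad += 1
            if bad >= patience:      # no improvement for `patience` epochs
                break
    return best_acc
```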
Thanks, Jinhong
Hi @jtkim-kaist,
Thanks for your great project; it has helped me a lot.
I have several questions about this project. Could you help me? Thank you in advance.
1.1 In this image (TIMIT test data, run with your saved model), the output probability is still very high from 1.4 s to 1.5 s. Is that right? I think the probability should drop in this period.
1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s, the silence is >= 1 frame length, but all of those frames are labeled as speech. Is that right?
Thanks, Jinhong