jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

data and model question #13

Open meixitu opened 6 years ago

meixitu commented 6 years ago

Hi @jtkim-kaist ,

Thanks for your great project; it has helped me a lot.

I have several questions about it. Could you help me? Thank you in advance.

  1. In your prepared data, the labels are not that accurate: short silences in the middle of an utterance are labeled as speech. I measured their length, and it sometimes exceeds 100 ms. Does this degrade performance? See the two figures below.
  2. For TIMIT, there is a .phn, a .txt, and a .wrd file for each audio file. How do you label the data? Do you label the whole audio as speech, or use the .wrd file to label each word?
  3. Normalization: I see that you compute a single mean and variance for each feature over the whole dataset.
  4. In Truelabel2Trueframe.m, line 13, I don't understand why you multiply by 10; the input is 0 or 1, not 0.1.
  5. In the ACAM model, the final fully connected layer output is 7-dimensional, the same as the number of input frames; the activation is a sigmoid to get the logit, and tf.square(logit - labels) is used as the cost function. My question: if we instead make the fully connected layer output 2-dimensional, apply a softmax, and use a cross-entropy cost, with the label set to 1 if sum(labels(n:n+6)) > 3, would that work? It is very popular for classification tasks.

1.1 This image is TIMIT test data; I ran it through your saved model. From 1.4 s to 1.5 s the output probability is still very high. Is that right? I think the probability should drop in this period. image

1.2 This is your clean_speech.wav; the green line is the label. From 2.73 s to 2.83 s the silence is at least one frame long, but all of it is labeled as speech. Is that right?

image

Thanks Jinhong

meixitu commented 6 years ago

@jtkim-kaist, sorry to disturb you. Could you help me?

Thanks Jinhong

jtkim-kaist commented 6 years ago

Thank you for the detailed questions. Since there are several, let's work through them step by step.

  1. Your claim may be correct: the TIMIT labels are not perfectly accurate in the sense you describe. However, in general, VAD labels are made at the utterance level. I think this is because most VAD-related applications, such as speech recognition, operate on whole utterances. Ideally, perfect frame-level labels would be best, but producing them is almost impossible (it takes too much time). In summary, a VAD is a frame-wise classifier, so your claim is correct in the ideal case; conventionally, however, a short silence between vowel sounds (the region you pointed to in your figure 1.1) is considered speech, so the high probability is the correct result. If you want to remove such short silences, an unvoiced/voiced sound classifier is more appropriate. Note that, in general, VAD is used to detect speech at the utterance level after decision smoothing (VAD + post-processing, called end-point detection). Also, since a neural network is trained on averaged gradients computed from mini-batches, a small number of noisy samples does not affect model performance.

  2. I used the .phn files and labeled the phoneme regions as 1; a sketch of the idea is below.
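For example, something like the following; this is only a sketch of the idea, not this repo's actual code, and it assumes 16 kHz audio, a 10 ms hop (160 samples per frame), and that h#, pau, and epi count as non-speech:

```python
# Sketch: frame-level labels from a TIMIT .phn file.
# Assumptions (not from the repo): 16 kHz audio, 10 ms hop,
# and that "h#", "pau", "epi" are treated as non-speech.
SILENCE_PHONES = {"h#", "pau", "epi"}
HOP = 160  # samples per frame at 16 kHz with a 10 ms hop

def phn_to_frame_labels(phn_path, n_frames):
    labels = [0] * n_frames
    with open(phn_path) as f:
        for line in f:
            start, end, phone = line.split()  # e.g. "0 3050 h#"
            if phone in SILENCE_PHONES:
                continue
            # mark every frame inside this phoneme region as speech
            first = int(start) // HOP
            last = min(int(end) // HOP, n_frames - 1)
            for t in range(first, last + 1):
                labels[t] = 1
    return labels
```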

If questions 1 and 2 are resolved, I will answer the remaining ones. Thx

meixitu commented 6 years ago

Hi @jtkim-kaist, thank you very much for your reply.

  1. Yes, I understand. I found the performance is very good on my own test dataset with your pre-trained model. I will use the *.phn files to label the speech; it seems each file only gives start and end points, and the phonemes are continuous within the speech. I just want to reproduce your training.

Thanks Jinhong

meixitu commented 6 years ago

Hi @jtkim-kaist, I read the code in detail. The three models ACAM, DNN, and bDNN sample 7 frames out of 39 frames as the NN input, so I think it is reasonable to label these 7 frames as speech if the silence within speech is shorter than 390 ms (which should hold for most speech); my understanding of the sampling is sketched below. The LSTM model stacks 25 consecutive frames as the NN input, so if the silence within speech is shorter than 250 ms, it is still reasonable to label it as speech. So I think the labels are correct and will not degrade performance.
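For reference, this is how I understand the 7-from-39 frame selection (a minimal sketch; the offsets assume w = 19 and u = 9 as in the bDNN paper, which may not be this repo's exact values):

```python
import numpy as np

# Sketch of bDNN-style frame selection: from a 39-frame context window
# (w = 19 frames on each side of the center), pick 7 frames whose spacing
# is controlled by u = 9. The offsets below are my assumption, not code
# copied from this repo.
W, U = 19, 9
OFFSETS = [-W, -W + U, -1, 0, 1, W - U, W]  # -> [-19, -10, -1, 0, 1, 10, 19]

def select_frames(features, t):
    """features: (n_frames, feat_dim) array; t: center frame index."""
    idx = np.clip(np.array(OFFSETS) + t, 0, len(features) - 1)  # clamp edges
    return features[idx]  # shape: (7, feat_dim)
```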

Could you please explain the other questions?

Actually, I found the performance of ACAM is still not perfect. For example, when I listen to this audio, 0.5 to 0.7 seconds is the speech 'seven', but the probability there is less than 0.4, which is the threshold in your MATLAB code. image

Thanks Jinhong

jtkim-kaist commented 6 years ago

I'm really sorry for the late answer; these days I'm very busy :(.

For the above questions:

Maybe your test speech is from Aurora, which contains short utterances sampled at 8,000 Hz.

In my experience, the VADs in this project perform rather poorly on 8 kHz data, even when it is upsampled to 16 kHz. (In contrast, they work well on datasets with sampling rates of 16 kHz or higher.) The reason might be that the VADs were trained using 16 kHz data only.

The low probability means that the VAD is making its decision with low confidence.

To solve this problem,

  1. Use a lower threshold for the VAD and apply post-processing (in this case the post-processing is mandatory, because the prediction was made with low confidence); a rough sketch of such smoothing follows this list.

post-processing: https://github.com/jtkim-kaist/end-point-detection

  2. If the result is still not satisfactory, re-training on your own dataset is necessary.
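Regarding point 1, the smoothing could look roughly like this (a generic hangover-style sketch for illustration only; it is not the algorithm in the end-point-detection repo, and the frame counts are arbitrary assumptions):

```python
import numpy as np

# Sketch of generic hangover-style decision smoothing after thresholding.
# The `min_speech` and `hangover` frame counts are arbitrary assumptions.
def smooth_decisions(probs, threshold=0.3, min_speech=5, hangover=8):
    raw = (np.asarray(probs) > threshold).astype(int)
    out = raw.copy()
    run_end = -1
    for t, v in enumerate(raw):
        if v:
            run_end = t + hangover  # keep speech alive for `hangover` frames
        elif t <= run_end:
            out[t] = 1              # fill short gaps right after speech
    # drop isolated speech bursts shorter than `min_speech` frames
    t = 0
    while t < len(out):
        if out[t]:
            s = t
            while t < len(out) and out[t]:
                t += 1
            if t - s < min_speech:
                out[s:t] = 0
        else:
            t += 1
    return out
```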

For the remaining questions:

  1. Right

  2. That m-file may be from a legacy project of mine (there, the label values were 0.1, not 1). Skip that file.

  3. Both bDNN and ACAM use the boosting concept, which means they output predictions for multiple frames, not just one. The cross entropy you mentioned is used to train an ordinary classification network with a softmax output layer. For details, refer to https://ieeexplore.ieee.org/document/7347379/ and compare the DNN-based with the bDNN-based VAD in this project; a sketch of the boosting idea is below.
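Roughly, the boosting works like this (a minimal sketch of the averaging scheme from the paper, reusing the 7 offsets from the earlier sketch; the actual implementation in this repo may differ):

```python
import numpy as np

# Sketch of bDNN-style boosted prediction: the window centered at frame t
# predicts labels for all 7 selected frames, and each frame's final score
# averages every prediction it received from overlapping windows.
OFFSETS = [-19, -10, -1, 0, 1, 10, 19]

def aggregate(window_outputs, n_frames, threshold=0.4):
    """window_outputs: (n_frames, 7) sigmoid outputs; row t = window at t."""
    scores = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    for t in range(n_frames):
        for k, off in enumerate(OFFSETS):
            j = t + off
            if 0 <= j < n_frames:
                scores[j] += window_outputs[t, k]
                counts[j] += 1
    return (scores / np.maximum(counts, 1)) > threshold  # average, then threshold
```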

meixitu commented 6 years ago

Hi @jtkim-kaist ,

Sorry to disturb you so many times. You have really helped me a lot, thank you!
I got the data from the internet, and it is 16 kHz. I will investigate further.

I guess the problem may be: 1) How can we make sure the normalization suits a totally different voice recorder (such as a PDM microphone)? I found this speech doesn't have the same mean and variance as your pre-saved values.

2) Is it OK to normalize the features per NN input (the stack of several frames)? Since you only stack 7 frames as input, estimating the mean and variance from just 7 samples doesn't seem reliable.

Thanks Jinhong

jtkim-kaist commented 6 years ago
  1. I cannot be sure that my normalization factors are perfect for every situation. However, if we used a dataset large enough to represent the population mean and variance, the normalization factors would be perfect. That is almost impossible, so if your dataset's mean and variance are far from those of mine, the performance will be degraded.

To address this kind of problem, we have to use a situation-robust feature. For example, the MRCG feature in this project normalizes the power of the speech when computing the feature values, so it is robust to energy variation; this means the VAD performs well regardless of recording distance, according to my investigation.

  2. Note that the pre-saved normalization factors are the global mean and variance of my dataset, applied at test time as sketched below.
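That amounts to something like this (a minimal sketch; the file name and array keys are hypothetical, not the repo's actual format):

```python
import numpy as np

# Sketch of test-time normalization with *global* training statistics.
# The file name and keys below are hypothetical placeholders.
def load_global_stats(path="train_norm_stats.npz"):
    stats = np.load(path)
    return stats["mean"], stats["std"]  # per-feature, over the whole training set

def normalize(features, mu, sigma):
    """features: (n_frames, feat_dim); z-score with the global training
    statistics, not per-utterance or per-window estimates."""
    return (features - mu) / (sigma + 1e-8)
```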

meixitu commented 6 years ago

Hi @jtkim-kaist,

Yes, the MRCG feature stays the same if the volume changes.

Let me study it further.

Thanks for your help.

Thanks Jinhong

meixitu commented 6 years ago

Hi @jtkim-kaist, sorry, I have another question. I am trying to test the performance of the different models. In test.m, vad_func.m, and graph_test.py, I found that for the DNN and LSTM models you compare the softmax layer's input with the threshold (0.4) and make the decision in MATLAB.
Is that right? I think the softmax layer's output would be a better choice, but I can't get it, because the softmax output is not a node in your models.

Thanks Jinhong

jtkim-kaist commented 6 years ago

No, the threshold is used only in bDNN and ACAM. Please refer to the model definitions of DNN and LSTM: their prediction is made by an argmax across the softmax dimension, so no threshold is involved; see the sketch below.
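For illustration (a minimal sketch of the two decision rules; the actual tensor names in this repo differ):

```python
import numpy as np

# Sketch of the two decision rules discussed above.
def dnn_decision(logits):
    """logits: (n_frames, 2) pre-softmax outputs. The predicted class is the
    argmax, which is unchanged by applying softmax, so no threshold is used."""
    return np.argmax(logits, axis=1)

def bdnn_decision(probs, threshold=0.4):
    """probs: (n_frames,) averaged per-frame probabilities (bDNN/ACAM)."""
    return (np.asarray(probs) > threshold).astype(int)
```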

meixitu commented 6 years ago

Hi @jtkim-kaist, I can now train with my dataset, thanks for your help. I also have some questions about the optimal training settings; I hope you have time to help me and that it won't take too much of your time.

I saw in your paper that the optimal training parameters of each model were found by random search. Could you share those optimal parameters with me? It really takes a lot of time to search for them.
1. The learning rate of each model.
2. The learning-rate decay rate and decay frequency of each model.
3. Could you explain how you designed the early stopping in training? In the ACAM model, the current (commented-out) code stops when mean_accuracy >= 0.991 (0.968 for the LSTM); is this what you proposed? The other two models don't have this function. A sketch of how I read that criterion is below the image.
4. How many batches of 4096 (the batch size) were used before you stopped training?

image
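For reference, this is how I read the commented-out criterion (a minimal sketch; train_step and eval_fn are hypothetical stand-ins for the real training and validation code):

```python
# Sketch of the early-stop criterion as I understand it: halt once the
# validation mean accuracy reaches a target (0.991 for ACAM, 0.968 for LSTM).
def train_with_early_stop(train_step, eval_fn, target_acc=0.991,
                          max_steps=100_000, eval_every=100):
    for step in range(max_steps):
        train_step()  # one mini-batch update (hypothetical callable)
        if step % eval_every == 0 and eval_fn() >= target_acc:
            return step  # stop as soon as the target accuracy is reached
    return max_steps
```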

Thanks Jinhong