hcmlab / vadnet

Real-time Voice Activity Detection in Noisy Environments using Deep Neural Networks
http://openssi.net
GNU Lesser General Public License v3.0

Make a decision each X second or ms #7

Open anavc94 opened 5 years ago

anavc94 commented 5 years ago

Hello again,

After some days off, I have returned to this project and I have some new questions. I have noticed that the neural network makes a classification decision for every 1 second of audio, so I was wondering if I could decrease this interval to a smaller value, for example, classifying every 250 ms of audio as Noise/Voice to get better "precision" when discriminating between these classes.

The script I am using is vad_extract.py. I know the input layer receives the samples for each second, makes the prediction, and then stores the probabilities in the labels variable. So, as I understand it, the approach would be to change the size of the input and final layers to receive samples of, for example, 250 ms, make a prediction, and store the probabilities for each 250 ms unit in labels. Am I on the right track? Since you did it per second, do you think discriminating noise/voice in smaller audio units is a good idea?
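Just to make sure I'm describing it right, here is a minimal sketch of what I have in mind (assuming 48 kHz audio; predict is just a placeholder for the network, not vadnet's actual API):

```python
import numpy as np

SAMPLE_RATE = 48000                 # assumed sample rate
CHUNK = SAMPLE_RATE * 250 // 1000   # 250 ms -> 12000 samples

def classify_chunks(audio, predict):
    """Split a 1-D signal into 250 ms chunks and collect per-chunk
    noise/voice probabilities; `predict` stands in for the network."""
    n = len(audio) // CHUNK
    chunks = audio[:n * CHUNK].reshape(n, CHUNK)
    return np.stack([predict(c) for c in chunks])  # shape (n, n_classes)
```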

Btw, I don't know if 'issues' is the best place for this kind of question because it's not actually a problem (in fact, your project was really friendly to install and use). I'm sorry if you think it doesn't belong here.

Regards, Ana

frankenjoe commented 5 years ago

Hi Ana,

you will have to retrain the model using a smaller window size. To do so, go to the train folder and change the variables FRAME and STEP accordingly. Note that the window size is FRAME + STEP. Save the file and run do_all.cmd. You may also have to adjust the network in case the new window size is too small for it to process. In that case, try to reduce the stride during convolution or pooling (CONV_STRIDE and CONV_POOL_STRIDE). Frankly, I don't know whether that will decrease or increase the performance of the noise detection, so I would be happy if you shared your results in this thread.
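As a rough illustration of why a too-small window becomes a problem, here is some back-of-the-envelope arithmetic (illustrative only: the 7 layers and the stride scheme are assumptions, not the exact architecture):

```python
def output_length(n_samples, n_layers, conv_stride, pool_stride):
    """How the temporal dimension shrinks through strided conv + pooling
    (simplified: padding and kernel sizes ignored)."""
    n = n_samples
    for _ in range(n_layers):
        n = max(1, n // conv_stride)   # strided convolution
        n = max(1, n // pool_stride)   # strided pooling
    return n

# A 1 s window (48000 samples) leaves some room to spare, a 250 ms window
# (12000 samples) much less -- which is when reducing CONV_STRIDE or
# CONV_POOL_STRIDE helps keep the output non-degenerate.
print(output_length(48000, 7, 2, 2))  # 2
print(output_length(12000, 7, 2, 2))  # 1
```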

All the best, Johannes.

nleguillarme commented 5 years ago

Hi everybody, I am currently trying to train a model (Conv3Rnn with 2 RNN layers) to classify small frames (40 ms with 50% overlap), with no success so far (accuracy stuck between 70 and 80% on the training set, about 70% on the validation file, precision and recall oscillating). Is there any known class-imbalance problem with this dataset? Concerning the FRAME and STEP variables, does this mean that if I want to process 20 ms frames, I must set FRAMES to 480 (10 ms) and STEP to 480?

frankenjoe commented 5 years ago

Did you already try to balance the data by running training with the --balance Up or --balance Down option?

Regarding your second question: FRAMES is the size of the processing window, i.e. to process chunks of 20 ms, set FRAMES = 960 (assuming the sample rate is 48 kHz). STEP, on the other hand, defines how much we move the window to extract the next frame. If we set STEP = 960, for instance, there will be no overlap between successive frames (the default). But if we set STEP = 240, for instance, we get a 75% overlap.
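In code, assuming a sample rate of 48 kHz:

```python
SAMPLE_RATE = 48000  # assumed sample rate

def ms_to_samples(ms, sr=SAMPLE_RATE):
    return sr * ms // 1000

FRAMES = ms_to_samples(20)  # 960 samples -> the processing window
STEP = ms_to_samples(5)     # 240 samples -> hop to the next window

overlap = 1 - STEP / FRAMES
print(FRAMES, STEP, f"{overlap:.0%}")  # 960 240 75%
```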

nleguillarme commented 5 years ago

I did not try the balance option. After examining the data, it seems that the ratio varies between 50:50 and 75:25, which does not seem that imbalanced to me.

I'm sorry, I do not really understand your last point. In the file do_train.cmd, you set FRAMES to 48000 and STEP to 24000, and yet the resulting network classifies 1-second frames, not 1.5-second frames? Similarly, if I set FRAMES to 960 and STEP to 480, the sample_from_file function returns an array of frames of shape (nbFrames, 960), not (nbFrames, 960+480)...

frankenjoe commented 5 years ago

Sorry, my bad. FRAMES defines the processing window and STEP how much the window is moved. I corrected the answer accordingly. Thank you for pointing out my mistake.

nleguillarme commented 5 years ago

You are welcome! Thank you for the lib and the support.

anavc94 commented 5 years ago

Hi everybody,

I'll share my results, just in case you find them useful. I have trained the network following the instructions of @frankenjoe in order to make predictions every 250 ms and every 500 ms. The main idea was to study how it performs when trying to detect short frames of speech, for example, when someone quickly says "¡No!" or the like, because the initial network misses them. With a smaller window size it can detect them, at least in the kind of content I am testing on (a TV series), but the output segments are less "stable". I mean, for example, when some kind of noise appears in the audio, the initial network can successfully predict the initial and final seconds (let's call them X and Y), but with smaller window sizes the predicted label varies a lot within the same X-to-Y interval.

Regards, Ana

nleguillarme commented 5 years ago

Hi Ana, thank you for sharing your results. Did you train a Conv7Rnn or another architecture? Do you mind sharing your config file? For my part, I am still not successful with the Conv2RNN + 2RNN architectures on 45 ms frames...

anavc94 commented 5 years ago

Hello @nleguillarme

Yes, I'm training the Conv7Rnn. I suspect something went wrong while training, because the loss doesn't decrease and the accuracy doesn't increase, and I expected the predicted labels to be stable. I kept the 5 epochs of the original do_train.cmd thinking they would be enough; maybe I should run more iterations. By the config file, do you mean do_train.cmd?

nleguillarme commented 5 years ago

Yes, but don't bother if you didn't make any changes to the original file. Same thing for me here: the accuracy and the loss remain quite stable, no matter the number of training epochs (I reached about 15 epochs with the Conv2RNN architecture).

anavc94 commented 5 years ago

No problem @nleguillarme, here's my config file for 500 ms:

```bat
@echo off

SET RETRAIN=True
SET ROOT=data
SET OUTPUT=nets

SET LEARNING_RATE=0.0001
SET N_EPOCHS=5
SET N_BATCH=512
SET SAMPLE_RATE=48000
SET FRAME=24000
SET STEP=12000

SET NETWORK=network.crnn.Conv7Rnn
SET CONV_STRIDE=2
SET CONV_POOL_STRIDE=1
SET CONV_POOL_SIZE=4
SET RNN_APPLY=False
SET RNN_LAYERS=2
SET RNN_UNITS=128

REM SET EVAL_AUDIO=eval.wav
REM SET EVAL_ANNO=eval.annotation
REM SET EVAL_THRES=0
REM SET LOG_FILENAME=True

python code\main.py --source source.audio_vad_files.AudioVadFiles --model model.model.Model --trainer trainer.adam.SceAdam --retrain %RETRAIN% --sample_rate=%SAMPLE_RATE% --n_frame %FRAME% --n_step %STEP% --files_root %ROOT% --files_filter *.info --files_audio_ext .m4a --files_anno_ext .voiceactivity.annotation --output_dir=%OUTPUT% --learning_rate %LEARNING_RATE% --n_batch %N_BATCH% --network %NETWORK% --conv_stride %CONV_STRIDE% --conv_pool_stride %CONV_POOL_STRIDE% --conv_pool_size %CONV_POOL_SIZE% --rnn_apply %RNN_APPLY% --n_rnn_layers %RNN_LAYERS% --n_rnn_units %RNN_UNITS% --n_epochs %N_EPOCHS%
REM --eval_audio_file %EVAL_AUDIO% --eval_anno_file %EVAL_ANNO% --eval_blacklist_thres %EVAL_THRES% --log_filename %LOG_FILENAME%

pause
```

I changed FRAME and STEP, and set CONV_POOL_STRIDE from 2 to 1.

Thanks for sharing results, will update with any news!

anavc94 commented 5 years ago

Hello,

related to my suspicion that I haven't trained the network for enough epochs to perform smaller-window detections, I wonder how I can continue training from a previous checkpoint. I thought I just had to change the param RETRAIN from True to False, but the model starts a complete new training anyway. Is it necessary to do something else?

Thanks again, Ana

frankenjoe commented 5 years ago

In fact, it's the other way round: if RETRAIN is True the system will try to continue from the last checkpoint, otherwise training starts from scratch. However, I admit the naming is not particularly fortunate :-(
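Conceptually, the logic looks like this (a TF 1.x-style sketch with illustrative names, not the actual vadnet code):

```python
import tensorflow as tf

def init_or_restore(sess, saver, checkpoint_dir, retrain):
    """Hypothetical helper mirroring the RETRAIN semantics described above."""
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    if retrain and ckpt is not None:
        saver.restore(sess, ckpt)  # RETRAIN=True: continue from last checkpoint
    else:
        sess.run(tf.global_variables_initializer())  # RETRAIN=False: from scratch
```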

anavc94 commented 5 years ago

Thanks @frankenjoe, I find the name a bit confusing haha.

On the other hand, I am starting to get better results with 250 ms frames. My problem was that I didn't train the network for enough epochs, as I thought the "default" 5 were enough. Definitely not a good practice!

Another thing I find interesting is the way vadnet classifies music plus a singer's voice. If I am not wrong, music is 'noise' in the training set, but the predicted label for music + singing voice is not always 'noise'. I just find it interesting, and maybe important for some applications.

Have a nice day! Ana

nleguillarme commented 5 years ago

Hi @frankenjoe, I have a question concerning the RNN layers: what is considered a sequence in your implementation? My guess is that the RNN internal states are reset after each audio frame, am I right?

frankenjoe commented 5 years ago

@nleguillarme: yes that is correct
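In other words, every frame is fed as its own sequence, so no state is carried over between frames. A TF 1.x-style sketch of that behavior (shapes and layer sizes are illustrative):

```python
import tensorflow as tf

# inputs: (batch, time_steps, features) -- one audio frame per batch element.
cell = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.GRUCell(128) for _ in range(2)])
inputs = tf.placeholder(tf.float32, [None, 10, 64])

# No initial_state is passed, so dynamic_rnn starts from zeros on every
# call: each frame is its own sequence, states are reset between frames.
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
```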

awgu commented 5 years ago

Hey @anavc94, in the end, how were your results with the 500 ms and the 250 ms frame sizes? If they worked well, and if you do not mind, can you share the trained model?

anavc94 commented 5 years ago

> Hey @anavc94, in the end, how were your results with the 500 ms and the 250 ms frame sizes? If they worked well, and if you do not mind, can you share the trained model?

Hello @awgu! I could improve my results with 500 ms and 250 ms frame sizes, but they're not as good as what I could get with a 1 s frame size. I have not worked on this project since my last update.

I wouldn't mind sharing my models with you all, but unfortunately I am not allowed to, I am sorry. However, if you follow the steps I did, I am pretty sure you will get good results soon; it didn't take me that long and I am a beginner :p

Regards! Ana

santichialvo commented 4 years ago

Hi @anavc94, how are you?

Do you think it's possible to work with a combination of both networks? For example, first run the network that works with a 1-second frame size, then analyze the second after or before the detected intervals to achieve 250/500 ms accuracy.
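A hypothetical sketch of what I mean, where predict_250ms stands for the finer network and segments come from the 1-second pass:

```python
def refine_boundaries(audio, segments, predict_250ms, sr=48000):
    """segments: (start_s, end_s) pairs in seconds from the 1 s network.
    Re-classify the second adjacent to each boundary in 250 ms steps."""
    hop = sr // 4                          # 250 ms in samples
    refined = []
    for start_s, end_s in segments:
        for second in (max(0, start_s - 1), end_s):
            base = int(second * sr)
            for i in range(4):
                sub = audio[base + i * hop: base + (i + 1) * hop]
                if len(sub) == hop:        # skip incomplete chunks at the end
                    refined.append((second + i * 0.25, predict_250ms(sub)))
    return refined
```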

Regards, Santiago.