jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN-, bDNN-, LSTM-, and ACAM-based VAD. We also provide our directly recorded dataset.
834 stars · 232 forks

How do the results have to be interpreted #8

Open vladfulgeanu opened 6 years ago

vladfulgeanu commented 6 years ago

Hello!

I tried to use the python implementation to detect voice for the first 100s from this video: https://www.youtube.com/watch?v=gYdHyeo0eec

And these are the results on the spectrogram: screenshot

First of all, why are there positive results during the first 47 seconds? Is it just that the model wasn't trained to disregard music? And secondly, is there a way to merge the results together whenever voice is detected, so that there aren't intervals of just fractions of a second one after another?

Thanks very much in advance!

jtkim-kaist commented 6 years ago

Actually, the training set of our VAD doesn't contain music. The result you uploaded could be improved by applying some post-processing, but we only have a MATLAB version of the post-processing. Anyway, the MATLAB script is quite simple, so you can easily port it to Python (if you are comfortable with numpy). If you need it, I will share that post-processing script with you.
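The repo's MATLAB post-processing script isn't shown in this thread, but a typical hang-over-style smoothing of frame-level VAD decisions can be sketched in numpy as below. The function name, thresholds, and logic here are illustrative assumptions, not the project's actual script: it merges speech segments separated by short non-speech gaps, then drops very short speech bursts — which addresses the "fractions of a second" intervals mentioned above.

```python
import numpy as np

def smooth_vad(decisions, min_gap=5, min_speech=3):
    """Hang-over-style smoothing of binary frame-level VAD decisions.

    decisions : 1-D array-like of 0/1 frame labels (1 = speech)
    min_gap   : non-speech gaps shorter than this (in frames) are
                merged into the surrounding speech
    min_speech: speech segments shorter than this (in frames) are
                discarded as spurious detections
    """
    d = np.asarray(decisions, dtype=int).copy()

    def runs(x):
        # Start (inclusive) and end (exclusive) indices of 1-runs.
        padded = np.concatenate(([0], x, [0]))
        diff = np.diff(padded)
        return np.where(diff == 1)[0], np.where(diff == -1)[0]

    # 1) Fill short gaps between consecutive speech segments.
    starts, ends = runs(d)
    for i in range(len(starts) - 1):
        if starts[i + 1] - ends[i] < min_gap:
            d[ends[i]:starts[i + 1]] = 1

    # 2) Remove speech segments that are still too short.
    starts, ends = runs(d)
    for s, e in zip(starts, ends):
        if e - s < min_speech:
            d[s:e] = 0
    return d
```

With `min_gap=3` and `min_speech=2`, `[1,1,1,0,0,1,1,1,0,0,0,0,0,0,1]` becomes one solid 8-frame speech segment followed by silence: the 2-frame gap is bridged and the isolated final frame is dropped.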

vladfulgeanu commented 6 years ago

@jtkim-kaist Is there an estimated date for a full python implementation (training & testing - with post processing)?

jtkim-kaist commented 6 years ago

@vladfulgeanu We uploaded the end-point detection (EPD) algorithm to https://github.com/jtkim-kaist/end-point-detection (it may be what you want, because EPD finds the start and end points of the speech signal).

However, the VAD used in the EPD project is based on a shallow CNN, so its performance may be worse than this project's. Also, the hyperparameters may need to be changed for your application.

The work of joining this project and EPD will be done someday, but it is hard to give an exact date.