jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

Question: Why was MRCG selected as input feature? #21

Closed seungwonpark closed 5 years ago

seungwonpark commented 5 years ago

Hi, recently I've been looking for deep-learning based VAD models and some googling brought me here. Thanks for open-sourcing your model! :)

My question is: why was MRCG used as an input feature?

To the best of my knowledge, STFT-based mel-spectrograms (or linear-scale magnitudes, whatever) have been widely used as input features for recent deep-learning based acoustic models. Are there any strengths that MRCG has for VAD models, compared to other acoustic features like mel-spectrograms?

jtkim-kaist commented 5 years ago

Thank you for your interest!

The reasons are:

  1. MRCG is known as a state-of-the-art feature for VAD, since it incorporates a multi-context cochleagram [1]. In any case, we wanted to compare our proposed neural network architecture with the bDNN proposed in [1], which uses MRCG features for VAD, so we used the same features (MRCG) as in [1].
  2. Furthermore, one strength of MRCG that we found is its robustness to distance variation, since MRCG is a power-normalized feature; we therefore expect MRCG to be a good choice for far-field VAD. However, the computational cost of MRCG is too high, so we do not use it in real applications.
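To make the multi-context idea concrete: a proper MRCG stacks 64-channel gammatone cochleagrams computed at two frame lengths together with locally smoothed versions of them. The numpy-only sketch below is a loose illustration, not the real MRCG pipeline: it substitutes a log-STFT magnitude for the cochleagram, and the `ctx` window sizes are arbitrary placeholders. All function names here are hypothetical.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=160):
    # Magnitude spectrogram via a simple Hann-windowed STFT (numpy only).
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, n_fft // 2 + 1)

def smooth(feat, ctx):
    # Average each frame with `ctx` neighbours on each side -- the
    # "multi-context" smoothing that gives MRCG its multiple resolutions.
    pad = np.pad(feat, ((ctx, ctx), (0, 0)), mode="edge")
    return np.stack([pad[i:i + 2 * ctx + 1].mean(axis=0)
                     for i in range(feat.shape[0])])

def mrcg_like(x):
    # Stack the base feature with two progressively smoother copies,
    # mimicking MRCG's concatenation of multi-resolution cochleagrams.
    base = np.log(stft_mag(x) + 1e-8)
    return np.concatenate([base, smooth(base, 5), smooth(base, 11)], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)  # 1 s of noise at 16 kHz, a stand-in for speech
feats = mrcg_like(x)
print(feats.shape)  # (97, 771): 97 frames, 3 x 257 stacked feature channels
```

The repeated framing and smoothing passes also hint at why the cost concern above is real: the feature dimension triples (or more) and every extra resolution adds another pass over the signal.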

REF

[1] Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.2 (2016): 252-264.

seungwonpark commented 5 years ago

Thank you very much for sharing your insights! I'll also refer to that reference. Shall we close the issue? Surely, I won't mind if you leave it open.