jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

data format #6

Open pankaj2701 opened 6 years ago

pankaj2701 commented 6 years ago

I have not been able to understand how the training data should be specified, in particular how the labels should be written. Do we need to specify the times at which the labels occur in the sound file? If so, how and where?

jtkim-kaist commented 6 years ago

Sorry for the lack of a detailed description of the training dataset format.

It is very simple, though. You can find the data format by inspecting the data in /data/raw/train or /data/raw/valid.

The speech data should be a .wav file with a sampling rate of 16 kHz, and the label must be a .mat file holding a 1-dimensional array whose values are just 1 (speech) or 0 (non-speech). To see this directly, please open the sample training data in /data/raw/train.

Thx!
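For readers who want to check their own files against the format described above, here is a minimal sketch, not code from the repository. The function names `wav_info`, `load_label`, and `check_pair` are hypothetical, and `var_name` (the variable name stored inside the .mat file) is an assumption: inspect the sample files to find the real one. `load_label` needs SciPy; the other two functions use only the standard library.

```python
import wave

EXPECTED_RATE = 16000  # 16 kHz, as stated in the thread


def wav_info(path):
    """Return (sampling_rate, samples_per_channel) for a PCM .wav file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnframes()


def load_label(mat_path, var_name):
    """Load the label vector from a .mat file as a flat list.

    var_name is an assumption: the thread does not say which variable
    name the sample .mat files use.
    """
    from scipy.io import loadmat  # third-party, imported only when needed
    return loadmat(mat_path)[var_name].ravel().tolist()


def check_pair(sample_rate, num_samples, label):
    """Return a list of format problems (an empty list means none found)."""
    problems = []
    if sample_rate != EXPECTED_RATE:
        problems.append(f"sampling rate is {sample_rate}, expected {EXPECTED_RATE}")
    if len(label) != num_samples:
        problems.append(f"label has {len(label)} values, wav has {num_samples} samples")
    bad_values = sorted({v for v in label if v not in (0, 1)})
    if bad_values:
        problems.append(f"label contains values other than 0/1: {bad_values}")
    return problems
```

Typical use would be `rate, n = wav_info("clip.wav")` followed by `check_pair(rate, n, load_label("clip.mat", var_name))`, where both file names are placeholders.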

pankaj2701 commented 6 years ago

Thanks for the quick reply. I still have one question: when marking the labels, do we have to count overlapping frames or non-overlapping frames?

jtkim-kaist commented 6 years ago

You don't have to apply any framing to the label. The label only needs to be sample-based.

For example, if the speech signal has 10,000 samples, the label should also have 10,000 values.

Please download our sample wav and label files and verify this.

Thx!
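If your own annotations are speech intervals given in seconds, one way to produce such a sample-based label is sketched below. This is an illustration under assumptions, not the repository's tooling: `segments_to_label` and `save_label` are hypothetical names, the (start, end) interval format is assumed, and both `var_name` and the `uint8` dtype in `save_label` are guesses that should be checked against the sample .mat files.

```python
def segments_to_label(segments, num_samples, sample_rate=16000):
    """Build a per-sample 0/1 label from (start_sec, end_sec) speech intervals.

    The result has exactly num_samples values, one per audio sample,
    so no framing is involved.
    """
    label = [0] * num_samples
    for start_sec, end_sec in segments:
        # Convert seconds to sample indices and clamp to the signal length.
        start = max(0, int(round(start_sec * sample_rate)))
        end = min(num_samples, int(round(end_sec * sample_rate)))
        if end > start:
            label[start:end] = [1] * (end - start)
    return label


def save_label(mat_path, label, var_name):
    """Write the label to a .mat file (requires NumPy and SciPy).

    var_name and the uint8 dtype are assumptions; match the sample files.
    """
    import numpy as np
    from scipy.io import savemat
    savemat(mat_path, {var_name: np.asarray(label, dtype=np.uint8)})


# A 10,000-sample signal at 16 kHz with speech from 0.1 s to 0.3 s.
example = segments_to_label([(0.1, 0.3)], num_samples=10000)
print(len(example), sum(example))
```

The example label has 10,000 values, the same length as the signal, with ones only inside the annotated interval.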

pankaj2701 commented 6 years ago

The reason I am asking is that I want to train it on my own data, so I need to know how to prepare the training data.

I looked at the sample files, but it is not very clear how the samples have been labeled. Some samples have a value of 0 and some have a value of 1. I guess a value of 1 means that the corresponding sample is a speech sample, but I have not been able to visually correlate the sample numbers with the waveforms.

jtkim-kaist commented 6 years ago

Your guess is correct: 1 corresponds to speech and 0 corresponds to non-speech. The plot looks like this:

[Image: untitled]

Note that if the speech data is noisy, it is hard to distinguish speech from non-speech visually in the 1-D signal domain.
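When the waveform alone is hard to read, it may help to turn the label back into time intervals and listen to, or zoom in on, those regions. The sketch below assumes a flat 0/1 label at 16 kHz; `label_to_segments` is a hypothetical helper, not part of the toolkit.

```python
def label_to_segments(label, sample_rate=16000):
    """Convert a per-sample 0/1 label into (start_sec, end_sec) speech intervals."""
    segments = []
    start = None  # index where the current run of 1s began, if any
    for i, value in enumerate(label):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:
        # The label ends while still inside a speech run.
        segments.append((start / sample_rate, len(label) / sample_rate))
    return segments


# Toy example with sample_rate=1 so that indices equal seconds.
print(label_to_segments([0, 1, 1, 0, 1], sample_rate=1))
```

Each returned pair marks where a run of ones starts and ends, which can then be compared with the same time span in an audio editor.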

Chenny0808 commented 5 years ago

Hello, I have the same problem as you. (1) Do you now know how to format the training data? (2) I don't understand how a label in a .mat file corresponds to the .wav file.