microsoft / ELL

Embedded Learning Library
https://microsoft.github.io/ELL

Accuracy during test time is 10% #238

Open minghau opened 5 years ago

minghau commented 5 years ago

Hello, I followed all the instructions (setting aside the fact that the tutorials are not up to date) and trained a GRU model from scratch for 30 and then 150 epochs; however, during the test phase the accuracy is only around 10%.

Sample output is as follows:

FAILED 10.09%: expected stop, got bird in stop/e1469561_nohash_3.wav
FAILED 10.08%: expected bed, got up in bed/cc592808_nohash_3.wav
FAILED 10.07%: expected left, got up in left/d5ca80c6_nohash_0.wav
FAILED 10.05%: expected wow, got nine in wow/bed06fac_nohash_1.wav
FAILED 10.04%: expected cat, got up in cat/b97c9f77_nohash_0.wav

During training the accuracy is good:

Epoch 10, Loss 0.576, Validation Accuracy 80.646, Learning Rate 0.001
Epoch 11, Loss 0.501, Validation Accuracy 80.557, Learning Rate 0.001
Epoch 12, Loss 0.685, Validation Accuracy 85.495, Learning Rate 0.001
Epoch 13, Loss 0.532, Validation Accuracy 86.896, Learning Rate 0.001
Epoch 14, Loss 0.358, Validation Accuracy 86.483, Learning Rate 0.001
Epoch 15, Loss 0.470, Validation Accuracy 84.522, Learning Rate 0.001
Epoch 16, Loss 0.334, Validation Accuracy 86.483, Learning Rate 0.001
Epoch 17, Loss 0.424, Validation Accuracy 87.367, Learning Rate 0.001
Epoch 18, Loss 0.266, Validation Accuracy 83.550, Learning Rate 0.001

The script I used is:

# train
python3 tools/utilities/pythonlibs/audio/training/train_classifier.py --architecture GRU --use_gpu --outdir=out --dataset=audio --epochs=150

# import
python3  $ELL_ROOT/tools/importers/onnx/onnx_import.py out/GRU128KeywordSpotter.onnx

# generate ell
python3 tools/wrap/wrap.py --model_file out/GRU128KeywordSpotter.ell --outdir KeywordSpotter --module_name model

# compile and make spotter
cd KeywordSpotter
mkdir build
cd build
cmake ..
make
cd ..
cd ..

#python3 tools/utilities/pythonlibs/audio/training/train_classifier.py --architecture GRU --use_gpu --outdir=out --dataset=audio --epochs=150

# test
./build/tools/utilities/pythonlibs/audio/training/test_ell_model.py --classifier KeywordSpotter/model --featurizer compiled_featurizer/mfcc --sample_rate 1600$

Thanks

lovettchris commented 5 years ago

Thanks for the bug report. A difference in accuracy like this can be the result of not using the exact same featurizer at runtime that you used to featurize the training data. Can you include the command lines you used with "make_featurizer" and "make_dataset"? It also looks like your test_ell_model.py command line got truncated in your output above. I just ran my own test with the attached scripts and it seems to work fine, with a final pass rate of 92.27%.

train.zip
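To see why a featurizer mismatch collapses accuracy, here is a toy sketch in plain NumPy (not ELL code; the FFT sizes and the log-power features are illustrative stand-ins): featurizing the same audio frame with two different configurations yields incompatible feature vectors, so a classifier trained on one cannot meaningfully score the other.

```python
import numpy as np

# Toy illustration (plain NumPy, not ELL code): featurizing the same 25 ms
# audio frame with two different FFT sizes yields feature vectors of
# different lengths, so a classifier trained on one cannot score the other.
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # 440 Hz tone, 16 kHz

def featurize(frame, nfft):
    # log power spectrum as a stand-in for the real mel filterbank features
    return np.log1p(np.abs(np.fft.rfft(frame, n=nfft)) ** 2)

f_train = featurize(frame, nfft=512)  # what the model was trained on
f_test = featurize(frame, nfft=256)   # what a mismatched runtime might produce
print(f_train.shape, f_test.shape)    # (257,) vs. (129,): incompatible inputs
```

Even when the shapes happen to match, differing window sizes, filterbanks, or scaling silently shift every feature value, which is enough to drop a keyword spotter to near-chance accuracy.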

minghau commented 5 years ago

Thank you so much for your prompt reply. Indeed, there were probably issues with the folders I used, since I compiled several times. Anyway, for future reference, this is the script for Linux:

python tools/utilities/pythonlibs/audio/training/make_featurizer.py --sample_rate 16000 --window_size 400 --input_buffer_size 160 --nfft 512 --filterbank_type mel --filterbank_$

python tools/wrap/wrap.py --model_file featurizer.ell --module_name mfcc --outdir compiled_featurizer

python tools/utilities/pythonlibs/audio/training/make_dataset.py --outdir compiled_featurizer --categories categories.txt --featurizer compiled_featurizer/mfcc --window_size 10$
python tools/utilities/pythonlibs/audio/training/make_dataset.py --outdir compiled_featurizer --categories categories.txt --featurizer compiled_featurizer/mfcc --window_size 10$
python tools/utilities/pythonlibs/audio/training/make_dataset.py --outdir compiled_featurizer --categories categories.txt --featurizer compiled_featurizer/mfcc --window_size 10$

python tools/utilities/pythonlibs/audio/training/train_classifier.py --architecture GRU --epochs 30 --num_layers 2 --hidden_units 128 --use_gpu --dataset compiled_featurizer --$

python tools/importers/onnx/onnx_import.py GRU128KeywordSpotter.onnx

python tools/wrap/wrap.py --model_file GRU128KeywordSpotter.ell --outdir KeywordSpotter --module_name model

python tools/utilities/pythonlibs/audio/training/test_ell_model.py --featurizer compiled_featurizer/mfcc --classifier KeywordSpotter/model  --list_file audio/testing_list.txt -$

BTW, any suggestions regarding training for Chinese? I saw that the number of audio samples per keyword is approximately 2000. Is there anything else I need to do besides collecting recordings of 30 words in Chinese?

Thanks !

lovettchris commented 5 years ago

It shouldn't matter which language the keywords are spoken in; you could probably even do multiple languages if you had enough recordings, but of course put each language in its own folder and don't mix them. But yes, the key to any deep neural network is lots and lots of clean data. You can increase the size of the dataset by mixing in low-volume random background noise. make_dataset.py has some options along these lines. Happy training!
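The noise-mixing augmentation mentioned above can be sketched in a few lines of NumPy. The `mix_background` helper below is hypothetical and only illustrates the idea; make_dataset.py's actual options may work differently.

```python
import numpy as np

def mix_background(sample, noise, snr_db=20.0, rng=None):
    """Mix a random slice of background noise into a speech sample at
    roughly the given speech-to-noise ratio (hypothetical helper; the
    real make_dataset.py options may work differently)."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(noise) - len(sample) + 1)
    chunk = noise[start:start + len(sample)]
    p_speech = np.mean(sample ** 2)
    p_noise = np.mean(chunk ** 2) + 1e-12  # avoid division by zero
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return sample + gain * chunk

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)       # 1 s of stand-in "speech" at 16 kHz
noise = 0.1 * rng.standard_normal(48000)  # 3 s of stand-in background noise
augmented = mix_background(speech, noise, snr_db=20.0, rng=rng)
```

Each pass with a different noise slice and SNR yields a new training example from the same recording, which is a cheap way to multiply a small keyword dataset.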

lovettchris commented 3 years ago

Also, your call to test_ell_model.py might be missing the --auto_scale option, depending on whether you specified it in the call to make_featurizer.
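For reference, a common convention (an assumption here; worth checking against the ELL source) is that auto-scaling converts raw 16-bit PCM samples to floats in [-1, 1]. If the featurizer was built expecting one range and the test script feeds the other, every feature is off by a factor of ~32768, which alone can explain near-chance accuracy:

```python
import numpy as np

# Assumed behavior of auto-scaling: divide 16-bit PCM by 32768 so samples
# land in [-1, 1]. If training features used scaled audio but test time
# feeds raw int16 values (or vice versa), the features no longer match.
pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
scaled = pcm.astype(np.float32) / 32768.0
print(scaled.min(), scaled.max())  # values now lie in [-1, 1]
```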