noahchalifour / rnnt-speech-recognition

End-to-end speech recognition using RNN Transducers in Tensorflow 2.0
MIT License
242 stars 79 forks

Inference problem with small dataset #23

Closed NAM-hj closed 4 years ago

NAM-hj commented 4 years ago

Thank you for your nice code.

I tried to train the model with a very small dataset to check feasibility. I reduced the dataset to only 4 sentences (common_voice_en_19664034, 19664035, 19664037, 19664038.wav).

The loss decreases and WER and CER go to zero:

  1. training loss decreased to 0.1945
  2. WER and CER went to zero

After training the model, I ran inference on the same wav files with transcribe_file.py: python transcribe_file.py --checkpoint ~~.hdf5 --i ~~.wav

But I only get wrong answers.

  1. When I use ~19664035, ~19664037, or ~19664038.wav: it returns the transcription of ~34.wav
  2. When I use ~19664034.wav: it returns the transcription of ~38.wav

Can you give me some advice? Why can't I get proper outputs from a model that is well trained on such a small dataset?

Thank you

ybNo1 commented 4 years ago

I also have this problem, with a different training dataset. When training on a small dataset for an experiment, I got a loss of about 1.023 with an accuracy of about 0.94. When using utils/decoding.py to decode wavs, I found that the decoding process always reaches the end too early. For example, when the total length of the encoder output is about 120, the process always stops at time step 10~20, and at every step the predicted output is usually not blank (''), so the output string grows at every time step.

The correct decoded output should look like this: I blank blank blank am blank blank blank a blank blank blank student, but my decoded output looks like this: I am a blank student.

I think the blank output is not working as expected for splitting the outputs and aligning them with the wav, and the output probability is mainly decided by the prediction_network.
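To make the expected behaviour concrete, here is a minimal sketch of greedy RNN-T decoding (this is not the repository's utils/decoding.py; pred_net and joint_net are hypothetical placeholders for the prediction and joint networks, assumed to return logits over the vocabulary):

```python
import numpy as np

# Hypothetical greedy RNN-T decoding sketch. A blank prediction advances the
# encoder time step; a non-blank prediction is appended to the hypothesis and
# fed back into the prediction network at the same time step.
BLANK = 0

def greedy_decode(encoder_outputs, pred_net, joint_net, max_symbols_per_step=3):
    hypothesis = []
    pred_state = None
    pred_out, pred_state = pred_net(prev_token=None, state=pred_state)  # "start" step
    for enc_t in encoder_outputs:                 # visit every encoder time step
        symbols_added = 0
        while symbols_added < max_symbols_per_step:
            logits = joint_net(enc_t, pred_out)
            token = int(np.argmax(logits))
            if token == BLANK:
                break                             # blank: move on to the next time step
            hypothesis.append(token)              # non-blank: emit symbol, stay on enc_t
            pred_out, pred_state = pred_net(prev_token=token, state=pred_state)
            symbols_added += 1
    return hypothesis
```

In this kind of loop the blank symbol is what moves decoding forward in time, so a joint network that rarely predicts blank will emit a symbol at nearly every step, which is consistent with the behaviour described above.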

noahchalifour commented 4 years ago

@nambee Thanks for the info. I have been having issues as well with training and inference. How many epochs did you train for on your test to get a loss of ~0.2? (Also, what hparams did you use?) I am trying to run a similar test to debug the issue.

NAM-hj commented 4 years ago

I set the max epochs to 1000. This is the generated "hparam.json" file:

{"token_type": "word-piece", "mel_bins": 80, "frame_length": 0.025, "frame_step": 0.01, "hertz_low": 125.0, "hertz_high": 7600.0, "embedding_size": 384, "encoder_layers": 3, "encoder_size": 2048, "projection_size": 640, "time_reduction_index": 1, "time_reduction_factor": 2, "pred_net_layers": 2, "pred_net_size": 2048, "joint_net_size": 640, "learning_rate": 0.0001, "vocab_size": 4088}

noahchalifour commented 4 years ago

@nambee I ran a test similar to yours and it seems as though the problem is resolved in the latest commit. I was able to train with a dataset of 4-5 audio samples and have the inference script predict the correct transcription.