I also have this problem with a different training dataset. When training on a small dataset for an experiment, I got a loss of about 1.023 with accuracy about 0.94. When using utils/decoding.py to decode wavs, I found that the decoding process always reaches the end too early: for example, when the total length of the encoder output is about 120, the process always stops at time step 10~20. And at every step, the predicted output is almost never blank (''), so the output string grows at every time step.

The correct decoded output should be like this:

I blank blank blank am blank blank blank a blank blank blank student

but now my decoded output is like this:

I am a blank student

I think the blank output is not working as expected for splitting the outputs and aligning them with the wav, and the output probability is mainly decided by the prediction_network.
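For reference, in greedy RNN-T decoding only a blank emission should advance the encoder time step; non-blank labels are emitted while staying on the same frame. A minimal sketch of that loop (the pred_step/joint callables and greedy_decode signature are a hypothetical interface for illustration, not this repo's exact API):

```python
import numpy as np

BLANK = 0  # index of the blank symbol in the vocabulary

def greedy_decode(enc_out, pred_step, init_state, joint, max_symbols_per_step=30):
    """Greedy RNN-T decoding over one utterance.

    enc_out:   [T, enc_dim] encoder outputs.
    pred_step: callable (label, state) -> (pred_out, new_state); the
               prediction network advanced by one label (hypothetical
               interface, not this repo's exact API).
    joint:     callable (enc_frame, pred_out) -> [vocab_size] logits.
    """
    hyp = []
    pred_out, state = pred_step(BLANK, init_state)  # start-of-sequence
    for t in range(enc_out.shape[0]):
        for _ in range(max_symbols_per_step):
            k = int(np.argmax(joint(enc_out[t], pred_out)))
            if k == BLANK:
                break  # blank consumes the frame: advance t, emit nothing
            hyp.append(k)  # non-blank: emit a label, stay on frame t,
            pred_out, state = pred_step(k, state)  # and update the pred net
    return hyp
```

If decoding instead advanced t on every output, it would terminate after roughly one step per emitted word, which would match the early stopping described above.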
@nambee Thanks for the info. I have been having issues as well with training and inference. How many epochs did you train for on your test to get the loss of ~0.2? (Also, which hparams did you use?) I am trying to run a similar test to debug the issue.
I set the max epoch to 1000, and this is the generated hparam.json file:
{"token_type": "word-piece", "mel_bins": 80, "frame_length": 0.025, "frame_step": 0.01, "hertz_low": 125.0, "hertz_high": 7600.0, "embedding_size": 384, "encoder_layers": 3, "encoder_size": 2048, "projection_size": 640, "time_reduction_index": 1, "time_reduction_factor": 2, "pred_net_layers": 2, "pred_net_size": 2048, "joint_net_size": 640, "learning_rate": 0.0001, "vocab_size": 4088}
@nambee I ran a test similar to yours and it seems as though the problem is resolved in the latest commit. I was able to train with a dataset of 4-5 audio samples and have the inference script predict the correct transcription.
Thank you for your nice code.
I tried to train the model with a very small dataset to check feasibility. I reduced the dataset to only 4 sentences (common_voice_en_19664034, 19664035, 19664037, 19664038.wav).
The loss, WER, and CER all go to zero.
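For reference, the WER and CER here are the usual edit-distance rates over words and characters; a minimal sketch of how they are computed (not the repo's own implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

assert wer("i am a student", "i am a student") == 0.0
assert cer("abc", "axc") == 1 / 3
```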
After training the model, I ran inference on the same wav files (transcribe_file.py):
python transcribe_file.py --checkpoint ~~.hdf5 --i ~~.wav
Can you give me some advice? Why can't I get proper outputs on a small dataset that the model trained on so well?
Thank you