glynpu opened this issue 3 years ago
After fixing some bugs, the WER on test-clean decreased from 3.97% to 3.32%, though it is still higher than espnet's 2.97%.
Great work!!
Sorry for my late response. I did not notice @glynpu's comment.
First, I just want to know whether the difference comes from the training part (probably so) or other parts.
Also, could you point out the main script to me? I want to check the overall training and inference flows and several hyper-parameters.
Thanks for your kind help! @sw005320
I want to check the learning curve. Could you share it?
TensorBoard for the above screenshot:
Did you compare the best validation accuracy (note that espnet uses teacher forcing when computing the accuracy, and it would be ~95.7% as written in the log)? We can compare them if we use the same (or a similar) BPE size.
Not yet. I will compare the differences between espnet and snowfall.
> Also, could you point out the main script to me? I want to check the overall training and inference flows and several hyper-parameters.
Currently this PR covers the training part, and #227 focuses on the decoding part.
For training, this shell script is the entry point:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python3 bpe_ctc_att_conformer_train.py \
  --bucketing-sampler True \
  --lr-factor 10.0 \
  --num-epochs 50 \
  --full-libri True \
  --max-duration 200 \
  --concatenate-cuts False \
  --world-size 4 \
  > train_log.txt
```
Some hyper-parameters are hard-coded in bpe_ctc_att_conformer_train.py or in the constructor of the Conformer class; a sketch of how they fit together follows the table. Some of them are listed below:

name | value |
---|---|
warmup_step | 40,000 |
att-rate | 0.7, i.e. ctc_weight=0.3 |
lsm_weight | 0.1 |
num_encoder_layers | 12 |
num_decoder_layers | 6 |
nhead | 8 |
atten_dim | 512 |
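For orientation, here is a minimal sketch of how these values might be passed to the model. The import path, argument names, and the feature/vocabulary sizes are my assumptions, so check bpe_ctc_att_conformer_train.py and the Conformer constructor for the actual signature:

```python
# Hypothetical sketch only: the import path and argument names are
# assumptions, not the verified snowfall API.
from snowfall.models.conformer import Conformer  # path is an assumption

model = Conformer(
    num_features=80,        # assumed 80-dim fbank input
    num_classes=5000,       # assumed BPE vocabulary size
    d_model=512,            # "atten_dim" from the table above
    nhead=8,                # attention heads
    num_encoder_layers=12,
    num_decoder_layers=6,
)
```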
For the decoding part, the latest decoding implementation is in #217, and I plan to port it to espnet after it's approved and finally merged into snowfall. The entry point for decoding is here:
```bash
if [ $stage -le 3 ]; then
  export CUDA_VISIBLE_DEVICES=2
  python bpe_ctc_att_conformer_decode.py \
    --max-duration=20 \
    --generate-release-model=False \
    --decode_with_released_model=True \
    --num-paths-for-decoder-rescore=500
fi
```
A core decoding function in bpe_ctc_att_conformer_decode.py is here.
Thanks! Did you use model averaging? If so, how do you pick up models (best loss?), and how many?
Did you tune it? It may not make a big difference, but we usually pick the 10-best models based on the validation accuracy for averaging.
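For reference, model averaging in these recipes is typically just an element-wise mean of the saved state_dicts. A minimal sketch, assuming one checkpoint per epoch is saved as a bare state_dict named epoch-N.pt; the paths and epoch range are illustrative:

```python
import torch

def average_checkpoints(paths):
    """Element-wise mean of the tensors in several checkpoints.

    Assumes each file holds a bare state_dict with identical keys.
    """
    avg = torch.load(paths[0], map_location="cpu")
    for path in paths[1:]:
        state = torch.load(path, map_location="cpu")
        for k in avg:
            avg[k] += state[k]
    for k in avg:
        if avg[k].is_floating_point():
            avg[k] /= len(paths)
        else:
            avg[k] //= len(paths)  # integer buffers, e.g. num_batches_tracked
    return avg

# e.g. average epochs 31-50, matching the "current" setup discussed below
paths = [f"exp/epoch-{i}.pt" for i in range(31, 51)]
averaged_state = average_checkpoints(paths)
```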
Latest results are:

| | before | current |
---|---|---|
Encoder + CTC | 3.32 | 2.98 (WER of the espnet released model is 2.97/3.00) |
Encoder + TLG + 4-gram lattice rescore + n-best rescore with transformer decoder (log_semiring=False, repeated tokens removed) | 2.73 | 2.54 |
The result difference between the current PR and espnet was solved by tuning training hyper-parameters with the following modifications:

| | feat-norm | learning-factor | warm-up steps | epochs |
---|---|---|---|---|
before | no | 10 | 40,000 | 40 epochs (avg=10, over epochs 26-35) |
current | yes | 5 | 80,000 (around 10 epochs) | 50 epochs (avg=20, over epochs 31-50) |
The reasons for the above modifications are: I realized that in espnet, one epoch contains around 3,000 batches; however, in my implementation, with max_duration=200, one epoch contains 6,000 batches.
In my experience, a smaller batch size calls for a smaller learning rate, so I halved the learning-rate factor. Since one epoch now contains 6,000 batches rather than 3,000, I doubled the warm-up steps.
The feat_batch_norm module also helps, improving the WER from 3.32 to 3.17.
As for 35 epochs --> 50 epochs, I set it arbitrarily to see what happens with more epochs.
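To make the interaction of learning-factor and warm-up steps concrete, below is a sketch of the standard transformer ("Noam") schedule; that this PR uses exactly this formula is my assumption. Under it, the peak learning rate is reached at step == warmup, so doubling the warm-up both delays the peak and, for a fixed factor, lowers it:

```python
# Sketch of the transformer ("Noam") learning-rate schedule; whether this
# PR uses exactly this formula is an assumption on my part.
def noam_lr(step: int, d_model: int = 512, factor: float = 5.0,
            warmup: int = 80_000) -> float:
    # LR rises linearly until step == warmup, then decays as step^-0.5.
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(80_000))                              # current: peak ~7.8e-4
print(noam_lr(40_000, factor=10.0, warmup=40_000))  # before:  peak ~2.2e-3
```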
BTW, I failed to increase max_duration beyond 200 because a larger max_duration easily causes OOM; 200 seems to be the largest my GPUs can handle.
I feel that 80,000 warm-up steps are too large. It requires more epochs for training to converge. I think you can find an optimal point with fewer warm-up steps and comparable performance.
Also, how about using 3,000 batches?
> I feel that 80,000 warm-up steps are too large. It requires more epochs for training to converge. I think you can find an optimal point with fewer warm-up steps and comparable performance.
80,000 is calculated to keep the warm-up span at around 10 epochs: 40,000 / 3,000 batches per epoch (espnet) = 80,000 / 6,000 batches per epoch (current PR).
> Also, how about using 3,000 batches?
6,000 batches --> 3,000 batches means max_duration = 200 --> max_duration = 400, which causes OOM in some batches. I am still analyzing the reason.
As we mentioned in person, I believe a problem with the current setup is that the transformer loss is being normalized (divided by the minibatch size) twice, once in a library function and once in the training script, while the CTC loss is only normalized once. If we had logged the 2 objectives separately, we likely would have noticed this.

I think that normalizing even once is not right, and that we should not normalize either of these objectives. The reason is that Librispeech has a wide range of durations, and the Lhotse sampler that we are using actually puts minibatches in bins where they have about the same duration (and approximately constant total duration in seconds), so in effect, right now we have a weight per frame that rises linearly with the sentence length. This will tend to cause convergence problems because longer sentences are harder to align.

Removing the normalization (division by len(texts)) should not require changes to learning rates, because we are using Adam and there is no weight decay. [However, as a separate issue, I think we should experiment with a very small weight decay, which will cause the system to train/converge faster.]
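To make the failure mode concrete, here is a small sketch with stand-in tensors (this illustrates the double normalization, not the actual snowfall code):

```python
import torch
import torch.nn.functional as F

batch_size = 8
logits = torch.randn(batch_size, 10)           # stand-in decoder outputs
targets = torch.randint(0, 10, (batch_size,))  # stand-in token targets

# Double normalization: the library "mean" already divides by the batch
# size, and the training script divides by len(texts) again.
att_loss = F.cross_entropy(logits, targets, reduction="mean") / batch_size

# The suggested fix: no per-batch normalization at all, so every
# token/frame carries the same weight regardless of utterance length.
att_loss_fixed = F.cross_entropy(logits, targets, reduction="sum")
```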
FYI, espnet did not normalize the CTC and attention loss by the length.
Does anyone have any pointers to visualizations of the decoder attention in the application of transformers to ASR? I want to get a feel for how it works.
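I'm not aware of a canonical reference offhand, but a quick way to get a feel for it is to plot each head's cross-attention matrix as a heatmap. A minimal sketch with a placeholder tensor; actually extracting the weights (e.g. via a forward hook on the decoder's multi-head attention module) is model-specific:

```python
import matplotlib.pyplot as plt
import torch

# Placeholder weights of shape (num_heads, decoder_tokens, encoder_frames);
# replace with attention captured from a real forward pass.
attn = torch.rand(8, 25, 300)

fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for head, ax in enumerate(axes.flat):
    ax.imshow(attn[head].numpy(), aspect="auto", origin="lower")
    ax.set_title(f"head {head}")
    ax.set_xlabel("encoder frames")
    ax.set_ylabel("decoder tokens")
fig.tight_layout()
fig.savefig("decoder_attention.png")
```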
As mentioned in #217, BPE training with CTC loss and label-smoothing loss in snowfall currently obtains a higher WER than espnet's.
The PROBLEM I am facing is:
- The WER of snowfall-trained models is still a little higher than that of the espnet-trained model: 3.32% > 2.97%. (fixed by correcting a data-preparation mistake)
- During espnet training, loss_att and loss_ctc always have the same order of magnitude, i.e. they decrease at the same pace. However, during snowfall training, loss_att decreases sharply to below 1.0 while loss_ctc stays 30 to 100 times larger than loss_att.
- espnet training log file: https://github.com/glynpu/bpe_training_log_files/blob/master/espnet-egs2-librispeech-asr1-exp-asr_train_asr_conformer7_n_fft512_hop_length256_raw_en_bpe5000_sp-train.log
- snowfall training log file of the WER 3.97% experiment: https://github.com/glynpu/bpe_training_log_files/blob/master/snowfall-egs-librispeech-asr-simple_v1-train_log.txt
- snowfall training log file of the WER 3.32% experiment: https://github.com/glynpu/bpe_training_log_files/blob/master/wer_3.32_June_26_snowfall_egs_librispeech-asr-simple_v1-train_log.txt

What I have tried to make comparable between espnet and snowfall: