k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall
Apache License 2.0

WIP: BPE Training ctc loss and label smooth loss #219

Open glynpu opened 3 years ago

glynpu commented 3 years ago

As mentioned in #217, BPE training with CTC loss and label-smoothing loss in snowfall currently obtains a higher WER than espnet's.

| decoding algorithm | training tool | WER (encoder + k2 ctc decode + no rescore) |
| --- | --- | --- |
| k2 ctc decode in #217 | espnet | 2.97 |
| k2 ctc decode in #217 | snowfall | 3.97 (updated June 24; with wrong data preparation) |
| k2 ctc decode in #217 | snowfall | 3.32 (updated June 28; data preparation mistake fixed) |
| k2 ctc decode in #217 | snowfall | 2.98 (updated July 21; added feature batch_norm and tuned training hyper-parameters) |
avg over 10 epochs (epochs 26-35):
INFO:root:[test-clean] %WER 3.33% [1749 / 52576, 171 ins, 140 del, 1438 sub ]
INFO:root:[test-other] %WER 8.06% [4218 / 52343, 397 ins, 403 del, 3418 sub ]
avg over 10 epochs (epochs 29-38):
INFO:root:[test-clean] %WER 3.32% [1744 / 52576, 184 ins, 136 del, 1424 sub ]
INFO:root:[test-other] %WER 7.96% [4167 / 52343, 402 ins, 367 del, 3398 sub ]

The PROBLEM I am facing: the WER of snowfall-trained models is still a little higher than that of the espnet-trained model, 3.32% vs. 2.97% (the earlier gap was fixed by correcting the data preparation mistake). During espnet training, loss_att and loss_ctc always have the same order of magnitude, i.e. they decrease at the same pace. During snowfall training, however, loss_att drops sharply to below 1.0 while loss_ctc stays 30 to 100 times larger than loss_att.

Training log files:

- espnet: https://github.com/glynpu/bpe_training_log_files/blob/master/espnet-egs2-librispeech-asr1-exp-asr_train_asr_conformer7_n_fft512_hop_length256_raw_en_bpe5000_sp-train.log
- snowfall (WER 3.97% run): https://github.com/glynpu/bpe_training_log_files/blob/master/snowfall-egs-librispeech-asr-simple_v1-train_log.txt
- snowfall (WER 3.32% run): https://github.com/glynpu/bpe_training_log_files/blob/master/wer_3.32_June_26_snowfall_egs_librispeech-asr-simple_v1-train_log.txt

What I have tried in order to make espnet and snowfall comparable:

  1. model structures: by successfully loading espnet released models into snowfall (using regular expressions to change key names in the state_dict, similar to this; see the sketch after this list), I believe the model structures are identical except for parameter names.
  2. loss functions: both use torch.nn.CTCLoss and torch.nn.KLDivLoss.
  3. normalization: both losses are normalized by batch_size in espnet and in snowfall.
  4. learning rate schedule: espnet uses WarmupLR while snowfall uses Noam. WarmupLR(optimizer.lr=0.0025, warmup_steps=40000) and Noam(model_size=512, factor=10.0, warm_step=40000) are quite similar, though not 100 percent identical.
  5. batching: each batch contains utterances of similar duration; espnet uses NumElementBatchSampler and snowfall uses BucketingSampler.
  6. token_ids: 5000 tokens in total. After the spm tokenizer is trained, the three special tokens (`<unk>`, `<s>`, `</s>`) are removed while the other 4997 tokens are kept. Then three tokens, "blank_id = 0; oov_id = 1; sos_eos_id = 4999", are manually added (reference).
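A minimal sketch of the key renaming in item 1 (the real mapping lives in the linked conversion code; the regex patterns below are illustrative assumptions, not the actual espnet/snowfall parameter names):

```python
import re
import torch

# Illustrative espnet -> snowfall key rewrites; the real rules come from
# the linked script, these two patterns are only examples.
RENAME_RULES = [
    (r"^encoder\.encoders\.", "encoder.layers."),
    (r"^decoder\.decoders\.", "decoder.layers."),
]

def convert_state_dict(espnet_state_dict):
    """Return a copy of the state_dict with keys renamed for snowfall."""
    converted = {}
    for key, tensor in espnet_state_dict.items():
        new_key = key
        for pattern, replacement in RENAME_RULES:
            new_key = re.sub(pattern, replacement, new_key)
        converted[new_key] = tensor
    return converted

# Usage: load the espnet checkpoint, rename keys, load into the snowfall model.
# checkpoint = torch.load("espnet_model.pth", map_location="cpu")
# model.load_state_dict(convert_state_dict(checkpoint))
```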
glynpu commented 3 years ago

After fixing some bugs, the WER on test-clean decreased from 3.97% to 3.32%, though it is still higher than espnet's 2.97%.

danpovey commented 3 years ago

Great work!!

sw005320 commented 3 years ago

Sorry for my late response. I did not notice @glynpu's comment.

First, I just want to know whether the difference comes from the training part (probably so) or other parts.

Also, could you point out the main script to me? I want to check the overall training and inference flows and several hyper-parameters.

glynpu commented 3 years ago

Thanks for your kind help! @sw005320

> I want to check the learning curve. Could you share it?

[screenshot: learning curves]

tensorboard for the above screenshot

> Did you compare the best validation accuracy (note that espnet uses teacher forcing when computing the accuracy, and it would be ~95.7% as written in the log)? We can compare them if we use the same (similar) BPE size.

Not yet. I will compare the differences between espnet and snowfall.
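For reference, the teacher-forcing accuracy espnet reports is essentially token accuracy with the decoder fed the gold history; a minimal sketch, where the tensor shapes and the ignore id are assumptions:

```python
import torch

def teacher_forcing_accuracy(decoder_logits, targets, ignore_id=-1):
    # decoder_logits: (batch, length, vocab) from a decoder run on the
    # ground-truth prefix; targets: (batch, length) gold token ids,
    # padded with ignore_id.
    pred = decoder_logits.argmax(dim=-1)
    mask = targets != ignore_id
    return (pred.eq(targets) & mask).sum().item() / mask.sum().item()
```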

> Also, could you point out the main script to me? I want to check the overall training and inference flows and several hyper-parameters.

Currently this PR covers the training part; #227 focuses on the decoding part.

For training, this shell script is the entry point:

  # Train the BPE CTC+attention Conformer on full LibriSpeech with 4 GPUs.
  export CUDA_VISIBLE_DEVICES="0, 1, 2, 3"
  python3 bpe_ctc_att_conformer_train.py \
    --bucketing-sampler True \
    --lr-factor 10.0 \
    --num-epochs 50 \
    --full-libri True \
    --max-duration 200 \
    --concatenate-cuts False \
    --world-size 4 \
    > train_log.txt
Some hyper-parameters are hard-coded in bpe_ctc_att_conformer_train.py or in the constructor of class Conformer. Some of them are listed below (see the loss sketch after the table):

| name | value |
| --- | --- |
| warmup_step | 40,000 |
| att-rate | 0.7, i.e. ctc_weight = 0.3 |
| lsm_weight | 0.1 |
| num_encoder_layers | 12 |
| num_decoder_layers | 6 |
| nhead | 8 |
| atten_dim | 512 |
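With att-rate 0.7 and lsm_weight 0.1, the combined objective looks roughly like this (a sketch under assumed tensor shapes, not the exact training-script code):

```python
import torch
import torch.nn.functional as F

def combined_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                  decoder_log_probs, padded_targets,
                  att_rate=0.7, lsm_weight=0.1, vocab_size=5000):
    # CTC branch: expects log-probs of shape (T, N, C).
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, reduction="sum")

    # Attention branch: label smoothing implemented with KL divergence.
    # Smoothed target distribution: 1 - lsm_weight on the gold token,
    # lsm_weight spread uniformly over the remaining vocabulary.
    # (Masking of padding positions is omitted here for brevity.)
    smoothed = torch.full_like(decoder_log_probs, lsm_weight / (vocab_size - 1))
    smoothed.scatter_(-1, padded_targets.unsqueeze(-1), 1.0 - lsm_weight)
    att = F.kl_div(decoder_log_probs, smoothed, reduction="sum")

    # att-rate 0.7 weights the attention loss; CTC gets 1 - 0.7 = 0.3.
    return att_rate * att + (1.0 - att_rate) * ctc
```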

For the decoding part, the latest decoding implementation is #217, and I plan to port it to espnet after it is approved and finally merged into snowfall. The entry point for decoding is here:

if [ $stage -le 3 ]; then
  # Decode with the released model on a single GPU.
  export CUDA_VISIBLE_DEVICES=2
  python bpe_ctc_att_conformer_decode.py \
    --max-duration=20 \
    --generate-release-model=False \
    --decode_with_released_model=True \
    --num-paths-for-decoder-rescore=500
fi

A core function of decoding in bpe_ctc_att_conformer_decode.py is here

sw005320 commented 3 years ago

Thanks! Did you use model averaging? If so, how do you pick the models (best loss?), and how many?

sw005320 commented 3 years ago

I found it: https://github.com/k2-fsa/snowfall/pull/217/files#diff-fd4e35e8e4b530ddf5ca285f24f2f92dfb6a0db691e75b4efe1dc59309654883R146-R151

Did you tune it? It may not make a big difference, but we usually pick the 10 best models based on validation accuracy for averaging.
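For reference, checkpoint averaging in this style of recipe is a plain parameter-wise mean over the selected epochs (a sketch; the checkpoint file layout and the "state_dict" key are assumptions):

```python
import torch

def average_checkpoints(paths):
    """Parameter-wise mean over a list of checkpoint files."""
    avg = None
    for path in paths:
        # Assumes each checkpoint stores parameters under "state_dict".
        state = torch.load(path, map_location="cpu")["state_dict"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. the "avg over epochs 26-35" result reported above:
# averaged = average_checkpoints([f"exp/epoch-{i}.pt" for i in range(26, 36)])
```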

glynpu commented 2 years ago
Latest results are:

| | before | current |
| --- | --- | --- |
| Encoder + ctc | 3.32 | 2.98 (WER of the espnet released model: 2.97/3.00) |
| Encoder + TLG + 4-gram lattice rescore + nbest rescore with transformer decoder (log_semiring=False, repeated tokens removed) | 2.73 | 2.54 |
The result difference between the current PR and espnet is solved by tuning the training hyper-parameters with the following modifications:

| | feat-norm | learning-factor | warm-up steps | epochs |
| --- | --- | --- | --- | --- |
| before | no | 10 | 40,000 | 40 epochs (avg=10, epochs 26-35) |
| current | yes | 5 | 80,000 (around 10 epochs) | 50 epochs (avg=20, epochs 31-50) |

The reason for these modifications: I realized that in espnet, one epoch contains around 3,000 batches, whereas in my implementation, with max_duration=200, one epoch contains 6,000 batches.

As a matter of experience, a smaller batch size calls for a smaller learning rate, so I halved the learning rate. Since one epoch now contains 6,000 batches rather than 3,000, I doubled the warm-up steps.
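For context, the standard Noam schedule ties the peak learning rate to warm_step, so doubling warm_step also lowers the peak (a sketch of the usual formula; snowfall's Noam implementation may differ in details):

```python
def noam_lr(step, model_size=512, factor=5.0, warm_step=80000):
    # Linear warm-up for warm_step steps, then decay proportional to
    # step ** -0.5; step must be >= 1.
    # Peak lr = factor * (model_size * warm_step) ** -0.5.
    return factor * model_size ** -0.5 * min(step ** -0.5,
                                             step * warm_step ** -1.5)
```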

The feat_batch_norm module also helps, improving the WER from 3.32 to 3.17.

As for 35 epochs --> 50 epochs, I just set it arbitrarily to see what happens with more epochs.

BTW, I failed to increase max_duration beyond 200 because a larger max_duration easily causes OOM; 200 seems to be the largest my GPUs can handle.

sw005320 commented 2 years ago

I feel that 80,000 warm-up steps are too many. It requires more epochs for training to converge. I think you can find an optimal point with fewer warm-up steps and comparable performance.

Also, how about using the 3000 batches?

glynpu commented 2 years ago

> I feel that 80,000 warm-up steps are too many. It requires more epochs for training to converge. I think you can find an optimal point with fewer warm-up steps and comparable performance.

80,000 is calculated to keep the warm-up phase the same length in epochs: around 10 epochs = 40,000 steps / 3,000 batches per epoch (espnet) = 80,000 steps / 6,000 batches per epoch (current PR).

> Also, how about using the 3000 batches?

Going from 6,000 batches to 3,000 batches means max_duration = 200 --> max_duration = 400, which causes OOM in some batches. I am still analysing the reason.

danpovey commented 2 years ago

As we mentioned in person, I believe a problem with the current setup is that the transformer loss is being normalized (divided by the minibatch size) twice, once in a library function and once in the training script, while the CTC loss is only normalized once. If we had logged the 2 objectives separately, we likely would have noticed this.

I think that normalizing even once is not right, and that we should not normalize either of these objectives. The reason is that Librispeech has a wide range of durations, and the Lhotse sampler that we are using actually puts minibatches in bins where they have about the same duration (and approximately constant total duration in seconds), so in effect, right now we have a weight per frame that rises linearly with the sentence length. This will tend to cause convergence problems because longer sentences are harder to align.

Removing the normalization (division by len(texts)) should not require changes to learning rates, because we are using Adam and there is no weight decay. [However, as a separate issue, I think we should experiment with a very small weight decay, which will cause the system to train/converge faster.]
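In code terms, the mismatch would look something like this (a schematic reconstruction with made-up numbers, not the actual snowfall source):

```python
att_rate = 0.7
batch_size = 8  # len(texts)

def library_att_loss(raw_sum):
    # The label-smoothing helper already divides the summed loss by batch size.
    return raw_sum / batch_size

raw_att_sum, raw_ctc_sum = 240.0, 960.0  # made-up summed losses

# Current (buggy): the attention loss is normalized twice, CTC only once.
buggy = (att_rate * library_att_loss(raw_att_sum) / batch_size
         + (1 - att_rate) * raw_ctc_sum / batch_size)

# Suggested: no per-minibatch normalization for either objective, since the
# bucketing sampler keeps the total duration per minibatch roughly constant.
suggested = att_rate * raw_att_sum + (1 - att_rate) * raw_ctc_sum
```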

sw005320 commented 2 years ago

FYI, espnet did not normalize the CTC and attention loss by the length.

danpovey commented 2 years ago

Does anyone have any pointers to visualizations of the decoder attention in the application of transformers to ASR? I want to get a feel for how it works.