k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

How to finetune with a k2 pretrained model to get a comparable result? #926

Closed nullscc closed 7 months ago

nullscc commented 1 year ago

Reopening an issue to continue the discussion from https://github.com/k2-fsa/icefall/issues/925, as this is a separate topic.

For now, I have the following finetuning results from directly loading the GigaSpeech pretrained-iter-3488000-avg-15.pt and finetuning on 2.4k hours of data with /home/zhiwei001/icefall/egs/sgeng/ASR/pruned_transducer_stateless2/train.py without any modification:

|   | gigaspeech-test | testset2 | testset3 | testset4 | testset5 |
| -- | -- | -- | -- | -- | -- |
| wenet model finetuned on gigaspeech | 17.68 | 15.26 | 9.8 | 11.6 | 8.51 |
| epoch-0 (from k2 gigaspeech pretrained model) | 10.52 | 44.02 | 68.02 | 53.67 | 23.21 |
| epoch-3 | 52.14 | 32.3 | 21.7 | 25.68 | 14.8 |
| epoch-9 | 42.08 | 26.12 | 18.29 | 20.88 | 12.81 |
| epoch-12 | 47.51 | 29.79 | 19.89 | 23.09 | 14.09 |
| epoch-19 | 38.59 | 25.54 | 17.79 | 20.18 | 12.37 |
| epoch-28 | 39.22 | 24.41 | 17.29 | 19.36 | 12.22 |
| epoch-32 | 37.4 | 23.56 | 16.85 | 19.14 | 12.14 |
| epoch-35 | 39.03 | 23.6 | 16.96 | 19.36 | 12.19 |

And the TensorBoard result:

[TensorBoard screenshot]

Compared with the k2 pretrained model's loss curve at https://tensorboard.dev/experiment/zmmM0MLASnG1N2RmJ4MZBw/#scalars, is this normal? Do I just need to wait and the model will converge on its own over time?

nullscc commented 1 year ago

From https://github.com/k2-fsa/icefall/issues/925#issuecomment-1443810926 and https://github.com/k2-fsa/icefall/issues/925#issuecomment-1443892841, I gathered some tricks for finetuning from a k2 pretrained model (please correct me if I'm wrong):

  1. start the learning rate at twice the value stored in the .pt file, or just keep the stored value unchanged (train.py will load it by default)
  2. a larger learning rate (the stored value, or twice it) tends to make the model noisy and "undo" previously learned things, but it also helps the model learn your data quickly
  3. dump the averaged-over-time version of the model and start from that point; it will be less noisy (i.e. load a .pt saved with pruned_transducer_stateless4/train.py)

What I did was simply keep the learning rate stored in the .pt file and start from the averaged-over-epochs version (see the sketch below).
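For reference, averaging checkpoints is roughly the following; a minimal sketch in plain PyTorch, assuming the weights sit under a "model" key as in the checkpoints above (icefall ships its own averaging code, so this is only an illustration, and the file names are hypothetical):

```python
from pathlib import Path
from typing import List

import torch


def average_model_checkpoints(ckpt_paths: List[Path]) -> dict:
    """Average the model weights of several checkpoints.

    Assumes each file is a dict with the parameters under the "model" key,
    which is how the checkpoints in this thread appear to be saved.
    """
    avg = None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(ckpt_paths)
    return avg


# Example: average the last few epochs and save the result as a new .pt
# (epoch range and paths are hypothetical).
paths = [Path(f"exp/epoch-{i}.pt") for i in range(30, 36)]
torch.save({"model": average_model_checkpoints(paths)}, "exp/averaged.pt")
```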

danpovey commented 1 year ago

I think the issue may be that the learning rate starts off way too high. That would be why the gigaspeech-test results are way worse on the initial iterations. It is forgetting what it learned originally. Look at the learning_rate plots in your screenshot. You need to start from a much lower learning rate, and use a different LR scheduler. In fact, I think it may be safer to start from the final learning rate from the .pt file, and decay it from there, by modifying the LR scheduler.

The learning rate is set by the scheduler; just passing it into the optimizer will not necessarily have any effect. Look at the parts of the code that involve the scheduler. If you use a different scheduler it may have different usage; we use slightly non-standard schedulers, but all the code involving them is in train.py so it should be easy to change.

The main thing to watch for is that some schedulers have step() once per epoch, and some once per minibatch, so you need to watch out for when you are calling step(). You can check the log files or the stderr output for what learning rate it is actually using; that is printed out.
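In plain PyTorch terms (not our scheduler; the optimizer, decay factor, and numbers below are only illustrative), starting from a low learning rate and decaying it per minibatch looks roughly like this:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import ExponentialLR

# Stand-ins for the real acoustic model and data, just to show the LR handling.
model = nn.Linear(80, 500)
dummy_batches = [torch.randn(16, 80) for _ in range(300)]

# Start near the *final* LR of the pretrained run instead of a fresh high peak.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Decay gently from that starting point; gamma is applied per scheduler step.
scheduler = ExponentialLR(optimizer, gamma=0.999)

for batch_idx, feats in enumerate(dummy_batches):
    loss = model(feats).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # This scheduler is stepped once per minibatch; others expect one step per
    # epoch -- exactly the pitfall about when step() is called.
    scheduler.step()
    if batch_idx % 100 == 0:
        print("current lr:", scheduler.get_last_lr()[0])
```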

nullscc commented 1 year ago

> I think the issue may be that the learning rate starts off way too high. That would be why the gigaspeech-test results are way worse on the initial iterations. It is forgetting what it learned originally. Look at the learning_rate plots in your screenshot. You need to start from a much lower learning rate, and use a different LR scheduler. In fact, I think it may be safer to start from the final learning rate from the .pt file, and decay it from there, by modifying the LR scheduler.
>
> The learning rate is set by the scheduler; just passing it into the optimizer will not necessarily have any effect. Look at the parts of the code that involve the scheduler. If you use a different scheduler it may have different usage; we use slightly non-standard schedulers, but all the code involving them is in train.py so it should be easy to change.
>
> The main thing to watch for is that some schedulers have step() once per epoch, and some once per minibatch, so you need to watch out for when you are calling step(). You can check the log files or the stderr output for what learning rate it is actually using; that is printed out.

Thank you. I realized I made a stupid mistake, as you said: I ignored the learning rate value printed in the log and shown in the TensorBoard graph.

Now I've started training from lr 0.0001, which is twice the final lr of the .pt file, by setting initial-lr to 0.0001 and letting train.py skip loading the batch_idx_train value (which is the value the scheduler steps to and which influences the learning rate). Hopefully it will give a good result.
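To illustrate why batch_idx_train matters (in generic PyTorch terms, not the icefall scheduler; the counter value and decay factor below are made up): the scheduler's step count determines the current learning rate, so restoring the stored counter would fast-forward the decay, while skipping it really starts from initial-lr.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(10, 10)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # analogue of --initial-lr
sched = ExponentialLR(opt, gamma=0.9999)

# If the stored batch counter were restored, the schedule would effectively be
# fast-forwarded to that position, so the LR is already far below initial-lr:
stored_batch_idx_train = 50_000  # hypothetical value from the pretrained run
for _ in range(stored_batch_idx_train):
    sched.step()
print(sched.get_last_lr()[0])  # much smaller than 1e-4

# Skipping the stored counter keeps the schedule at step 0, so training really
# starts from the chosen initial-lr and decays from there.
```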

marcoyang1998 commented 1 year ago

Hi @nullscc, we created an example fine-tune script in #944. The pipeline works well for LibriSpeech -> GigaSpeech. You may have a look.

nullscc commented 1 year ago

@marcoyang1998 Thank you. I'll give it a try, since I still can't get a good result on my data, and I get a very bad result on the GigaSpeech dataset.

nullscc commented 1 year ago

@marcoyang1998 I'm running your finetune script with the following modifications (because I need to finetune from the GigaSpeech pruned_transducer_stateless2 pretrained model and there is no pretrained GigaSpeech zipformer model):

  1. change the zipformer to the conformer (by replacing decoder.py, model.py, joiner.py accordingly)
  2. change compute_loss according to the latest librispeech/pruned_transducer_stateless2/train.py (the main difference is the warmup param there; see the sketch after my question below)

Am I doing the right thing?
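For context, the warmup handling I mean in point 2 is roughly the following; a minimal sketch based on my own reading of pruned_transducer_stateless2, so the thresholds, scales, and default warm step here are assumptions, and the real train.py is the reference:

```python
def warmup_from_step(batch_idx_train: int, model_warm_step: int = 3000) -> float:
    """Warmup fraction passed into compute_loss; grows from 0 toward >1."""
    return batch_idx_train / model_warm_step


def combine_losses(simple_loss: float, pruned_loss: float, warmup: float) -> float:
    """Down-weight the pruned loss while the model is still warming up.

    The 0.0 / 0.1 early-warmup scales are illustrative assumptions only.
    """
    pruned_scale = 0.0 if warmup < 0.5 else (0.1 if warmup < 1.0 else 1.0)
    return simple_loss + pruned_scale * pruned_loss
```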

marcoyang1998 commented 1 year ago

It seems correct.

I would recommend you write a finetune.py under pruned_transducer_stateless2. In this way, you won't need to remove any files. You can compare pruned_transducer_stateless7/finetune.py and pruned_transducer_stateless7/train.py in the LibriSpeech recipe to understand how to load only the model parameters. This will help you write your own finetune.py.
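The core of it is loading the model weights and nothing else; a rough sketch of the idea (the checkpoint layout assumed here is illustrative, and the real load_model_params in finetune.py is the reference):

```python
from typing import Optional

import torch
from torch import nn


def load_model_params_sketch(ckpt_path: str, model: nn.Module) -> Optional[dict]:
    """Initialize `model` from a pretrained checkpoint, weights only.

    Returns None so the caller never sees optimizer/scheduler state and
    therefore cannot accidentally restore it.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Checkpoints in this thread appear to keep the weights under "model";
    # otherwise treat the file itself as a state_dict.
    state_dict = ckpt.get("model", ckpt)
    model.load_state_dict(state_dict, strict=True)
    return None
```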

nullscc commented 1 year ago

> It seems correct.
>
> I would recommend you write a finetune.py under pruned_transducer_stateless2. In this way, you won't need to remove any files. You can compare pruned_transducer_stateless7/finetune.py and pruned_transducer_stateless7/train.py in the LibriSpeech recipe to understand how to load only the model parameters. This will help you write your own finetune.py.

I just copied the whole pruned_transducer_stateless7 dir and only modified the necessary files to get it working with the conformer and my own dataset.

You mentioned "only load the model parameters"; does that refer to the method load_model_params? If one wants to finetune, does he/she need to load the scheduler or optimizer? I noticed that your script will still load the scheduler and optimizer (it has no effect on your experiments because there is no optimizer or scheduler state stored in the model you loaded for finetuning).

marcoyang1998 commented 1 year ago

> If one wants to finetune, does he/she need to load the scheduler or optimizer?

We recommend loading only the model parameters. This is done via load_model_params.

We won't load the optimizer or scheduler from the finetune checkpoint, because the return value of load_model_params is None.

The optimizer and scheduler are only loaded if you stop the finetune process and want to resume from an epoch. In that case, you need to set do-finetune=False.
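Roughly, the checkpoint-loading logic in finetune.py behaves like the following simplified sketch (not the literal code; the params attribute names are illustrative):

```python
from typing import Optional

import torch


def setup_checkpoint(params, model, optimizer, scheduler) -> Optional[dict]:
    """Decide what to restore, depending on whether this is a fresh finetune."""
    if params.do_finetune:
        # Fresh finetune: copy the pretrained weights only and return None,
        # so the optimizer/scheduler branch below is skipped entirely
        # (this mirrors what load_model_params does).
        ckpt = torch.load(params.finetune_ckpt, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        return None

    # Resuming an interrupted finetune run (do-finetune=False): restore the
    # full training state saved by this run.
    checkpoints = torch.load(params.resume_ckpt, map_location="cpu")
    model.load_state_dict(checkpoints["model"])
    if "optimizer" in checkpoints:
        optimizer.load_state_dict(checkpoints["optimizer"])
    if "scheduler" in checkpoints:
        scheduler.load_state_dict(checkpoints["scheduler"])
    return checkpoints
```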

nullscc commented 1 year ago

@marcoyang1998 Oh, I see, I misunderstood that.