k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

.pt models obtained using pruned_transducer_stateless7/export.py have around 1% higher WER than the averaged .pt using pruned_transducer_stateless7/decode.py on librispeech #1024

Closed uni-manjunath-ke closed 1 year ago

uni-manjunath-ke commented 1 year ago

Hi all and @csukuangfj, we have trained a custom English model using pruned_transducer_stateless7. We tried decoding the LibriSpeech test-clean set in two ways, in both cases averaging the last 15 epoch checkpoints (--avg 15):

  1. We generated "pretrained.pt" model using pruned_transducer_stateless7/export.py, and then used this pretrained.pt model, and obtained a WER of 17.68%.
  2. We decoded using pruned_transducer_stateless7/decode.py, and obtained a WER of 16.47%.

Going through export.py worsens the WER by more than 1.2% absolute, which is very significant. I think this is a bug and requires immediate attention. Could you please help with this? Thanks.

csukuangfj commented 1 year ago

Could you show the complete command you used to produce pretrained.pt?

uni-manjunath-ke commented 1 year ago

./pruned_transducer_stateless7/export.py \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --bpe-model data/lang_bpe_500/bpe.model \
  --epoch 30 \
  --avg 15

marcoyang1998 commented 1 year ago

Could you please check if you set --use-averaged-model to True when exporting the models?

uni-manjunath-ke commented 1 year ago

Hi, it is set to True by default in export.py. Please check the snippet below from export.py.

parser.add_argument(
    "--use-averaged-model",
    type=str2bool,
    default=True,
    help="Whether to load averaged model. Currently it only supports "
    "using --epoch. If True, it would decode with the averaged model "
    "over the epoch range from `epoch-avg` (excluded) to `epoch`."
    "Actually only the models with epoch number of `epoch-avg` and "
    "`epoch` are loaded for averaging. ",
)
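
For context, the two averaging modes combine different quantities, so a WER gap between them is not surprising. Below is a rough sketch of the idea, using plain dicts of floats in place of real model state dicts; the function names are illustrative only, not icefall's actual API:

```python
# Illustrative sketch only: plain dicts stand in for model state dicts,
# and these helper names are hypothetical, not icefall's real functions.

def average_checkpoints(state_dicts):
    """Roughly what --use-averaged-model False does: an elementwise
    average of the last N saved epoch checkpoints (e.g. epoch-16.pt
    through epoch-30.pt for --epoch 30 --avg 15)."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

def interval_average(avg_end, count_end, avg_start, count_start):
    """Roughly what --use-averaged-model True does: recover the average
    over the interval (start, end] from two running averages maintained
    during training, where count_* is the number of updates each
    running average covers."""
    n = count_end - count_start
    return {
        k: (avg_end[k] * count_end - avg_start[k] * count_start) / n
        for k in avg_end
    }
```

With False, only the N checkpoint snapshots enter the average; with True, the interval average is reconstructed from running averages that cover every update in the epoch range, so the resulting parameters (and hence WERs) generally differ.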

Thanks

csukuangfj commented 1 year ago

How do you use pretrained.pt for decoding?

marcoyang1998 commented 1 year ago

@uni-manjunath-ke Sorry, I cannot reproduce your findings. I get exactly the same decoding results.

Here is what I did:

./pruned_transducer_stateless7/export.py \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --bpe-model data/lang_bpe_500/bpe.model \
  --epoch 30 \
  --avg 15 \
  --use-averaged-model True

cd ./pruned_transducer_stateless7/exp
ln -s pretrained.pt epoch-999.pt
cd ../..

./pruned_transducer_stateless7/decode.py \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --bpe-model data/lang_bpe_500/bpe.model \
  --epoch 999 \
  --avg 1 \
  --use-averaged-model False

After doing this, I get the same decoding results as with --epoch 30 --avg 15 --use-averaged-model True.

Could you please show the decoding log if it still doesn't work for you?

uni-manjunath-ke commented 1 year ago

How do you use pretrained.pt for decoding?

Hi @csukuangfj, we do exactly what @marcoyang1998 described in https://github.com/k2-fsa/icefall/issues/1024#issuecomment-1525912776:

./pruned_transducer_stateless7/export.py \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --bpe-model data/lang_bpe_500/bpe.model \
  --epoch 30 \
  --avg 15

cd ./pruned_transducer_stateless7/exp
ln -s pretrained.pt epoch-3015.pt
cd ../..

python3 ./pruned_transducer_stateless7/decode.py \
    --decoding-method $decoding_method \
    --manifest-dir $manifest_dir \
    --cut-set-name $expt_name \
    --use-averaged-model False \
    --on-the-fly-feats True \
    --bpe-model $bpe_model \
    --max-duration 100 \
    --exp $model_dir \
    --num-workers 30 \
    --epoch 3015 \
    --avg 1

Thanks

uni-manjunath-ke commented 1 year ago

@uni-manjunath-ke Sorry, I cannot reproduce your findings. I get exactly the same decoding results.

But, we again repeated the experiments and confirmed that there is a difference in the WERs, as below.

Method I:

./pruned_transducer_stateless7/export.py \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --bpe-model data/lang_bpe_500/bpe.model \
  --epoch 30 \
  --avg 15

cd ./pruned_transducer_stateless7/exp
ln -s pretrained.pt epoch-3015.pt
cd ../..

python3 ./pruned_transducer_stateless7/decode.py \
    --decoding-method $decoding_method \
    --manifest-dir $manifest_dir \
    --cut-set-name $expt_name \
    --use-averaged-model False \
    --on-the-fly-feats True \
    --bpe-model $bpe_model \
    --max-duration 100 \
    --exp $model_dir \
    --num-workers 30 \
    --epoch 3015 \
    --avg 1

This has a WER of 17.72% with greedy_search.

Method II:

python3 ./pruned_transducer_stateless7/decode.py \
    --decoding-method $decoding_method \
    --manifest-dir $manifest_dir \
    --cut-set-name $expt_name \
    --use-averaged-model False \
    --on-the-fly-feats True \
    --bpe-model $bpe_model \
    --max-duration 100 \
    --exp $model_dir \
    --num-workers 30 \
    --epoch 30 \
    --avg 15

Method 1 has a WER of 17.72%, whereas Method 2 has a WER of 16.48% (both using greedy_search). Both methods use "--use-averaged-model False". Of course, if we pass "--use-averaged-model True" in Method 2, we get a WER of 17.7%, which is the same as the Method 1 WER. But we are interested in achieving the lower WER of Method 2 using Method 1.

We tried passing "--use-averaged-model True" with Method I, but it gives an error saying the "3014.pt model" is not found for averaging. So, could you please suggest how to achieve the lower WER (of 16.48%) using Method 1 (i.e., through export.py)?

Thanks

desh2608 commented 1 year ago

If you want to get the same WER with method 1, you just need to export the model with --use-averaged-model False, like you are doing in Method 2.

In general, though, you can try averaging over fewer checkpoints with --use-averaged-model True (e.g., LibriSpeech uses 9) and see if that improves WER.

uni-manjunath-ke commented 1 year ago

ok Thanks @desh2608 . Will check and get back.

uni-manjunath-ke commented 1 year ago

--use-averaged-model False with export.py gives expected WER. Thanks for the suggestion @desh2608 .

However, it is a little confusing that export.py has --use-averaged-model set to True by default, whereas decode.py has it set to False by default. Is it planned to make this consistent across scripts in future releases? Thanks.

desh2608 commented 1 year ago

--use-averaged-model False with export.py gives expected WER. Thanks for the suggestion @desh2608 .

However, it is a little confusing that export.py has --use-averaged-model set to True by default, whereas decode.py has it set to False by default. Is it planned to make this consistent across scripts in future releases? Thanks.

I think it is True by default in both, at least for the LibriSpeech pruned_transducer_stateless7. See:

https://github.com/k2-fsa/icefall/blob/61ec3a7a8fc8be859b23a821e568950fb898b37a/egs/librispeech/ASR/pruned_transducer_stateless7/decode.py#L212

https://github.com/k2-fsa/icefall/blob/61ec3a7a8fc8be859b23a821e568950fb898b37a/egs/librispeech/ASR/pruned_transducer_stateless7/export.py#L137

Perhaps you changed something locally.

uni-manjunath-ke commented 1 year ago

Yes, thank you very much. True, that was a local edit.