k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Extend tokens.txt with new tokens on pretrained model #1552

Closed by gorosei-dev 2 months ago

gorosei-dev commented 2 months ago

Suppose I want to continue training a pretrained model on more data, but the new data contains tokens that are not covered by the existing tokens.txt / bpe.model. I want the new model to recognize these new tokens. How can I achieve this without retraining from scratch?

JinZr commented 2 months ago

You can reuse all parameters of your pre-trained model except those of the output layer. Also remember to update the lang_dir you use for the later fine-tuning.
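
A minimal sketch of the "reuse everything except the output layer" idea, not icefall's own code. It keeps only the pretrained entries whose name and shape still match the new model, so any layer sized by the vocabulary is dropped automatically. The parameter names, the vocabulary sizes (500 and 520), the helper `keep_compatible`, and the `FakeTensor` stand-in are all hypothetical; `FakeTensor` is there only so the example runs without PyTorch.

```python
from collections import namedtuple

# Hypothetical stand-in for a tensor. Only the .shape attribute is used,
# so real tensors taken from model.state_dict() could be passed instead.
FakeTensor = namedtuple("FakeTensor", ["shape"])


def keep_compatible(pretrained, target):
    """Split pretrained entries into those that fit the target model and the rest."""
    kept, skipped = {}, []
    for name, value in pretrained.items():
        # Keep an entry only if the new model has it with an identical shape.
        if name in target and tuple(value.shape) == tuple(target[name].shape):
            kept[name] = value
        else:
            skipped.append(name)
    return kept, skipped


# Made-up checkpoint: the old vocabulary has 500 tokens.
pretrained = {
    "encoder.weight": FakeTensor((256, 80)),
    "output.weight": FakeTensor((500, 256)),
    "output.bias": FakeTensor((500,)),
}
# Made-up new model: the extended vocabulary has 520 tokens.
target = {
    "encoder.weight": FakeTensor((256, 80)),
    "output.weight": FakeTensor((520, 256)),
    "output.bias": FakeTensor((520,)),
}

kept, skipped = keep_compatible(pretrained, target)
print(sorted(kept))     # ['encoder.weight']
print(sorted(skipped))  # ['output.bias', 'output.weight']
```

With a real PyTorch model, the `kept` dictionary could then be passed to `model.load_state_dict(kept, strict=False)`, which leaves the skipped layers at their fresh initialization so they are learned during fine-tuning. Filtering by shape also catches other vocabulary-dependent layers, such as a token embedding, if the model has any.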