Closed mhmd-mst closed 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello, I am working on code for a paper on a multilingual model with multistage finetuning. I am using the Hugging Face Trainer API to finetune a pretrained English-to-Japanese model on a dataset containing Vietnamese sentences. Before that, I want to modify the pretrained tokenizer by adding the tokens from another pretrained model's tokenizer that recognizes Vietnamese:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MBart50Tokenizer

model_checkpoint = "Helsinki-NLP/opus-tatoeba-en-ja"
model_mbart = "facebook/mbart-large-50-one-to-many-mmt"

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
mbart_tokenizer = MBart50Tokenizer.from_pretrained(model_mbart)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Add every piece from the mBART-50 vocabulary to the tokenizer, then
# resize the model's embedding matrix to match the new vocabulary size.
tokenizer.add_tokens(list(mbart_tokenizer.get_vocab().keys()))
model.resize_token_embeddings(len(tokenizer))
```
Before modifying the tokenizer, tokenizing a Vietnamese sentence gave me 72 tokens; when I tokenize the same sentence with the modified tokenizer I get 66 tokens, whereas with mBART-50 I get 24. Can you please tell me what I am doing wrong?
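For context on why the counts barely change, here is a minimal, dependency-free sketch (a toy class, not the real Hugging Face internals) of how added tokens behave: `add_tokens` entries are matched only as exact whole strings, while everything else still falls through to the unchanged base subword model. Note also that SentencePiece vocabularies such as mBART-50's store word-initial pieces with a `▁` prefix, so those raw vocabulary keys may never match plain words in running text.

```python
# Toy illustration (hypothetical ToyTokenizer, not Hugging Face internals):
# added tokens are stored separately and matched as exact whole strings;
# the base "subword" model (here: character fallback) is never retrained.

class ToyTokenizer:
    def __init__(self):
        self.added = set()  # holds tokens registered via add_tokens()

    def add_tokens(self, tokens):
        # Added tokens go into a side table; the base model is unchanged.
        self.added.update(tokens)

    def tokenize(self, text):
        out = []
        for word in text.split():
            if word in self.added:
                out.append(word)        # exact whole-string match
            else:
                out.extend(list(word))  # base model fragments the word
        return out

tok = ToyTokenizer()
print(len(tok.tokenize("xin chào")))  # every word fragments into characters

# Pieces copied from a SentencePiece vocab often carry a "▁" prefix,
# so they never match the bare word "xin" in running text:
tok.add_tokens(["▁xin"])
print(len(tok.tokenize("xin chào")))  # count is unchanged

tok.add_tokens(["xin"])
print(len(tok.tokenize("xin chào")))  # only now does "xin" become one token
```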