Closed mhmd-mst closed 2 years ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello, I am working on code for a paper on a multilingual model with multistage finetuning. I am using the Hugging Face Trainer API to finetune a pretrained English-to-Japanese model on a dataset containing Vietnamese sentences. Before that, I want to modify the pretrained tokenizer by adding the tokens from another pretrained model's tokenizer that recognizes Vietnamese:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MBart50Tokenizer

model_checkpoint = "Helsinki-NLP/opus-tatoeba-en-ja"
model_mbart = "facebook/mbart-large-50-one-to-many-mmt"

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
mbart_tokenizer = MBart50Tokenizer.from_pretrained(model_mbart)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Add every piece from the mBART-50 vocabulary to the tokenizer, then
# resize the model's embedding matrix to match the new vocabulary size.
tokenizer.add_tokens(list(mbart_tokenizer.get_vocab().keys()))
model.resize_token_embeddings(len(tokenizer))
```
Before modifying the tokenizer, tokenizing a Vietnamese sentence gave me 72 tokens; when I tokenize the same sentence with the modified tokenizer I get 66 tokens, whereas with mBART-50 I get 24. Can you please tell me what I am doing wrong?
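For context on why the counts barely change, here is a minimal, dependency-free sketch (a toy class, not the real Hugging Face internals) of how added tokens behave: `add_tokens` entries are matched only as exact whole strings, while everything else still falls through to the unchanged base subword model. Note also that SentencePiece vocabularies such as mBART-50's store word-initial pieces with a `▁` prefix, so those raw vocabulary keys may never match plain words in running text.

```python
# Toy illustration (hypothetical ToyTokenizer, not Hugging Face internals):
# added tokens are stored separately and matched as exact whole strings;
# the base "subword" model (here: character fallback) is never retrained.

class ToyTokenizer:
    def __init__(self):
        self.added = set()  # holds tokens registered via add_tokens()

    def add_tokens(self, tokens):
        # Added tokens go into a side table; the base model is unchanged.
        self.added.update(tokens)

    def tokenize(self, text):
        out = []
        for word in text.split():
            if word in self.added:
                out.append(word)        # exact whole-string match
            else:
                out.extend(list(word))  # base model fragments the word
        return out

tok = ToyTokenizer()
print(len(tok.tokenize("xin chào")))  # every word fragments into characters

# Pieces copied from a SentencePiece vocab often carry a "▁" prefix,
# so they never match the bare word "xin" in running text:
tok.add_tokens(["▁xin"])
print(len(tok.tokenize("xin chào")))  # count is unchanged

tok.add_tokens(["xin"])
print(len(tok.tokenize("xin chào")))  # only now does "xin" become one token
```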