Open rajanish4 opened 4 weeks ago
Yes, we had a deprecation cycle and this attribute was removed 😉
Thanks, but then how can I provide the language code for translation?
You should simply do tokenizer.encode("deu_Latn")[0]
Then why do the docs say otherwise? This is v4.42.0.
I also don't understand how to use tokenizer.encode("deu_Latn")[0]. What's the keyword? Is this a positional argument? @ArthurZucker
There seems to be an error: whatever language code I give the NLLB tokenizer, it always outputs the English token ID. My version is v4.42.3.
@ArthurZucker: I think tokenizer.encode("deu_Latn")[0] is the regular BOS token, and tokenizer.encode("deu_Latn")[1] is the expected token.
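One way to sidestep the index question is to look the language code up directly in the vocabulary instead of encoding it. As far as I understand, encode() wraps its input in special tokens (for NLLB, the source-language code in front and EOS at the end), which would explain why index 0 does not give the target language. The sketch below assumes the tokenizer exposes convert_tokens_to_ids and unk_token_id, as Hugging Face tokenizers do; the helper name lang_code_to_token_id is hypothetical and not part of transformers.

```python
def lang_code_to_token_id(tokenizer, lang_code):
    """Return the vocabulary ID of a language-code token such as "deu_Latn".

    Relies on convert_tokens_to_ids, a direct vocabulary lookup, so no
    special tokens are added around the result.
    """
    token_id = tokenizer.convert_tokens_to_ids(lang_code)
    # An unknown code maps to the unknown-token ID; fail loudly rather than
    # silently forcing <unk> as the first generated token.
    if token_id is None or token_id == tokenizer.unk_token_id:
        raise ValueError(f"{lang_code!r} is not a token in this tokenizer's vocabulary")
    return token_id


# Intended usage with the objects from the reproduction (not executed here,
# since it needs the model download):
#   forced_bos_token_id = lang_code_to_token_id(tokenizer, "deu_Latn")
#   model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=30)
```

Calling tokenizer.convert_tokens_to_ids("deu_Latn") inline should work just as well; the wrapper only adds the check for a mistyped language code.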
System Info
transformers version: 4.42.0.dev0
Who can help?
@ArthurZucker
Reproduction
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn", token=token)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=token)

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
Expected behavior
It should output the translated text: UN-Chef sagt, es gibt keine militärische Lösung in Syrien
Complete error: