huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id' #31348

Open rajanish4 opened 4 weeks ago

rajanish4 commented 4 weeks ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn", token=token)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=token)

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Expected behavior

It should output translated text: UN-Chef sagt, es gibt keine militärische Lösung in Syrien

Complete error:

translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id'

ArthurZucker commented 4 weeks ago

Yes, we had a deprecation cycle and this attribute was removed 😉
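For anyone hitting the same error: since the mapping is gone, one replacement that should work is convert_tokens_to_ids, which is part of the standard tokenizer API. A minimal sketch, using the model name from the reproduction above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")

# Look the language code up directly instead of using the removed lang_code_to_id dict.
deu_id = tokenizer.convert_tokens_to_ids("deu_Latn")
print(deu_id)  # integer id of the 'deu_Latn' token in the NLLB vocabulary

The resulting id can then be passed to model.generate as forced_bos_token_id, as in the reproduction.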

rajanish4 commented 4 weeks ago

Thanks, but then how can I provide the language code for translation?

ArthurZucker commented 4 weeks ago

You should simply do tokenizer.encode("deu_Latn")[0]
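Plugged into the reproduction above, that suggestion would look roughly like this (a sketch only; the token=token argument is dropped here, and the follow-up comments below question whether index 0 is the right element to take):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Şeful ONU spune că nu există o soluţie militară în Siria", return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.encode("deu_Latn")[0],  # see later comments about this index
    max_length=30,
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])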

tokenizer-decode commented 1 week ago

Then why does the doc say otherwise? This is v4.42.0. I also don't understand how to use tokenizer.encode("deu_Latn")[0]. What's the keyword? Is this a positional argument? @ArthurZucker

fe1ixxu commented 6 days ago

It seems there is an error: whatever language code I give to the NLLB tokenizer, it always outputs the English token id. My version is v4.42.3 @ArthurZucker:

[screenshot: tokenizer output showing the English token id regardless of the language code passed]
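One way to see what the screenshot is showing is to decode the ids that encode returns. A sketch, assuming the non-legacy NLLB input format (source language code prepended, </s> appended) and the tokenizer's default source language of eng_Latn:

from transformers import AutoTokenizer

# No src_lang given, so the tokenizer falls back to its default ("eng_Latn").
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

ids = tokenizer.encode("deu_Latn")
print(tokenizer.convert_ids_to_tokens(ids))
# Expected output is something like ['eng_Latn', 'deu_Latn', '</s>']: index 0 is the
# source-language prefix token, which is why it looks like English no matter which
# code is passed in.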

ShayekhBinIslam commented 6 days ago

I think tokenizer.encode("deu_Latn")[0] is the regular BOS token, and tokenizer.encode("deu_Latn")[1] is the expected token. @ArthurZucker
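If that reading is right, taking element [1] and looking the code up with convert_tokens_to_ids should give the same id; a quick sketch to check:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn")

# These two lookups should agree if index 1 really holds the target language code.
print(tokenizer.encode("deu_Latn")[1])
print(tokenizer.convert_tokens_to_ids("deu_Latn"))

Either value can then be passed to model.generate as forced_bos_token_id.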