google-research / multilingual-t5

Clean model from Hugging Face returning only `<pad> <extra_id_0>.</s>` #116

Closed: spagnoloG closed 1 year ago

spagnoloG commented 1 year ago

When I load the model from Hugging Face and run generation, I get the response below. What am I doing wrong? Or is something wrong with the model uploaded to Hugging Face? I wanted to fine-tune it, but I cannot because it keeps returning this string.

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
/home/gasperspagnolo/Documents/stuff/testing/.venv/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
<pad> <extra_id_0>.</s>

Here is my minimal sample code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
prompt = "Can you translate the following paragraph into French? 'Climate change is one of the most significant challenges facing humanity. Its effects on agriculture are particularly concerning, as they threaten our ability to feed a growing population.'"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=100)
decoded_output = tokenizer.decode(output[0])
print(decoded_output)

spagnoloG commented 1 year ago

Duplicate of https://github.com/huggingface/transformers/issues/8704
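
The output above is consistent with how the checkpoint was trained. The `google/mt5-small` model card notes that mT5 was pre-trained only on mC4, with no supervised training, so it has to be fine-tuned before it is usable on a downstream task such as translation. The pre-training objective is span corruption, in which every target sequence begins with a sentinel token like `<extra_id_0>`. Below is a minimal sketch of that data format in plain Python. The `span_corrupt` helper and the example sentence are hypothetical and only for illustration; this is not the library's actual preprocessing code.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Build a T5-style (input, target) pair by masking the given token spans.

    Hypothetical helper for illustration. `spans` must be sorted,
    non-overlapping (start, end) index pairs with `end` exclusive.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Input keeps the surrounding text and replaces the span with a sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # Target repeats the sentinel, followed by the tokens that were removed.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week".split()
corrupted_input, target = span_corrupt(tokens, [(2, 4), (7, 8)])
print(corrupted_input)  # Thank you <extra_id_0> me to your <extra_id_1> last week
print(target)           # <extra_id_0> for inviting <extra_id_1> party <extra_id_2>
```

Because every pre-training target starts with `<extra_id_0>`, a checkpoint that has not been fine-tuned tends to emit that token first regardless of the instruction in the prompt. Fine-tuning on input/target pairs for the actual task is what is expected to change this.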