huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Blenderbot-3B config seems to be a little wrong #9357

Closed Narsil closed 3 years ago

Narsil commented 3 years ago

Environment info

It seems the current config of Blenderbot-3B is a bit broken (the Blenderbot-90M and distilled versions seem fine).


from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('facebook/blenderbot-90M')
tokenizer.decode(tokenizer.encode("Hey there"))
# 'hey there'  so working fine

tokenizer = AutoTokenizer.from_pretrained('facebook/blenderbot-3B')
tokenizer.decode(tokenizer.encode("Hey there"))
# '<unk> y <unk> e'   obvious error, since the token 'ĠHey' exists in the vocab. The error is possibly linked to the '@@' string terminator config

----
# Another example that is probably linked; it originally triggered the issue, so we need to make sure it's fixed too

nlp = pipeline('text-generation', model='facebook/blenderbot-3B')
nlp("Hey there")
# {"generated_text": "'ĠHi, Ġhow Ġare Ġyou Ġtoday? ĠI Ġjust Ġgot Ġback Ġfrom Ġa Ġwalk, Ġit Ġwas Ġnice."}
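For context on the stray Ġ characters: in GPT-2 style byte-level BPE vocabularies, Ġ (U+0120) stands for a leading space, so seeing it in the output suggests the tokens were joined without the byte-level decoding step. The sketch below only illustrates that mapping. The helper name `join_byte_level_tokens` and the token split are hypothetical, and a real byte-level decoder maps every character through a byte table rather than handling the space marker alone.

```python
SPACE_MARKER = "\u0120"  # the character 'Ġ', used for a leading space in byte-level BPE


def join_byte_level_tokens(tokens):
    """Concatenate tokens and turn each space marker back into a real space.

    Simplified illustration (assumption): only the space marker is handled.
    """
    return "".join(tokens).replace(SPACE_MARKER, " ").strip()


# Hypothetical token split, for illustration only.
tokens = ["ĠHi", ",", "Ġhow", "Ġare", "Ġyou", "Ġtoday", "?"]
print(join_byte_level_tokens(tokens))  # Hi, how are you today?
```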

Who can help

@patrickvonplaten @patil-suraj

Expected behavior

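The 3B tokenizer should round-trip text correctly: encoding and then decoding "Hey there" should give back the input (modulo casing, as with 90M) instead of <unk> tokens. The pipeline should also not output stray Ġ characters everywhere in the generated text.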
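A minimal sketch of how that expectation could be checked as a regression test. The helper `roundtrips` and the toy tokenizer are hypothetical stand-ins so the snippet runs offline; they are not the Blenderbot vocab. The comparison ignores case because the 90M tokenizer lowercases its input.

```python
from typing import Callable, List


def roundtrips(encode: Callable[[str], List[int]],
               decode: Callable[[List[int]], str],
               text: str) -> bool:
    """Return True if decode(encode(text)) matches text, ignoring case and outer whitespace."""
    restored = decode(encode(text))
    return restored.strip().lower() == text.strip().lower()


# Toy stand-in tokenizer (assumption, for illustration only).
vocab = {"hey": 0, "there": 1}
inverse = {i: w for w, i in vocab.items()}
UNK = -1


def toy_encode(text: str) -> List[int]:
    return [vocab.get(w, UNK) for w in text.lower().split()]


def toy_decode(ids: List[int]) -> str:
    return " ".join(inverse.get(i, "<unk>") for i in ids)


print(roundtrips(toy_encode, toy_decode, "Hey there"))   # True
print(roundtrips(toy_encode, toy_decode, "Hey friend"))  # False
```

With a real tokenizer, the same helper could presumably be called with `tokenizer.encode` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`.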

github-actions[bot] commented 3 years ago

This issue has been stale for 1 month.

Narsil commented 3 years ago

Closing this. Blenderbot-90M is very different in architecture from the other variants, so it will receive less love (it's not that powerful compared to the others anyway).

Also, a lot of work was done here: https://github.com/huggingface/transformers/pull/10002