huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
129.4k stars, 25.67k forks

Some community models are broken and can't be downloaded #3359

Closed (patrickvonplaten closed this issue 4 years ago)

patrickvonplaten commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Community Models

Language I am using the model on (English, Chinese ...): Multiple different ones

Quite a few community models can't be loaded. The stats are here:

Stats

  1. 68 models can't load either their config or their tokenizer:

    • a) 34 models can't even load their config file. The reasons for this are either:

      • i. 11/34: The model identifier is wrong, e.g. albert-large no longer exists; it seems to have been renamed to albert-large-v1. These models are listed online under a different name than the one they are saved under on AWS.

      • ii. 23/34: There is an unrecognized model_type in the config.json, e.g.

        "Error: Message: Unrecognized model in hfl/rbtl3. Should have a model_type key in its config.json, or contain one of the following strings in its name: t5, distilbert, albert, camembert, xlm-roberta, bart, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl "

    • b) 33 models can load their config, but cannot load their tokenizers. The error message is almost always the same:

TOK ERROR: clue/roberta_chinese_base tokenizer can not be loaded Message: Model name 'clue/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'clue/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

  2. For 162 models everything is fine!

The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
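To make the three buckets in the stats concrete, here is a minimal sketch of how such a survey could be run. It is not the script linked above: the function name `survey_models` and the bucket labels are made up for illustration, and it only checks whether loading succeeds (it does not compare against a default config). The loaders are passed in as plain callables so the logic can be tried without network access.

```python
from typing import Callable, Dict, Iterable, List


def survey_models(
    model_ids: Iterable[str],
    load_config: Callable[[str], object],
    load_tokenizer: Callable[[str], object],
) -> Dict[str, List[str]]:
    """Sort model identifiers into config failures, tokenizer failures, and OK."""
    report: Dict[str, List[str]] = {
        "config_error": [],
        "tokenizer_error": [],
        "ok": [],
    }
    for model_id in model_ids:
        # Step 1: a model whose config cannot be loaded is counted once,
        # and its tokenizer is not tried at all.
        try:
            load_config(model_id)
        except Exception:  # broad on purpose: any failure means "can't load"
            report["config_error"].append(model_id)
            continue
        # Step 2: the config loaded, so now check the tokenizer.
        try:
            load_tokenizer(model_id)
        except Exception:
            report["tokenizer_error"].append(model_id)
            continue
        report["ok"].append(model_id)
    return report
```

With transformers installed, `AutoConfig.from_pretrained` and `AutoTokenizer.from_pretrained` can be passed as the two loaders. Catching `Exception` is a deliberate simplification for a survey script; a real check would probably log the message as well, as the errors quoted above do.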

HOW-TO-FIX-STEPS (in the following order):

julien-c commented 4 years ago
albert-base
albert-large
albert-xlarge
albert-xxlarge
bert-base-multilingual-cased-finetuned-conll03-dutch
bert-base-multilingual-cased-finetuned-conll03-spanish
mlm-100-1280
mlm-17-1280
bertabs-finetuned-cnndm-extractive-abstractive-summarization
bertabs-finetuned-extractive-abstractive-summarization
bertabs-finetuned-xsum-extractive-abstractive-summarization
patrickvonplaten commented 4 years ago

UPDATE:

Stats

  1. 61 models can't load either their config or their tokenizer:

    • a) 23 models can't load their config file. The reason is an unrecognized model_type in the config.json, e.g.

      "Error: Message: Unrecognized model in hfl/rbtl3. Should have a model_type key in its config.json, or contain one of the following strings in its name: t5, distilbert, albert, camembert, xlm-roberta, bart, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl "

    • b) 38 models can load their config, but cannot load their tokenizers. The error message is always the same:

TOK ERROR: clue/roberta_chinese_base tokenizer can not be loaded Message: Model name 'clue/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'clue/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

  2. For 254 models everything is fine!

The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
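The config error quoted in 1a) describes a two-step lookup: use the model_type key if config.json has one, otherwise look for a known string in the model name. The sketch below is a simplified illustration of that lookup, not the library's actual code. The function name `guess_model_type` is hypothetical; the list of strings is copied from the error message in the order it gives them.

```python
from typing import Optional

# Substrings copied from the error message, in the order it lists them.
# The order matters: "distilbert", "albert" and "roberta" all contain or
# resemble "bert", so they have to be tried before the plain "bert" entry.
KNOWN_PATTERNS = [
    "t5", "distilbert", "albert", "camembert", "xlm-roberta", "bart",
    "roberta", "flaubert", "bert", "openai-gpt", "gpt2", "transfo-xl",
    "xlnet", "xlm", "ctrl",
]


def guess_model_type(model_id: str, config: dict) -> Optional[str]:
    """Return the model type from the config, else from the name, else None."""
    model_type = config.get("model_type")
    if model_type:
        return model_type
    for pattern in KNOWN_PATTERNS:
        if pattern in model_id:
            return pattern
    # Neither the config nor the name gives a hint, which is the 1a) case.
    return None
```

Under this sketch, `hfl/rbtl3` with no model_type key yields `None`, which matches the 1a) error. `clue/roberta_chinese_base` matches "roberta" by name, which would be consistent with the 1b) error asking for `vocab.json` and `merges.txt`.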

NEXT STEPS

1a) and 1b) cannot really be fixed by us: for 1a) we don't know which model_type is used, and for 1b), if the tokenizer does not work or does not exist, it should be fixed or uploaded by the author. These 61 models can probably still be used if the correct model class is used instead of AutoModel.from_pretrained(...).

We could contact the authors or add a warning sign to the model page.
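For users who hit one of these models in the meantime, the idea of using a specific class instead of the Auto class can be sketched as a small fallback helper. The helper name `load_with_fallback` is made up for illustration, and it takes the loaders as plain callables so it does not depend on any particular library version. Which concrete class is right for a given model is something only the model author can confirm.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    model_id: str,
    loaders: Sequence[Tuple[str, Callable[[str], T]]],
) -> Tuple[str, T]:
    """Try each (label, loader) pair in order and return the first success."""
    errors = []
    for label, loader in loaders:
        try:
            return label, loader(model_id)
        except Exception as exc:  # broad on purpose: any failure moves on
            errors.append(f"{label}: {exc}")
    # Every loader failed, so report all collected messages at once.
    raise RuntimeError(f"No loader worked for {model_id!r}: " + "; ".join(errors))
```

With transformers installed, the loaders could be, for example, `("auto", AutoTokenizer.from_pretrained)` followed by `("bert", BertTokenizer.from_pretrained)`. Whether the BERT tokenizer is actually the right one for a model such as `clue/roberta_chinese_base` is an assumption here, not something this thread confirms.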

liuchenbaidu commented 3 years ago

The problem with denpa92/bert-base-cantonese is not solved.

patrickvonplaten commented 3 years ago

Hey @liuchenbaidu, I'd recommend contacting the author of the model in this case.

XiangQinYu commented 3 years ago

When I used the ERNIE model pretrained by Baidu, I had the same problem. My solution was to add "model_type":"bert" to the configuration file. It worked, but I don't know if it's reasonable.

drussellmrichie commented 2 years ago

When I used the ERNIE model pretrained by Baidu, I had the same problem. My solution was to add "model_type":"bert" to the configuration file. It worked, but I don't know if it's reasonable.

Hi, @XiangQinYu. I'm a bit of a newbie with Huggingface. Can you say more about how you did this? I guess you mean adding "model_type":"bert" to a file like this. But how did you edit the file? Did you download the whole model repository, and edit and run it locally?

EDIT: Never mind, figured it out with the help of a commenter on a question I asked on SO.
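For anyone landing here with the same question, one way to apply the "model_type" workaround is to download the model files to a local directory and patch config.json there. This is a minimal sketch using only the standard library. The function name `add_model_type` is hypothetical, and "bert" as the default value is taken from the comment above, so it is an assumption that may not fit other checkpoints.

```python
import json
from pathlib import Path


def add_model_type(config_path: str, model_type: str = "bert") -> dict:
    """Add a "model_type" key to a local config.json if it has none."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Leave an existing value alone so a correct config is never overwritten.
    config.setdefault("model_type", model_type)
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

After patching, the local directory (not the hub identifier) can be passed to `from_pretrained(...)`, which accepts a path to a directory containing the model files. As the original comment says, this only makes the Auto classes pick an architecture; whether that architecture really matches the weights has to be checked separately.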