Closed: patrickvonplaten closed this issue 4 years ago.
1) a) i. is fixed (list of model ids below for reference):

- `albert-base`
- `albert-large`
- `albert-xlarge`
- `albert-xxlarge`
- `bert-base-multilingual-cased-finetuned-conll03-dutch`
- `bert-base-multilingual-cased-finetuned-conll03-spanish`
- `mlm-100-1280`
- `mlm-17-1280`
- `bertabs-finetuned-cnndm-extractive-abstractive-summarization`
- `bertabs-finetuned-extractive-abstractive-summarization`
- `bertabs-finetuned-xsum-extractive-abstractive-summarization`
61 models can't load either their config or their tokenizer:

a) 23 models can't load their config file. The reason is always an unrecognized `model_type` in the config.json, e.g.:

> Error: Message: Unrecognized model in hfl/rbtl3. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: t5, distilbert, albert, camembert, xlm-roberta, bart, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl
b) 38 models can load their config, but cannot load their tokenizers. The error message is always the same:

> TOK ERROR: clue/roberta_chinese_base tokenizer can not be loaded. Message: Model name 'clue/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'clue/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

i.e. the model has none of:

- `vocab_file`
- `added_tokens_file`
- `special_tokens_map_file`
- `tokenizer_config_file`

The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
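For context on case a): the "Unrecognized model" fallback quoted above is essentially a substring match on the model name. The following is a rough, hypothetical re-implementation of that heuristic (not the actual `transformers` code) that shows why `hfl/rbtl3` falls through while `clue/roberta_chinese_base` does not:

```python
# Order matters: more specific names come first, so "xlm-roberta" is
# checked before "roberta", and "distilbert"/"flaubert" before "bert".
FALLBACK_TYPES = [
    "t5", "distilbert", "albert", "camembert", "xlm-roberta", "bart",
    "roberta", "flaubert", "bert", "openai-gpt", "gpt2", "transfo-xl",
    "xlnet", "xlm", "ctrl",
]

def guess_model_type(name_or_path, config):
    """Mimic the AutoConfig fallback: use config['model_type'] if present,
    otherwise look for a known architecture substring in the model name."""
    if "model_type" in config:
        return config["model_type"]
    for candidate in FALLBACK_TYPES:
        if candidate in name_or_path:
            return candidate
    raise ValueError(f"Unrecognized model in {name_or_path}.")
```

So `hfl/rbtl3` (a RoBERTa-wwm variant, but with no architecture name in its id and no `model_type` in its config.json) cannot be dispatched, which is exactly the error quoted above.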
1a) and 1b) cannot really be fixed by us: for 1a) we don't know which `model_type` is meant, and for 1b), if the tokenizer does not work or does not exist, it should be fixed or uploaded by the author. These 61 models can probably still be used if the correct model class is used instead of `AutoModel.from_pretrained(...)`. We could contact the authors or add a warning sign to the model page.
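To triage case 1b) locally, one can check which of the expected tokenizer files a downloaded model directory actually contains. A minimal sketch, assuming the default file names (BERT-style models use `vocab.txt` instead of `vocab.json`):

```python
import os

# Logical names from the error report mapped to their default file names.
TOKENIZER_FILES = {
    "vocab_file": "vocab.json",  # assumption: a roberta/gpt2-style vocab
    "added_tokens_file": "added_tokens.json",
    "special_tokens_map_file": "special_tokens_map.json",
    "tokenizer_config_file": "tokenizer_config.json",
}

def missing_tokenizer_files(model_dir):
    """Return the logical names of tokenizer files absent from model_dir."""
    return [key for key, fname in TOKENIZER_FILES.items()
            if not os.path.isfile(os.path.join(model_dir, fname))]
```

Running this over a downloaded repo shows at a glance whether the author simply never uploaded the tokenizer files.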
The problem with denpa92/bert-base-cantonese is not solved.
Hey @liuchenbaidu, I'd recommend contacting the author of the model in this case.
When I used the ERNIE model pretrained by Baidu, I had the same problem. My solution was to add `"model_type": "bert"` to the configuration file. It worked, but I don't know if it's reasonable.
Hi @XiangQinYu, I'm a bit of a newbie with Hugging Face. Can you say more about how you did this? I guess you mean adding `"model_type": "bert"` to a file like this. But how did you edit the file? Did you download the whole model repository, then edit and run it locally?
EDIT: Nevermind, figured it out with help of a commenter on a question I asked on SO.
🐛 Bug
Information
Model I am using (Bert, XLNet ...): Community Models
Language I am using the model on (English, Chinese ...): Multiple different ones
Quite a few community models can't be loaded. The stats are below:
Stats
1) 68 models can't load either their config or their tokenizer:

a) 34 models can't even load their config file. The reasons for this are either:

i. 11/34: the model identifier is wrong, e.g. `albert-large` does not exist anymore; it seems it was renamed to `albert-large-v1`. These models are saved online under a different name than on AWS.

ii. 23/34: there is an unrecognized `model_type` in the config.json (see the `hfl/rbtl3` error message quoted above).

b) 33 models can load their config, but cannot load their tokenizers. The error message is almost always the same:

i. Here: the model has none of:

- `vocab_file`
- `added_tokens_file`
- `special_tokens_map_file`
- `tokenizer_config_file`

2) Some models have a wrong `pad_token_id`, `eos_token_id`, `bos_token_id` in their configs. IMPORTANT: the reason for this is that we used to have wrong defaults saved in `PretrainedConfig()`; see e.g. here: the default value of `pad_token_id` for any model was 0. People trained a model with the lib and saved it, so the resulting config.json had `pad_token_id = 0` saved. This was then uploaded. But it's wrong and should be corrected.

The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
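To make the bad-default problem in 2) concrete, here is a toy illustration (not the actual `PretrainedConfig` code): a class-level default of `pad_token_id = 0` gets serialized into every saved config, even for models that never set a pad token.

```python
import json

class ToyConfig:
    """Illustration only: mimics a config class whose old default was
    pad_token_id = 0 for *every* model, even ones with no pad token."""
    def __init__(self, **kwargs):
        self.pad_token_id = kwargs.pop("pad_token_id", 0)  # the bad default
        self.vocab_size = kwargs.pop("vocab_size", 30000)

    def to_json(self):
        return json.dumps(self.__dict__)

# A user who never touched pad_token_id still writes pad_token_id = 0 to disk:
saved = json.loads(ToyConfig().to_json())
```

Once such a config.json is uploaded, the wrong value looks like a deliberate user choice, which is why a correction script is needed.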
HOW-TO-FIX-STEPS (in the following order):
- [x] Fix 1 a) i. first: all models that have a wrong model identifier path should get the correct one. Need to update some model identifier paths on https://huggingface.co/models, like changing `bertabs-finetuned-xsum-extractive-abstractive-summarization` to `remi/bertabs-finetuned-xsum-extractive-abstractive-summarization`. Some of those errors are very weird, see #3358.
- [ ] Fix 1 a) ii.: should be quite easy to add the correct `model_type` to the config.json.
- [ ] Fix 1 b): not sure how to fix the missing tokenizer files most efficiently @julien-c
- [x] Fix 2): create an automated script that: if `tokenizer.pad_token_id != default_config.pad_token_id` -> `config.pad_token_id = tokenizer.pad_token_id`, else remove `pad_token_id`. `eos_token_ids` -> they don't exist anymore.
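The core of the Fix 2) script could look roughly like this sketch, with plain dicts standing in for the real tokenizer and config objects (names here are illustrative, not the actual script):

```python
def fix_pad_token_id(saved_config, tokenizer_pad_token_id, default_pad_token_id):
    """If the tokenizer disagrees with the default, trust the tokenizer;
    otherwise drop the key so the (corrected) library default applies."""
    if tokenizer_pad_token_id != default_pad_token_id:
        saved_config["pad_token_id"] = tokenizer_pad_token_id
    else:
        saved_config.pop("pad_token_id", None)
    # eos_token_ids no longer exists as a config field: always strip it.
    saved_config.pop("eos_token_ids", None)
    return saved_config
```

Running this over every uploaded config.json would strip the spurious `pad_token_id = 0` entries while preserving values that genuinely come from the tokenizer.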