Closed: patrickvonplaten closed this issue 4 years ago.
1) a) i. is fixed (list of model ids below for reference):

- `albert-base`
- `albert-large`
- `albert-xlarge`
- `albert-xxlarge`
- `bert-base-multilingual-cased-finetuned-conll03-dutch`
- `bert-base-multilingual-cased-finetuned-conll03-spanish`
- `mlm-100-1280`
- `mlm-17-1280`
- `bertabs-finetuned-cnndm-extractive-abstractive-summarization`
- `bertabs-finetuned-extractive-abstractive-summarization`
- `bertabs-finetuned-xsum-extractive-abstractive-summarization`
61 models can't load either their config or their tokenizer:

a) 23 models can't load their config file. The reason is always an unrecognized `model_type` in the config.json, e.g.:

> Error: Message: Unrecognized model in hfl/rbtl3. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: t5, distilbert, albert, camembert, xlm-roberta, bart, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl
b) 38 models can load their config, but cannot load their tokenizers. The error message is always the same:

> TOK ERROR: clue/roberta_chinese_base tokenizer can not be loaded. Message: Model name 'clue/roberta_chinese_base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'clue/roberta_chinese_base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

i.e. the model has none of:

- `vocab_file`
- `added_tokens_file`
- `special_tokens_map_file`
- `tokenizer_config_file`

The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
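For context on case a): the "Unrecognized model" fallback quoted above is essentially a substring match on the model name. The following is a rough, hypothetical re-implementation of that heuristic (not the actual `transformers` code) that shows why `hfl/rbtl3` falls through while `clue/roberta_chinese_base` does not:

```python
# Order matters: more specific names come first, so "xlm-roberta" is
# checked before "roberta", and "distilbert"/"flaubert" before "bert".
FALLBACK_TYPES = [
    "t5", "distilbert", "albert", "camembert", "xlm-roberta", "bart",
    "roberta", "flaubert", "bert", "openai-gpt", "gpt2", "transfo-xl",
    "xlnet", "xlm", "ctrl",
]

def guess_model_type(name_or_path, config):
    """Mimic the AutoConfig fallback: use config['model_type'] if present,
    otherwise look for a known architecture substring in the model name."""
    if "model_type" in config:
        return config["model_type"]
    for candidate in FALLBACK_TYPES:
        if candidate in name_or_path:
            return candidate
    raise ValueError(f"Unrecognized model in {name_or_path}.")
```

So `hfl/rbtl3` (a RoBERTa-wwm variant, but with no architecture name in its id and no `model_type` in its config.json) cannot be dispatched, which is exactly the error quoted above.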
1a) and 1b) cannot really be fixed by us: for 1a) we don't know which `model_type` is meant, and for 1b), if the tokenizer does not work or does not exist, it should be fixed or uploaded by the author. These 61 models can probably still be used if the correct model class is used instead of `AutoModel.from_pretrained(...)`. We could contact the authors or add a warning sign to the model page.
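To triage case 1b) locally, one can check which of the expected tokenizer files a downloaded model directory actually contains. A minimal sketch, assuming the default file names (BERT-style models use `vocab.txt` instead of `vocab.json`):

```python
import os

# Logical names from the error report mapped to their default file names.
TOKENIZER_FILES = {
    "vocab_file": "vocab.json",  # assumption: a roberta/gpt2-style vocab
    "added_tokens_file": "added_tokens.json",
    "special_tokens_map_file": "special_tokens_map.json",
    "tokenizer_config_file": "tokenizer_config.json",
}

def missing_tokenizer_files(model_dir):
    """Return the logical names of tokenizer files absent from model_dir."""
    return [key for key, fname in TOKENIZER_FILES.items()
            if not os.path.isfile(os.path.join(model_dir, fname))]
```

Running this over a downloaded repo shows at a glance whether the author simply never uploaded the tokenizer files.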
The problem with denpa92/bert-base-cantonese is not solved.
Hey @liuchenbaidu, I'd recommend contacting the author of the model in this case.
When I used the ERNIE model pretrained by Baidu, I had the same problem. My solution was to add `"model_type": "bert"` to the configuration file. It worked, but I don't know if it's reasonable.
Hi @XiangQinYu, I'm a bit of a newbie with Hugging Face. Can you say more about how you did this? I guess you mean adding `"model_type": "bert"` to a file like this. But how did you edit the file? Did you download the whole model repository, then edit and run it locally?
EDIT: Nevermind, figured it out with help of a commenter on a question I asked on SO.
🐛 Bug
Information
Model I am using (Bert, XLNet ...): Community Models
Language I am using the model on (English, Chinese ...): Multiple different ones
Quite a few community models can't be loaded. The stats are below:
Stats
1) 68 models can't load either their config or their tokenizer:

a) 34 models can't even load their config file. The reasons for this are either:

i. 11/34: the model identifier is wrong, e.g. `albert-large` does not exist anymore; it seems it was renamed to `albert-large-v1`. These models are saved online under a different name than on AWS.

ii. 23/34: there is an unrecognized `model_type` in the config.json (see the `hfl/rbtl3` error message quoted above).

b) 33 models can load their config, but cannot load their tokenizers. The error message is almost always the same:

i. Here: the model has none of:

- `vocab_file`
- `added_tokens_file`
- `special_tokens_map_file`
- `tokenizer_config_file`

2) Some models have a wrong `pad_token_id`, `eos_token_id`, `bos_token_id` in their configs. IMPORTANT: the reason for this is that we used to have wrong defaults saved in `PretrainedConfig()`; see e.g. here: the default value of `pad_token_id` for any model was 0. People trained a model with the lib and saved it, so the resulting config.json had `pad_token_id = 0` saved. This was then uploaded. But it's wrong and should be corrected.

The full analysis log is here. The code that created this log (a simple comparison of the loaded tokenizer and config with the default config) is here.
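To make the bad-default problem in 2) concrete, here is a toy illustration (not the actual `PretrainedConfig` code): a class-level default of `pad_token_id = 0` gets serialized into every saved config, even for models that never set a pad token.

```python
import json

class ToyConfig:
    """Illustration only: mimics a config class whose old default was
    pad_token_id = 0 for *every* model, even ones with no pad token."""
    def __init__(self, **kwargs):
        self.pad_token_id = kwargs.pop("pad_token_id", 0)  # the bad default
        self.vocab_size = kwargs.pop("vocab_size", 30000)

    def to_json(self):
        return json.dumps(self.__dict__)

# A user who never touched pad_token_id still writes pad_token_id = 0 to disk:
saved = json.loads(ToyConfig().to_json())
```

Once such a config.json is uploaded, the wrong value looks like a deliberate user choice, which is why a correction script is needed.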
HOW-TO-FIX-STEPS (in the following order):
- [x] Fix 1 a) i. first: all models that have a wrong model identifier path should get the correct one. Need to update some model identifier paths on https://huggingface.co/models, like changing `bertabs-finetuned-xsum-extractive-abstractive-summarization` to `remi/bertabs-finetuned-xsum-extractive-abstractive-summarization`. Some of those errors are very weird, see #3358.
- [ ] Fix 1 a) ii.: should be quite easy to add the correct `model_type` to the config.json.
- [ ] Fix 1 b): not sure how to fix the missing tokenizer files most efficiently @julien-c
- [x] Fix 2): create an automated script that: if `tokenizer.pad_token_id != default_config.pad_token_id` -> `config.pad_token_id = tokenizer.pad_token_id`, else remove `pad_token_id`. `eos_token_ids` -> they don't exist anymore.
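The core of the Fix 2) script could look roughly like this sketch, with plain dicts standing in for the real tokenizer and config objects (names here are illustrative, not the actual script):

```python
def fix_pad_token_id(saved_config, tokenizer_pad_token_id, default_pad_token_id):
    """If the tokenizer disagrees with the default, trust the tokenizer;
    otherwise drop the key so the (corrected) library default applies."""
    if tokenizer_pad_token_id != default_pad_token_id:
        saved_config["pad_token_id"] = tokenizer_pad_token_id
    else:
        saved_config.pop("pad_token_id", None)
    # eos_token_ids no longer exists as a config field: always strip it.
    saved_config.pop("eos_token_ids", None)
    return saved_config
```

Running this over every uploaded config.json would strip the spurious `pad_token_id = 0` entries while preserving values that genuinely come from the tokenizer.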