nbroad1881 opened 4 months ago
@OlivierDehaene,
This may be an issue with older models on the Hub, both for the tokenizer and the config.json.
Older BERT models won't have a tokenizer.json file.
SequenceClassification models won't have num_labels, id2label, or label2id in config.json.
Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?
> SequenceClassification models won't have num_labels, id2label, or label2id in config.json

Do you have an example?

> Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?

For tokenizer.json, TEI will not be able to replace it. For the other case, I'm not sure and would like to explore the examples to figure it out.
I had to make a pull request on this model to get it working with TEI: https://huggingface.co/ibm/re2g-reranker-nq
On second glance, this might be an anomaly. Other older models seem fine:
https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment/blob/main/config.json
https://huggingface.co/jb2k/bert-base-multilingual-cased-language-detection/blob/main/config.json
I have the same issue with old models that don't have a tokenizer.json. Is there any workaround for us to get a tokenizer.json? As far as I know, this file comes from the fast tokenizer classes: https://huggingface.co/docs/transformers/en/fast_tokenizers
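One workaround worth trying (hedged, not an official TEI feature): loading the model with `AutoTokenizer.from_pretrained(model_id)` and calling `save_pretrained` will usually convert the slow tokenizer to a fast one and write a tokenizer.json that can be uploaded to the repo in a PR. The offline sketch below shows the same idea with the `tokenizers` library directly, building a fast BERT WordPiece tokenizer from a vocab.txt; the tiny vocab here is a made-up stand-in for a real model's vocab file.

```python
import json
from tokenizers import BertWordPieceTokenizer

# Hypothetical tiny vocab standing in for the model's real vocab.txt.
with open("vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]))

# Build a fast tokenizer from the WordPiece vocab and serialize it.
tok = BertWordPieceTokenizer("vocab.txt", lowercase=True)
tok.save("tokenizer.json")

# The resulting file is the single-file format TEI expects.
with open("tokenizer.json") as f:
    data = json.load(f)
print(data["model"]["type"])
```

The generated tokenizer.json can then be added to the model repository via a pull request, which is essentially what was done for the models mentioned above.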
> SequenceClassification models won't have num_labels, id2label, or label2id in config.json
>
> Do you have an example?

@OlivierDehaene How about this: https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco/blob/main/config.json
Attempting to run this model with TEI yields the following error:
Error: `config.json` does not contain `id2label`
For info, below is the command that I ran:
model=amberoad/bert-multilingual-passage-reranking-msmarco
volume=$PWD/models
docker run -p 8088:80 -v $volume:/models --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.1.0 --model-id $model
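For configs like this, the missing fields can probably be added in a model-repo PR by deriving them from `num_labels`, mirroring how transformers defaults labels to `LABEL_0`, `LABEL_1`, etc. A minimal sketch (the dict below is a hypothetical stand-in, not the actual amberoad config.json):

```python
import json

# Stand-in for a config.json that has num_labels but no id2label/label2id.
config = {"architectures": ["BertForSequenceClassification"], "num_labels": 2}

# Derive the missing mappings using the standard LABEL_<i> naming convention.
num_labels = config.get("num_labels", 2)
config.setdefault("id2label", {str(i): f"LABEL_{i}" for i in range(num_labels)})
config.setdefault("label2id", {label: int(i) for i, label in config["id2label"].items()})

print(json.dumps(config, indent=2, sort_keys=True))
```

Writing the patched dict back to config.json and opening a PR on the model page should satisfy TEI's `id2label` check, assuming TEI accepts the default label names.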
System Info
When trying to use this model (ibm/re2g-reranker-trex) in TEI, it errors because there is no tokenizer.json file. If I call AutoTokenizer.from_pretrained("ibm/re2g-reranker-trex"), the tokenizer is created without any issues. I have opened a pull request on the model page to include the tokenizer.json file, but I'm wondering if something should/could be done on the TEI side.
Information
Tasks
Reproduction
Expected behavior
No error