tokenizer.json required for TEI?

huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models

https://huggingface.co/docs/text-embeddings-inference/quick_tour

Apache License 2.0


tokenizer.json required for TEI? #169

Open nbroad1881 opened 4 months ago

nbroad1881 commented 4 months ago

System Info

Running the model ibm/re2g-reranker-trex in TEI fails because the model repository has no tokenizer.json file. Calling AutoTokenizer.from_pretrained("ibm/re2g-reranker-trex") in transformers creates the tokenizer without any issues.

I have opened a pull request on the model page to include the tokenizer.json file, but I'm wondering if something should/could be done on the TEI side.

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

model=ibm/re2g-reranker-trex 
volume=$PWD/data 

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model

Expected behavior

No error

nbroad1881 commented 4 months ago

@OlivierDehaene,

This may be an issue with older models on the hub both for the tokenizer and the config.json.

Older BERT models won't have a tokenizer.json file.

SequenceClassification models won't have num_labels, id2label, or label2id in config.json

Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?

OlivierDehaene commented 4 months ago

> SequenceClassification models won't have num_labels, id2label, or label2id in config.json

Do you have an example?

> Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?

For tokenizer.json, TEI will not be able to replace it. For the other case I'm not sure and would like to explore the examples to figure it out.

nbroad1881 commented 4 months ago

I had to make a pull request on this model to get it working with TEI: https://huggingface.co/ibm/re2g-reranker-nq

nbroad1881 commented 4 months ago

On second glance, this might be an anomaly. Other older models seem fine:

https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment/blob/main/config.json
https://huggingface.co/jb2k/bert-base-multilingual-cased-language-detection/blob/main/config.json

AndrewNgo-ini commented 4 months ago

I have the same issue: older models do not have a tokenizer.json. Is there any workaround for us to get a "tokenizer.json"? As far as I know, this file comes from the fast tokenizer classes: https://huggingface.co/docs/transformers/en/fast_tokenizers
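One possible workaround, as a minimal sketch rather than a confirmed fix: the transformers library can usually convert a slow tokenizer to a fast one on load, and saving the fast tokenizer writes a tokenizer.json. The helper names `has_fast_tokenizer` and `export_fast_tokenizer` and the output directory are hypothetical, and the conversion only works for tokenizer classes that have a slow-to-fast converter (BERT WordPiece tokenizers do).

```python
from pathlib import Path


def has_fast_tokenizer(model_dir: str) -> bool:
    """Return True if model_dir already contains a tokenizer.json file."""
    return (Path(model_dir) / "tokenizer.json").is_file()


def export_fast_tokenizer(model_id: str, out_dir: str) -> bool:
    """Load the tokenizer for model_id as a fast tokenizer and save it.

    Returns True if a tokenizer.json ended up in out_dir.
    Hypothetical helper: not part of TEI or transformers.
    """
    # Imported lazily so has_fast_tokenizer works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    if not tokenizer.is_fast:
        # No slow-to-fast conversion was available for this tokenizer class.
        return False

    # For fast tokenizers, save_pretrained writes tokenizer.json
    # alongside the legacy vocabulary files.
    tokenizer.save_pretrained(out_dir)
    return has_fast_tokenizer(out_dir)
```

Calling `export_fast_tokenizer("ibm/re2g-reranker-trex", "./re2g-tokenizer")` would then be expected to leave a tokenizer.json in `./re2g-tokenizer`, which could be added to the model repository through a pull request or used from a local copy of the model.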

w3iw3i commented 3 months ago

> > SequenceClassification models won't have num_labels, id2label, or label2id in config.json
>
> Do you have an example?

@OlivierDehaene How about this one: https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco/blob/main/config.json

Attempting to run this model with TEI yields the following error:

Error: `config.json` does not contain `id2label`

For reference, this is the command that I ran:

model=amberoad/bert-multilingual-passage-reranking-msmarco
volume=$PWD/models

docker run -p 8088:80 -v $volume:/models --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.1.0 --model-id $model
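As a possible stopgap for the `id2label` error, a local copy of config.json could be patched with the generic label maps that transformers itself falls back to (`LABEL_0`, `LABEL_1`, ...). This is only a sketch: the function names are hypothetical, `num_labels` has to be chosen to match the model's classification head, and nothing in this thread confirms that TEI serves the model correctly once the keys are present.

```python
import json
from pathlib import Path


def add_default_labels(config: dict, num_labels: int) -> dict:
    """Return a copy of config with id2label/label2id filled in if absent.

    Hypothetical helper; num_labels is an assumption the caller must verify.
    """
    patched = dict(config)
    if "id2label" not in patched:
        # JSON object keys are strings, so the ids are stored as "0", "1", ...
        patched["id2label"] = {str(i): f"LABEL_{i}" for i in range(num_labels)}
    if "label2id" not in patched:
        # Invert whatever id2label is present (existing or freshly added).
        patched["label2id"] = {
            label: int(idx) for idx, label in patched["id2label"].items()
        }
    return patched


def patch_config_file(path: str, num_labels: int) -> None:
    """Rewrite a local config.json in place with the label maps added."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    patched = add_default_labels(config, num_labels)
    config_path.write_text(json.dumps(patched, indent=2), encoding="utf-8")
```

Existing `id2label` or `label2id` entries are left untouched, so running the patch on a config that already has them changes nothing.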