michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License

Deberta v3 not working #241

Closed Stealthwriter closed 1 month ago

Stealthwriter commented 1 month ago

System Info

I started the server on RunPod with DeBERTa v3, but I got this output and the model didn't work:

```
root@eb4c9177bc5e:/workspace# infinity_emb v2 --model-id microsoft/deberta-v3-large --port 8000
INFO 2024-06-01 20:01:40,608 datasets INFO: PyTorch version 2.3.0 available. config.py:58
INFO: Started server process [1091]
INFO: Waiting for application startup.
INFO 2024-06-01 20:01:41,678 infinity_emb INFO: model=microsoft/deberta-v3-large selected, using engine=torch and device=None select_model.py:54
INFO 2024-06-01 20:01:41,848 sentence_transformers.SentenceTransformer INFO: Use pytorch device_name: cuda SentenceTransformer.py:188
INFO 2024-06-01 20:01:41,849 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: microsoft/deberta-v3-large SentenceTransformer.py:196
WARNING 2024-06-01 20:01:41,934 sentence_transformers.SentenceTransformer WARNING: No sentence-transformers model found with name microsoft/deberta-v3-large. Creating a new one with mean pooling. SentenceTransformer.py:1298
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:560: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
INFO 2024-06-01 20:01:44,327 infinity_emb INFO: Adding optimizations via Huggingface optimum. acceleration.py:25
WARNING 2024-06-01 20:01:44,329 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2Model'> Continue without bettertransformer modeling code. acceleration.py:36
INFO 2024-06-01 20:01:44,330 infinity_emb INFO: Switching to half() precision (cuda: fp16). sentence_transformer.py:73
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO 2024-06-01 20:01:44,944 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=1 select_model.py:77
  0.60 ms tokenization
  17.30 ms inference
  0.08 ms post-processing
  17.99 ms total
  embeddings/sec: 1778.96
INFO 2024-06-01 20:01:45,633 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=512 select_model.py:83
  14.28 ms tokenization
  317.46 ms inference
  0.26 ms post-processing
  332.00 ms total
  embeddings/sec: 96.39
INFO 2024-06-01 20:01:45,635 infinity_emb INFO: model warmed up, between 96.39-1778.96 embeddings/sec at batch_size=32 select_model.py:84
INFO 2024-06-01 20:01:45,636 infinity_emb INFO: creating batching engine batch_handler.py:291
INFO 2024-06-01 20:01:45,638 infinity_emb INFO: ready to batch requests. batch_handler.py:354
INFO 2024-06-01 20:01:45,640 infinity_emb INFO: infinity_server.py:56

     ♾️  Infinity - Embedding Inference Server
     MIT License; Copyright (c) 2023 Michael Feil
     Version 0.0.39

     Open the Docs via Swagger UI:
     http://0.0.0.0:8000/docs

     Access model via 'GET':
     curl http://0.0.0.0:8000/models

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 100.64.0.28:42100 - "GET / HTTP/1.1" 307 Temporary Redirect
INFO: 100.64.0.28:42100 - "GET /docs HTTP/1.1" 200 OK
INFO: 100.64.0.23:49954 - "GET / HTTP/1.1" 307 Temporary Redirect
INFO: 100.64.0.23:49954 - "GET /docs HTTP/1.1" 200 OK
INFO: 100.64.0.23:49954 - "GET /openapi.json HTTP/1.1" 200 OK
INFO: 100.64.0.23:51264 - "POST /classify HTTP/1.1" 400 Bad Request
INFO: 100.64.0.23:34484 - "POST /classify HTTP/1.1" 400 Bad Request
INFO: 100.64.0.23:54440 - "POST /classify HTTP/1.1" 400 Bad Request
```

Information

Tasks

Reproduction

```
infinity_emb v2 --model-id microsoft/deberta-v3-large
```

Expected behavior

I expected it to work with the `/classify` endpoint, since it's a classification model.

michaelfeil commented 1 month ago

Hey!

This model does not have a classification head, or an `id2label` mapping in its config: https://huggingface.co/microsoft/deberta-v3-large/blob/main/config.json

You may therefore be able to use it for embeddings, but it is unlikely to perform well there. Essentially it's a model for mask filling, which is not a downstream task that is interesting to support in infinity.
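For anyone who hits the same 400s: you can sanity-check a model's `config.json` before pointing infinity at `/classify`. A minimal sketch (the `supports_classify` helper and the sample config dicts below are my own illustration, not part of infinity; real configs contain many more fields):

```python
def supports_classify(config: dict) -> bool:
    """Heuristic check on a parsed Hugging Face config.json: a model is
    usable for classification only if it declares a sequence-classification
    architecture and a concrete label mapping."""
    architectures = config.get("architectures", [])
    has_cls_head = any(a.endswith("ForSequenceClassification") for a in architectures)
    # Base models such as microsoft/deberta-v3-large ship no id2label at all.
    has_labels = "id2label" in config
    return has_cls_head and has_labels


# Shape of microsoft/deberta-v3-large's config: a masked-LM base model,
# no classification head, no labels (abridged).
deberta_cfg = {"model_type": "deberta-v2", "hidden_size": 1024}

# Shape of a typical fine-tuned classifier config (label names illustrative).
classifier_cfg = {
    "model_type": "deberta-v2",
    "architectures": ["DebertaV2ForSequenceClassification"],
    "id2label": {"0": "negative", "1": "positive"},
}

print(supports_classify(deberta_cfg))     # False
print(supports_classify(classifier_cfg))  # True
```

If the check fails for the model you want, a fine-tuned variant with a classification head (one whose config lists a `...ForSequenceClassification` architecture) is the thing to serve instead.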