michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License

ValueError: No onnx files found #225

Open netw0rkf10w opened 1 month ago

netw0rkf10w commented 1 month ago

Hello,

First of all thank you very much for this tool!

I am trying it out (on CPU) with the following code:

import asyncio
import os

import numpy as np
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

DEVICE = os.environ.get("DEVICE", "cpu")
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path=MODEL_NAME,
        device=DEVICE,
        batch_size=1,
        lengths_via_tokenize=False,
        model_warmup=True,
        engine="torch" if DEVICE.startswith("cuda") else "optimum",
    )
)

async def encode_infinity(sentences: list[str]):
    return np.array((await engine.embed(sentences))[0])

asyncio.run(encode_infinity(["Hello"]))

and obtained the following error:

 File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/engine.py", line 62, in from_args
    engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/engine.py", line 48, in __init__
    self._model, self._min_inference_t, self._max_inference_t = select_model(
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/inference/select_model.py", line 62, in select_model
    loaded_engine = unloaded_engine.value(engine_args=engine_args)
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/transformer/embedder/optimum.py", line 38, in __init__
    onnx_file = get_onnx_files(
  File "/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/infinity_emb/transformer/utils_optimum.py", line 202, in get_onnx_files
    raise ValueError(
ValueError: No onnx files found for sentence-transformers/all-MiniLM-L6-v2 and revision None

The documentation says that any Sentence Transformers model can be used, but that doesn't seem to be the case. I guess I'll have to convert the weights to ONNX manually and place them somewhere for it to work?

Thank you in advance for your help!

michaelfeil commented 1 month ago

Yeah, you need an ONNX model, e.g. https://huggingface.co/Xenova/all-MiniLM-L6-v2
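
If you'd rather convert the weights yourself, Hugging Face Optimum can export the checkpoint to ONNX. A minimal sketch, assuming optimum[onnxruntime] is installed; the save directory is just an illustrative path matching the snippet further down:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
save_dir = "onnx_models/sentence-transformers/all-MiniLM-L6-v2"

# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained(save_dir)      # writes model.onnx plus config files
tokenizer.save_pretrained(save_dir)  # tokenizer files next to the model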

michaelfeil commented 1 month ago

Does this work @netw0rkf10w ?

netw0rkf10w commented 1 month ago

@michaelfeil Thanks for your reply. I managed to get it working, but the latency is far too high; something must be wrong:

import os
import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs
from sentence_transformers import SentenceTransformer

DEVICE = os.environ.get("DEVICE", "cpu")
MODEL_NAME = 'onnx_models/sentence-transformers/all-MiniLM-L6-v2'
engine = AsyncEmbeddingEngine.from_args(
        EngineArgs(
            model_name_or_path=MODEL_NAME,
            device=DEVICE,
            batch_size=1,
            lengths_via_tokenize=False,
            model_warmup=True,
            engine="torch" if DEVICE.startswith("cuda") else "optimum",
        )
    )

async def encode_infinity(sentences: list[str]):
    async with engine: # engine starts with engine.astart()
        embeddings, usage = await engine.embed(sentences)
    return embeddings

async def test_infty(sentences):
    start = time.monotonic()
    embeddings_inf = await encode_infinity(sentences)
    print('infinity time: ', time.monotonic() - start)

def test_sbert(sentences):
    model_minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    model_minilm.eval()
    start = time.time()
    embeddings = model_minilm.encode(sentences)
    print('sbert time: ', time.time() - start)

if __name__ == "__main__":
    sentences = ["Un avion est en train de décoller.",
            "Un homme joue d'une grande flûte.",
            "Un homme étale du fromage râpé sur une pizza.",
            "Une personne jette un chat au plafond.",
            "Une personne est en train de plier un morceau de papier.",
            ]
    asyncio.run(test_infty(sentences))
    test_sbert(sentences)

Running on a CPU-only machine, I obtained:

infinity time:  1.0698386139999911
sbert time:  0.09578514099121094

What did I do wrong, please? The full output is appended below for reference. Thanks!

$ python infty.py 
INFO     2024-05-18 11:58:42,796 datasets INFO: PyTorch version 2.3.0+cpu available.                                                                         config.py:58
INFO     2024-05-18 11:58:44,357 infinity_emb INFO: model=`onnx_models/sentence-transformers/all-MiniLM-L6-v2` selected, using engine=`optimum` and    select_model.py:54
         device=`cpu`                                                                                                                                                    
INFO     2024-05-18 11:58:44,360 infinity_emb INFO: Found 2 onnx files:                                                                              utils_optimum.py:193
         [PosixPath('onnx_models/sentence-transformers/all-MiniLM-L6-v2/model_optimized.onnx'),                                                                          
         PosixPath('onnx_models/sentence-transformers/all-MiniLM-L6-v2/model.onnx')]                                                                                     
INFO     2024-05-18 11:58:44,362 infinity_emb INFO: Using onnx_models/sentence-transformers/all-MiniLM-L6-v2/model.onnx as the model                 utils_optimum.py:197
INFO     2024-05-18 11:58:44,364 infinity_emb INFO: Optimized model found at onnx_models/sentence-transformers/all-MiniLM-L6-v2/model_optimized.onnx, utils_optimum.py:99
         skipping optimization                                                                                                                                           
INFO     2024-05-18 11:58:44,691 infinity_emb INFO: Getting timings for batch_size=1 and avg tokens per sentence=3                                     select_model.py:77
                 0.16     ms tokenization                                                                                                                                
                 7.08     ms inference                                                                                                                                   
                 0.15     ms post-processing                                                                                                                             
                 7.39     ms total                                                                                                                                       
         embeddings/sec: 135.29                                                                                                                                          
INFO     2024-05-18 11:58:45,270 infinity_emb INFO: Getting timings for batch_size=1 and avg tokens per sentence=512                                   select_model.py:83
                 1.99     ms tokenization                                                                                                                                
                 282.79   ms inference                                                                                                                                   
                 0.17     ms post-processing                                                                                                                             
                 284.95   ms total                                                                                                                                       
         embeddings/sec: 3.51                                                                                                                                            
INFO     2024-05-18 11:58:45,273 infinity_emb INFO: model warmed up, between 3.51-135.29 embeddings/sec at batch_size=1                                select_model.py:84
INFO     2024-05-18 11:58:45,276 infinity_emb INFO: creating batching engine                                                                         batch_handler.py:291
INFO     2024-05-18 11:58:45,278 infinity_emb INFO: ready to batch requests.                                                                         batch_handler.py:354
infinity time:  1.0698386139999911
INFO     2024-05-18 11:58:46,347 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer:                          SentenceTransformer.py:113
         sentence-transformers/all-MiniLM-L6-v2                                                                                                                          
/home/all/miniconda3/envs/env2/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO     2024-05-18 11:58:47,485 sentence_transformers.SentenceTransformer INFO: Use pytorch device_name: cpu                                  SentenceTransformer.py:219
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.77it/s]
sbert time:  0.09578514099121094
michaelfeil commented 1 month ago

In this case, you're starting / stopping the engine inside the timed section. Instead of async with, you can also call engine.astart() and engine.astop() yourself. The engine startup / shutdown should account for most of the measured time.

netw0rkf10w commented 1 month ago

@michaelfeil Thanks, but could you please tell me how to do it correctly? I couldn't find it in the doc, sorry.

michaelfeil commented 1 month ago

Updated the docs and the readme, @netw0rkf10w! Note that it should not be significantly faster for a single embedding of one short sentence. Expect significant speedups for large batches / long sequences.

import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="michaelfeil/bge-small-en-v1.5", engine="optimum")
)

async def main():
    async with engine:  # the context manager starts and stops the engine
        embeddings, usage = await engine.embed(sentences=sentences)
    # or handle the async start / stop yourself:
    await engine.astart()
    t_start = time.time()
    embeddings, usage = await engine.embed(sentences=sentences)
    print(time.time() - t_start)
    await engine.astop()

asyncio.run(main())
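
To see those speedups in a measurement, keep the engine started outside the timed section and embed a larger batch. A hypothetical benchmark sketch (the batch size of 256 and the repeated sentence are illustrative, not from this thread):

import asyncio
import time

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="michaelfeil/bge-small-en-v1.5", engine="optimum")
)

async def bench():
    await engine.astart()  # start once, outside the timed section
    sentences = ["Paris is in France."] * 256  # larger batch amortizes overhead
    t0 = time.time()
    embeddings, usage = await engine.embed(sentences=sentences)
    print(f"{len(sentences)} embeddings in {time.time() - t0:.3f}s")
    await engine.astop()

asyncio.run(bench())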