michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, and CLIP.
https://michaelfeil.github.io/infinity/
MIT License

Issue running cross-encoder onnx model exported with optimum-cli #361

Open rawsh opened 4 days ago

rawsh commented 4 days ago

System Info

Python 3.10, infinity-emb 0.0.55

Running with the optimum engine fails:

INFO     2024-09-13 15:17:02,874 datasets INFO: PyTorch version 2.4.0 available.                                                            config.py:59
INFO:     Started server process [76741]
INFO:     Waiting for application startup.
INFO     2024-09-13 15:17:03,950 infinity_emb INFO: model=`rawsh/ms-marco-TinyBERT-L-2-ONNX` selected, using engine=`optimum` and     select_model.py:62
         device=`cpu`                                                                                                                                   
INFO     2024-09-13 15:17:04,356 infinity_emb INFO: Optimized model found at                                                        utils_optimum.py:120
         /Users/robert/.cache/huggingface/hub/infinity_onnx/CPUExecutionProvider/rawsh/ms-marco-TinyBERT-L-2-ONNX/model_optimized.onnx, skipping optimization
ERROR:    Traceback (most recent call last):
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/infinity_server.py", line 63, in lifespan
    app.engine_array = AsyncEngineArray.from_args(engine_args_list)  # type: ignore
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/engine.py", line 259, in from_args
    return cls(engines=tuple(engines))
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/engine.py", line 67, in from_args
    engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/engine.py", line 53, in __init__
    self._model, self._min_inference_t, self._max_inference_t = select_model(
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/inference/select_model.py", line 76, in select_model
    loaded_engine.warmup(batch_size=engine_args.batch_size, n_tokens=1)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/transformer/abstract.py", line 170, in warmup
    return run_warmup(self, inp)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/transformer/abstract.py", line 178, in run_warmup
    embed = model.encode_core(feat)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/infinity_emb/transformer/crossencoder/optimum.py", line 78, in encode_core
    outputs = self.model(**features, return_dict=True)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/optimum/modeling_base.py", line 99, in __call__
    return self.forward(*args, **kwargs)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 1460, in forward
    onnx_inputs = self._prepare_onnx_inputs(use_torch, **model_inputs)
  File "/Users/robert/Library/Caches/pypoetry/virtualenvs/genai-toolbox-JUYepP8o-py3.10/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 943, in _prepare_onnx_inputs
    if onnx_inputs[input_name].dtype != self.input_dtypes[input_name]:
AttributeError: 'NoneType' object has no attribute 'dtype'
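For context, a minimal sketch of the check that fails inside optimum's `_prepare_onnx_inputs`. The specific missing input is an assumption (the log does not name it): if the exported graph declares an input such as `token_type_ids` that the tokenized features dict never supplies, the lookup returns `None` and `None.dtype` raises exactly this `AttributeError`.

```python
import numpy as np

# Hypothetical input names declared by the exported ONNX graph; the features
# dict below deliberately lacks token_type_ids to mimic the suspected mismatch.
onnx_input_names = ["input_ids", "attention_mask", "token_type_ids"]
features = {
    "input_ids": np.array([[101, 2023, 102]]),
    "attention_mask": np.array([[1, 1, 1]]),
}

# Same pattern as optimum: build the input dict from the graph's input names.
onnx_inputs = {name: features.get(name) for name in onnx_input_names}
try:
    for name in onnx_input_names:
        _ = onnx_inputs[name].dtype  # None for token_type_ids
except AttributeError as err:
    msg = str(err)
    print(msg)  # 'NoneType' object has no attribute 'dtype'
```

This is only a reproduction of the failure pattern, not a confirmation of which input is missing in the real export.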

Reproduction

Converted the model to ONNX (rawsh/ms-marco-TinyBERT-L-2-ONNX) with:

optimum-cli export onnx --model cross-encoder/ms-marco-TinyBERT-L-2-v2 ms-marco-tinybert
huggingface-cli upload rawsh/ms-marco-TinyBERT-L-2-ONNX ms-marco-tinybert .

(Unrelated: I can't figure out how to run a local model.)

Run with:

infinity_emb v2 --model-id rawsh/ms-marco-TinyBERT-L-2-ONNX --device cpu --engine optimum

Expected behavior

No error when running with the optimum engine.

michaelfeil commented 4 days ago

Has Xenova prepared this model for ONNX?

rawsh-rubrik commented 4 days ago

@michaelfeil Yes, that model also throws the same error for me.