michaelfeil / infinity

Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali
https://michaelfeil.github.io/infinity/
MIT License
1.49k stars 115 forks

Startup fails when using engine optimum with device tensorrt #372

Open weibingo opened 2 months ago

weibingo commented 2 months ago

System Info

Command: infinity_emb v2 --model_id /home/xxxx/peg_onnx --served-model-name embedding --engine optimum --device tensorrt --batch-size 32
OS: linux
model_base: PEG
nvidia-smi: CUDA version 11.8
tensorrt: 8.6.1

Reproduction

1. Just start the server with the command above.

Expected behavior

python3.10/dist-packages/optimum/onnxruntime/model_ort.py line 1444, in forward
    model_outputs = self.__prepare_onnx_outputs(use_torch, **onnx_outputs)
python3.10/dist-packages/optimum/onnxruntime/modeling_ort.py line 939 in __prepare_onnx_outputs
    model_outputs[output_name]=onnx_outputs[idx]
IndexError: tuple index out of range
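For readers unfamiliar with this error, here is a minimal sketch of how an `IndexError: tuple index out of range` can arise when named outputs are looked up by position in a tuple that is shorter than expected. The function `prepare_outputs` and the output names are hypothetical stand-ins for illustration, not optimum's actual implementation.

```python
def prepare_outputs(output_names, onnx_outputs):
    """Map each expected output name to the tuple entry at its index.

    Hypothetical simplification of the pattern in the traceback:
    model_outputs[output_name] = onnx_outputs[idx]
    """
    model_outputs = {}
    for output_name, idx in output_names.items():
        model_outputs[output_name] = onnx_outputs[idx]
    return model_outputs


# Assumed output names, chosen only for the example.
names = {"last_hidden_state": 0, "pooler_output": 1}

# Works when the runtime returns one entry per expected name.
ok = prepare_outputs(names, ("hidden", "pooled"))

# Fails when the runtime returns fewer entries than expected.
try:
    prepare_outputs(names, ("hidden",))
    error = None
except IndexError as exc:
    error = str(exc)
```

If this is what happens here, the second inference would be returning fewer outputs than the model declares, but the thread does not confirm that.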

I then logged the model's inputs and outputs at run time. During model warmup, the first inference succeeds and the second one fails. If I start with --no-model-warmup, the server starts, but the second inference still fails.
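The "first call succeeds, second call fails" pattern can be checked with a small harness like the one below. This is only a sketch: `first_failing_call` is a hypothetical helper, and the fake `flaky_infer` stands in for a real model call, which is not reproduced here.

```python
from typing import Any, Callable, Optional, Tuple


def first_failing_call(
    infer: Callable[[], Any], attempts: int = 3
) -> Optional[Tuple[int, Exception]]:
    """Call `infer` repeatedly and return (call_number, exception)
    for the first call that raises, or None if every call succeeds."""
    for call_number in range(1, attempts + 1):
        try:
            infer()
        except Exception as exc:  # broad on purpose: we only want to locate the failure
            return call_number, exc
    return None


# Fake inference function that mimics the reported behavior:
# the first call works, every later call raises IndexError.
calls = {"count": 0}


def flaky_infer() -> str:
    calls["count"] += 1
    if calls["count"] > 1:
        raise IndexError("tuple index out of range")
    return "embedding"


result = first_failing_call(flaky_infer)
```

With a real setup, `infer` would wrap one forward pass of the loaded model on a fixed input, so the same check can be repeated with and without a given provider option.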

michaelfeil commented 2 months ago

@weibingo Any chance you have a similar model from huggingface? Are you using the optimum-gpu or the tensorrt backend? Are you sure tensorrt is correctly installed?

weibingo commented 2 months ago

> @weibingo Any chance you have a similar model from huggingface? Are you using optimum-gpu or tensorrt backend? Are you sure tensorrt is correctly installed?

Yes. The model works fine with optimum on CUDA. The TensorRT environment also had errors, but I resolved them. I tested embedder.optimum.py by initializing OptimumEmbedder directly, and the error still occurs. Then I looked at the source code: in utils_optimum.py, the TensorrtExecutionProvider options work when trt_cuda_graph_enable is left out. But I don't understand why it works without trt_cuda_graph_enable and fails with it.
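The workaround described above amounts to building the TensorRT provider options without `trt_cuda_graph_enable`. Below is a minimal sketch of that idea. The helper `tensorrt_providers`, its defaults, and the cache path are assumptions made for illustration; the option keys are ONNX Runtime TensorRT execution provider options, but the exact set that infinity's utils_optimum.py passes is not shown in this thread.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def tensorrt_providers(
    enable_cuda_graph: bool = False,
    cache_dir: str = "./trt_cache",  # hypothetical cache location
) -> List[Provider]:
    """Build an ONNX Runtime providers list for TensorRT with CUDA fallback.

    CUDA graph capture is opt-in here, so it can be switched off
    when it causes failures like the one reported in this issue.
    """
    trt_options: Dict[str, Any] = {
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": cache_dir,
    }
    if enable_cuda_graph:
        trt_options["trt_cuda_graph_enable"] = True
    return [("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider"]


# Configuration matching the reported workaround: no CUDA graph capture.
providers = tensorrt_providers(enable_cuda_graph=False)
```

A list like this is what `onnxruntime.InferenceSession(model_path, providers=providers)` accepts. Toggling `enable_cuda_graph` between runs is one way to confirm that the option is what triggers the failure.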

michaelfeil commented 2 months ago

@weibingo No idea why CUDA graph capture does not work. I have not used TensorRT much; it only gave marginal performance gains over onnx-gpu.

weibingo commented 2 months ago

@michaelfeil So you don't test with engine optimum and device tensorrt?

michaelfeil commented 2 months ago

@weibingo It's not possible to test this in CI (which is CPU-only), and I have not used it locally in the last 3 months. Before that, it was extensively tested with TensorRT 8.6.1.