urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Onnx converted model has slower inference #191

Open · yogitavm opened this issue 5 days ago

yogitavm commented 5 days ago

I fine-tuned the GLiNER small v2.1 model and created an ONNX version of the same model using the convert_to_onnx.ipynb example code. When I compared the inference time of the two models, the ONNX version took 50% more time.

This is how I'm loading the model:

```python
model = GLiNER.from_pretrained(model_path, load_onnx_model=True, load_tokenizer=True)
```
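For reference, this is roughly how I timed the two (a minimal sketch; model_path, the labels, and the test text are placeholders, and it assumes the standard predict_entities API):

```python
import time

from gliner import GLiNER

# Placeholder path to the fine-tuned checkpoint (with the ONNX export alongside it).
model_path = "path/to/finetuned-gliner-small-v2.1"
labels = ["person", "organization", "location"]
text = "Satya Nadella, the CEO of Microsoft, visited the Seattle office in March."

pt_model = GLiNER.from_pretrained(model_path)
onnx_model = GLiNER.from_pretrained(model_path, load_onnx_model=True, load_tokenizer=True)

def avg_ms(model, runs=50):
    model.predict_entities(text, labels)  # warm-up call so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(runs):
        model.predict_entities(text, labels)
    return (time.perf_counter() - start) / runs * 1000

print(f"PyTorch: {avg_ms(pt_model):.1f} ms/call")
print(f"ONNX:    {avg_ms(onnx_model):.1f} ms/call")
```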

Ingvarstep commented 14 hours ago

From my experiments, ONNX models work faster for sequences shorter than 124 words. With longer input sequences, attention becomes the limiting factor and ONNX is not necessarily more efficient than PyTorch. The main purpose of ONNX is to make it easier to convert models between frameworks and to run them in other environments. If you need efficient inference on CPU, I would recommend trying GLiNER.cpp; it is consistently faster than PyTorch and gives up to 2x acceleration.
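If you want to see where the crossover sits for your fine-tuned model, a quick sweep over input lengths should show it. A minimal sketch (model_path, labels, and the base sentence are placeholders; it assumes the standard predict_entities API):

```python
import time

from gliner import GLiNER

model_path = "path/to/finetuned-gliner-small-v2.1"  # placeholder
labels = ["person", "organization", "location"]
base = "Satya Nadella met investors at the Microsoft office in Seattle. "  # 10 words

pt_model = GLiNER.from_pretrained(model_path)
onnx_model = GLiNER.from_pretrained(model_path, load_onnx_model=True, load_tokenizer=True)

def avg_ms(model, text, runs=20):
    model.predict_entities(text, labels)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        model.predict_entities(text, labels)
    return (time.perf_counter() - start) / runs * 1000

# Sweep word counts around the ~124-word crossover by repeating the base sentence.
for n_words in (30, 60, 120, 250, 500):
    text = base * (n_words // len(base.split()))
    print(f"{n_words:>3} words: PyTorch {avg_ms(pt_model, text):6.1f} ms"
          f" | ONNX {avg_ms(onnx_model, text):6.1f} ms")
```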