Open yogitavm opened 5 days ago
From my experiments, ONNX models run faster for sequences shorter than 124 words. With a longer input sequence, attention becomes the limiting factor and ONNX is not necessarily more efficient than PyTorch. The main purpose of ONNX is to enable easier conversion of models between different frameworks and running in other environments. If you need efficient inference on CPU, I would recommend trying GLiNER.cpp; it is consistently faster than PyTorch and enables up to 2x acceleration.
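If the ~124-word crossover holds for your workload, one workaround is to split long inputs into chunks below that threshold before running ONNX inference. A minimal sketch, with a hypothetical `chunk_words` helper (the chunk size and overlap values are assumptions, not part of GLiNER's API):

```python
def chunk_words(text: str, max_words: int = 120, overlap: int = 10):
    """Split text into overlapping chunks of at most `max_words` words.

    Overlap reduces the chance of an entity being cut at a chunk
    boundary; both values here are illustrative assumptions.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be passed to the model separately, keeping every sequence in the regime where ONNX was observed to be faster.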
I finetuned the gliner small v2.1 model and created an ONNX version of the same model using the convert_to_onnx.ipynb example code. When I compared the inference time of both models, the ONNX version took 50% more time.
This is how I'm loading the model: `model = GLiNER.from_pretrained(model_path, load_onnx_model=True, load_tokenizer=True)`
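One thing worth ruling out when comparing the two variants is measurement methodology: the first ONNX calls pay one-time session and graph initialization costs, so a single-shot timing can make ONNX look much slower than it is at steady state. A minimal sketch of a fairer comparison, with warmup runs and a median over repeated calls (the `predict_entities` calls in the usage comment assume both models are already loaded):

```python
import time

def benchmark(fn, warmup: int = 3, runs: int = 10) -> float:
    """Return the median wall-clock time of fn() over `runs` timed calls,
    after `warmup` untimed calls to absorb one-time initialization."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

# Usage (assumes `pt_model` and `onnx_model` are loaded GLiNER instances
# and `text`, `labels` are your inputs):
# pt_t = benchmark(lambda: pt_model.predict_entities(text, labels))
# ox_t = benchmark(lambda: onnx_model.predict_entities(text, labels))
```

Comparing medians on identical inputs of a fixed length also makes it easier to see where the crossover between the two backends sits for your sequence lengths.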