siddharth-sharma7 / fast-Bart

Convert BART models to ONNX with quantization: ~3x reduction in size and up to a 3x boost in inference speed.
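For context, below is a minimal sketch of dynamic INT8 quantization with onnxruntime, which is the kind of step the conversion performs; it assumes the BART encoder/decoder graphs have already been exported to ONNX, and the file paths are placeholders rather than the repo's actual names:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as INT8, activations are quantized at runtime.
# Paths are placeholders; run once per exported graph (encoder, decoder, ...).
quantize_dynamic(
    model_input="bart-encoder.onnx",             # placeholder: exported FP32 graph
    model_output="bart-encoder-quantized.onnx",  # placeholder: quantized output
    weight_type=QuantType.QInt8,                 # 8-bit weights -> roughly 3-4x smaller file
)
```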

Inference slower than Pytorch model for long sequence length #3

Open jasontian6666 opened 2 years ago

jasontian6666 commented 2 years ago

Hi @siddharth-sharma7

Thank you for providing fast-bart. It has made my life much easier.

I find the bart-onnx-quantized model 2-3x faster than the PyTorch model. However, when the sequence length is long (~500 tokens), the ONNX-based model is 1.5-2x slower.
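To make the comparison concrete, here is a minimal sketch of the kind of timing I mean; the model name, generation settings, and the commented-out fast-Bart helper are assumptions, not the exact script I used:

```python
import time
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
pt_model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

# Assumed fast-Bart entry point (check the repo README for the actual name):
# from fastBart import export_and_get_onnx_model
# onnx_model = export_and_get_onnx_model("facebook/bart-base")

def avg_generate_time(model, text, n_runs=5):
    """Average wall-clock latency of generate() over n_runs, after one warm-up call."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    model.generate(**inputs, max_length=64, num_beams=4)  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_length=64, num_beams=4)
    return (time.perf_counter() - start) / n_runs

short_text = "word " * 50   # short input: where ONNX is clearly faster
long_text = "word " * 500   # ~500-token input: where the slowdown shows up

for label, text in [("short", short_text), ("long", long_text)]:
    print(f"{label:>5} | pytorch: {avg_generate_time(pt_model, text):.3f} s")
    # print(f"{label:>5} | onnx:    {avg_generate_time(onnx_model, text):.3f} s")
```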

I also found a similar problem with the T5 ONNX model, which has been discussed at https://github.com/microsoft/onnxruntime/issues/6835#:~:text=the%20converted%20t5%20onnx%20model,and%20higher%20beam%2Dsearch%20number.

Just wondering if we're facing the same issue here.

sidsharma72 commented 2 years ago

In my experiments with longer input sequences (~500 tokens), the ONNX performance is only slightly slower than that of the PyTorch model, if not similar. The performance gains of ONNX over PyTorch do diminish for longer sequences, especially above ~400 tokens.