siddharth-sharma7 / fast-Bart

Convert BART models to ONNX with quantization: ~3x reduction in size and up to a 3x boost in inference speed.
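For context, below is a minimal sketch of dynamic INT8 quantization with onnxruntime, which is the kind of step the conversion performs; it assumes the BART encoder/decoder graphs have already been exported to ONNX, and the file paths are placeholders rather than the repo's actual names:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as INT8, activations are quantized at runtime.
# Paths are placeholders; run once per exported graph (encoder, decoder, ...).
quantize_dynamic(
    model_input="bart-encoder.onnx",             # placeholder: exported FP32 graph
    model_output="bart-encoder-quantized.onnx",  # placeholder: quantized output
    weight_type=QuantType.QInt8,                 # 8-bit weights -> roughly 3-4x smaller file
)
```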

Inference slower than Pytorch model for long sequence length #3

Open jasontian6666 opened 2 years ago

jasontian6666 commented 2 years ago

Hi @siddharth-sharma7

Thank you for providing fast-bart. It has made my life much easier.

I find the bart-onnx-quantized model 2-3x faster than the PyTorch model. However, when the sequence length is long (~500 tokens), the ONNX-based model is 1.5-2x slower.
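To make the comparison concrete, here is a minimal sketch of the kind of timing I mean; the model name, generation settings, and the commented-out fast-Bart helper are assumptions, not the exact script I used:

```python
import time
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
pt_model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

# Assumed fast-Bart entry point (check the repo README for the actual name):
# from fastBart import export_and_get_onnx_model
# onnx_model = export_and_get_onnx_model("facebook/bart-base")

def avg_generate_time(model, text, n_runs=5):
    """Average wall-clock latency of generate() over n_runs, after one warm-up call."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    model.generate(**inputs, max_length=64, num_beams=4)  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_length=64, num_beams=4)
    return (time.perf_counter() - start) / n_runs

short_text = "word " * 50   # short input: where ONNX is clearly faster
long_text = "word " * 500   # ~500-token input: where the slowdown shows up

for label, text in [("short", short_text), ("long", long_text)]:
    print(f"{label:>5} | pytorch: {avg_generate_time(pt_model, text):.3f} s")
    # print(f"{label:>5} | onnx:    {avg_generate_time(onnx_model, text):.3f} s")
```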

I also found a similar problem with the T5 ONNX model, which has been discussed at https://github.com/microsoft/onnxruntime/issues/6835#:~:text=the%20converted%20t5%20onnx%20model,and%20higher%20beam%2Dsearch%20number.

Just wondering if we're facing the same issue here.

sidsharma72 commented 2 years ago

In my experiments with longer input sequences (~500 tokens), the ONNX performance is only slightly slower than that of the PyTorch model, if not similar. The performance gains of ONNX over PyTorch do diminish for longer sequences, especially above ~400 tokens.