microsoft / onnxruntime


Using tensorrt provider occasionally see dramatically increased inference time #10159

Open brevity2021 opened 2 years ago

brevity2021 commented 2 years ago

Update 2022-01-03: This original post can be seen as background information. The issue originally described here is solved by using TensorRT 8.2 (see Reply 1 below), but the dramatically increased inference time (described in Reply 2) is concerning to us. Can anyone help?

Describe the bug I was trying a summarization task using encoder/decoder ONNX models exported from the Hugging Face pegasus-xsum checkpoint, with the TensorRT or CUDA execution provider. The results are fine when using the CUDA provider, but when I switch to the TensorRT provider, the decoder results have errors: all the next-token logits appear to be the same. (When running only the encoder part, both providers produce the same result, though.)

I wonder if there is anything wrong with my setup, or could this be an issue with TensorRT? I attach the detailed steps in the "To Reproduce" section and the errors in the "Additional context" section. Thanks a lot!

System information

To Reproduce This was done in an NVIDIA TensorRT 21.10-py3 container (with TensorRT 8.0), with onnxruntime 1.10 installed by pip, on an AWS p3.2xlarge instance (with one NVIDIA V100 GPU).

  1. Export the ONNX models from the Hugging Face pegasus-xsum checkpoint using this script in a Jupyter notebook.
  2. Run this script to generate a few test summarized sentences. Turn --use_tensorrt on/off to use the TensorRT provider or the CUDA provider (a minimal sketch of this switch is shown after this list).
  3. When inspecting the printed debug information, the TensorRT provider generates the same encoder results as the CUDA provider for every sentence, but the decoder results are not correct. We print the top 5 log_softmax values of the next-token logits: the TensorRT provider gives the same result for every sentence, e.g. [[-11.473176 -11.473176 -11.473176 -11.473176 -11.473176]], while the CUDA provider gives reasonable results, e.g. [[-0.8437815 -3.0716448 -3.7427797 -3.8304396 -3.8826933]].
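For reference, here is a minimal, hypothetical sketch of the provider switch and the top-5 log_softmax printout described in steps 2-3, assuming the standard onnxruntime Python API; the function names and model file name are illustrative only, since the actual scripts are just linked above.

```python
# Hypothetical sketch; only the provider names and the onnxruntime API are real.
import numpy as np
import onnxruntime as ort

def make_session(model_path: str, use_tensorrt: bool) -> ort.InferenceSession:
    # The TensorRT EP falls back to CUDA (and then CPU) for unsupported subgraphs.
    if use_tensorrt:
        providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    else:
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=providers)

def top5_log_softmax(next_token_logits: np.ndarray) -> np.ndarray:
    # Numerically stable log_softmax over the vocabulary axis, then the top 5 values
    # (the quantity compared in step 3). Input shape: (1, vocab_size).
    m = next_token_logits.max(axis=-1, keepdims=True)
    log_probs = next_token_logits - m - np.log(np.exp(next_token_logits - m).sum(axis=-1, keepdims=True))
    return np.sort(log_probs, axis=-1)[:, ::-1][:, :5]

decoder = make_session("pegasus_decoder.onnx", use_tensorrt=True)  # hypothetical file name
```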

Expected behavior The TensorRT provider should output the same results as the CUDA provider.

Additional context When using the TensorRT provider, there are warnings printed regarding the decoder input shapes:

2021-12-30 20:48:00.654247260 [W:onnxruntime:Default, tensorrt_execution_provider.h:53 log] [2021-12-30 20:48:00 WARNING] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
2021-12-30 20:48:00.654271023 [W:onnxruntime:Default, tensorrt_execution_provider.h:53 log] [2021-12-30 20:48:00 WARNING] (# 1 (SHAPE encoder_hidden_states))
2021-12-30 20:48:00.654281071 [W:onnxruntime:Default, tensorrt_execution_provider.h:53 log] [2021-12-30 20:48:00 WARNING] (# 1 (SHAPE input_ids))
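The two dynamic values named in the warning (the shapes of encoder_hidden_states and input_ids) most likely come from the dynamic sequence-length axes declared when the decoder was exported. A hedged sketch of what such an export typically looks like is below; the input names match the warning, while the wrapper module, example inputs, file name, and opset are assumptions, since the export notebook is only linked above.

```python
# Sketch of a decoder export with two independently dynamic sequence lengths.
import torch

torch.onnx.export(
    decoder_wrapper,                      # hypothetical nn.Module wrapping the Pegasus decoder
    (input_ids, encoder_hidden_states),   # example inputs
    "pegasus_decoder.onnx",               # hypothetical file name
    input_names=["input_ids", "encoder_hidden_states"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "decoder_seq_len"},
        "encoder_hidden_states": {0: "batch", 1: "encoder_seq_len"},
    },
    opset_version=13,
)
```

Because the two sequence lengths can vary independently at generation time, TensorRT's Myelin backend warns that performance may degrade when the two dynamic values differ.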

There are also some warnings regarding MatMul nodes, e.g.

Warning: Incompatible mask dimensions in fully-connected op between: (Unnamed Layer_ 449) [Softmax]_output'ForeignNode[5060 (Unnamed Layer_ 599) [Shuffle]_MatMul487]:f32,trA=false,[-1,4,3] and 957'ForeignNode[5060 (Unnamed Layer_ 599) [Shuffle]_MatMul487]:f32,trB=false,[-1,-1,-1].

No such warnings were produced when using the CUDA provider.

brevity2021 commented 2 years ago

After switching to TensorRT 8.2 (container 21.12-py3) + onnxruntime 1.10.0, the incorrect results seem to go away, so it does seem to be a TensorRT issue.

The warnings posted above are still there, though. I wonder if this is expected?

brevity2021 commented 2 years ago

Another issue I noticed (which might be more concerning to us than the original one): although we get a speedup most of the time with the TensorRT provider, there are a few cases where the inference time increases dramatically.

Here are my inference time results on a ~2500-sentence test set using the same code: the first sentence takes around 3,900,000 ms, and the 2nd/3rd/4th/5th each take > 100,000 ms. Starting from the 6th sentence, the time settles to around 100-200 ms, but it occasionally jumps back above 100,000 ms for the following sentences: 13, 14, 15, 79, 141, 258, 1557, 1652, 2160. It doesn't seem to be directly related to sentence length, as the word counts of those sentences are 187, 41, 42, 37, 257, 250, 186, 37, 34 (some of them are not particularly long). Generally the sentences in the test set are 30-250 words long.

I wonder what might be causing this occasional increase in inference time? Is it TensorRT re-warming up? Is it more likely a TensorRT issue or an ONNX Runtime issue?
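One possible explanation (an assumption on my part, not a confirmed diagnosis): with dynamic input shapes, the TensorRT execution provider rebuilds its engine when an input falls outside the shape range it has built for, and a rebuild can take many seconds to minutes, which would match the spikes above. Enabling the engine cache at least persists built engines across runs. A minimal sketch for onnxruntime 1.10, where the TensorRT EP is configured through ORT_TENSORRT_* environment variables (set before the session is created):

```python
# Hedged sketch: enable the TensorRT EP engine cache to reuse built engines.
import os
import onnxruntime as ort

os.environ["ORT_TENSORRT_ENGINE_CACHE_ENABLE"] = "1"
# Cache directory; the exact variable name for the path has changed between ORT releases.
os.environ["ORT_TENSORRT_CACHE_PATH"] = "./trt_cache"

sess = ort.InferenceSession(
    "pegasus_decoder.onnx",  # hypothetical file name
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
```

Whether the spikes come from ONNX Runtime's engine management or from TensorRT itself is something the maintainers would need to confirm.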

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.