TensorrtExecutionProvider slower than CUDAExecutionProvider: Transformers

oborchers commented 3 years ago

EDIT: Provided proper notebook to replicate issue and dockerfile

Describe the bug The CUDAExecutionProvider provides a 2x speedup over the TensorrtExecutionProvider, which is completely counterintuitive.

Urgency None

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
ONNX Runtime installed from (source or binary): Source
ONNX Runtime version: 1.7.1
Python version: 3. 8
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 11.1.1 / 8.0.5
GPU model and memory: Nvidia V100

To Reproduce

1) Build the following Dockerfile (may take some time)

https://github.com/oborchers/Medium_Repo/blob/master/onnxruntime-issues/Dockerfile

2) Run the following notebook:

https://github.com/oborchers/Medium_Repo/blob/master/onnxruntime-issues/TensorRT%20Slow.ipynb

Expected behavior At least to my intuition, TensorRT should be faster.

pranavsharma commented 3 years ago

@stevenlix can you please take a look? Thx.

stevenlix commented 3 years ago

TensorRT usually takes much longer time to build engine. Please add a warmup session (i.e. call sess.run once) before inference, so that engine can be built in advance. Another thing is if the model can't run as a whole in TensorRT, you may add CUDA as a fallback, for example, sess = rt.InferenceSession(str(model_pth), opt, providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])

oborchers commented 3 years ago

@stevenlix : Thank you for your comments! I have gone through all of those possibilities and the results are exactly the same. Furthermore, I provided a notebook for exact reproduction of the behavior here: https://github.com/oborchers/Medium_Repo/blob/master/onnxruntime-issues/TensorRT%20Slow.ipynb Dockerfile can be found here: https://github.com/oborchers/Medium_Repo/blob/master/onnxruntime-issues/Dockerfile

The speeds look as follows:

Torch: 95 sentences/s
CUDA + CPU: 384 sentences/s
CUDA: 370 sentences/s
Tensorrt + CUDA + CPU: 91 sentences/s
Tensorrt: 91 sentences/s

Additional Info:

Tensorrt runs have been performed with 50 warmup iterations
GPU usage via nvidia-smi
- Torch: fluctuates around 25%
- CUDA+CPU: fluctuates around 65-75%
- Tensorrt: wildly varies between 0 to 30-40%, and maxing at 70% while going back to 0%. Rinse, repeat.

This is also something I have experience half a year ago while trying this, but postponed investigation as I assumed I just did something wrong. But now that I am able replicate the exact same problem in Docker, I can safely discard a problem on the side on our master server configuration.

GPUs otherwise not occupied.

stevenlix commented 3 years ago

If sequence length of the model is dynamic, you may need to build an engine in warmup to cover the shape range, otherwise TensorRT will rebuild engine during inference every time when new shape inputs are coming. For example, if sequence length varies from A to B, in warmup you may call session.run once with input shape[1,A] and call another session.run with input shape[1, B], then the engine will be used for any sequence length between A and B, and will not be rebuilt during inference.

oborchers commented 3 years ago

@stevenlix:

This partially addresses the issue. I tried to do exactly that, and this has the side effect that the GPU usage now remains constantly at 85-90%. Encoding 10,000 sentences took 38s at 263 sentences/s. However, I am still nowhere near to beating base CUDA, which is residing at around 400-450 sentences/s.

Next I tried to pad all inputs to the maximum length and sub-lengths. The following padding sizes result in the following performance:

128: 90% GPU usage @ 192 sentences/s
256: 92% GPU usage @ 128 sentences/s
512: 95% GPU usage @ 75 sentences/s

I also played around with the environment variables, but this does not seem to have any effect:

export ORT_TENSORRT_MAX_BATCH_SIZE=1
export ORT_TENSORRT_MAX_WORKSPACE_SIZE=8_589_934_592
export ORT_TENSORRT_MAX_PARTITION_ITERATIONS=40
export ORT_TENSORRT_MIN_SUBGRAPH_SIZE=1

mudong0419 commented 3 years ago

I have the same problem. When inference with Bert(HuggingFace), TensorrtExecutionProvider is slower than CUDAExecutionProvider.

LeeJiangWei commented 2 years ago

I have the same problem. When inference with Bert(HuggingFace), TensorrtExecutionProvider is slower than CUDAExecutionProvider.

Same here using HugginceFace models. It's a warm-up issue as @stevenlix said, and it will be much faster the second time you call sess.run().

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

microsoft / onnxruntime

TensorrtExecutionProvider slower than CUDAExecutionProvider: Transformers #7230