Eichhof opened this issue 1 year ago
Only 2 GB of GPU memory is used, so I assume the model is running on the CPU and not on the GPU. CUDAExecutionProvider does use the GPU, but I can't use it because I run out of memory with fp32 (I have a 24 GB GPU and need fp16). How can I check whether TensorrtExecutionProvider is using the GPU? I was following the guidelines at https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
Thank you, it is unfortunate that you are running into these issues. I will have a look this week and extend the documentation with benchmarks. One thing I want to do is provide a reproducible benchmark script to track memory usage and speed. I agree that shipping a release with new features without battle-testing them is not good.
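In the meantime, a quick way to see which execution providers actually got registered on a session (and roughly how long session creation takes and how much GPU memory is held) is a small check like the following sketch. The ONNX file name is a guess at what the export produces, and pynvml is an extra dependency:

```python
import time

import onnxruntime as ort
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

start = time.perf_counter()
sess = ort.InferenceSession(
    "gptj_onnx/decoder_model.onnx",  # hypothetical file name; adjust to the actual export output
    providers=[
        ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(f"InferenceSession creation took {time.perf_counter() - start:.0f} s")

# If TensorRT was registered successfully it is listed first; otherwise ONNX
# Runtime silently falls back to the next provider in the list.
print(sess.get_providers())
print(f"GPU memory in use: {pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 1e9:.1f} GB")
```

Cross-checking with nvidia-smi while the session runs is another simple way to confirm the GPU is actually being used.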
I'm using CUDA 11.6 with cuDNN 8.7.0. Why is there a warning that TensorRT was linked against cuDNN 8.6.0 but loaded cuDNN 8.3.2? How can I make it use cuDNN 8.7.0?
I am not sure. I would recommend using Docker; for example, I use nvcr.io/nvidia/tensorrt:22.08-py3 from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt.
It takes around 10 min to load the model. I assume this is because the model is optimized while loading (and cast to fp16). Is it possible to speed up model loading?
Do you mean just creating the InferenceSession? Have you tried running trtexec to see if it takes as long? I will look into this as well.
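For reference, a bare trtexec build of the exported decoder could look roughly like this (the file name is hypothetical, and the dynamic input axes of the decoder will likely also need explicit shape ranges). If this also takes around 10 minutes, the time is going into the TensorRT engine build itself:

```bash
# Build a TensorRT engine for the exported decoder in fp16.
trtexec --onnx=gptj_onnx/decoder_model.onnx --fp16
# For models with dynamic axes, shape ranges usually have to be given explicitly,
# e.g. via --minShapes / --optShapes / --maxShapes.
```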
> I am not sure. I would recommend using Docker; for example, I use nvcr.io/nvidia/tensorrt:22.08-py3 from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt.
I have never used Docker before. Does this work on Windows? How can I run my script in this Docker container? In addition, my script requires a lot of Python packages. Can I install them inside the container somehow?
> Do you mean just creating the InferenceSession? Have you tried running trtexec to see if it takes as long? I will look into this as well.
Creating the InferenceSession took that long for TensorrtExecutionProvider. For CUDAExecutionProvider, on the other hand, it is very fast. I think the problem is described here.
@Eichhof I recommend using Docker because getting matching CUDA + cuDNN + TensorRT versions is kind of painful, and Nvidia provides Docker images that are straightforward to use. I think it should work on Windows as well!
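For completeness, a typical way to run a script with its dependencies inside that image looks roughly like the following (the script and requirements file names are placeholders, and on Windows the volume-mount syntax differs slightly):

```bash
# Start the NGC TensorRT container with GPU access and mount the current
# directory (containing the script and its requirements) into /workspace.
docker run --gpus all -it --rm -v "$(pwd)":/workspace -w /workspace \
    nvcr.io/nvidia/tensorrt:22.08-py3

# Inside the container: install the Python dependencies, then run the script.
pip install -r requirements.txt
python my_script.py
```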
System Info
Who can help?
@JingyaHuang @echarlaix
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I installed Optimum with python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime-gpu]. Then I ran python -m optimum.exporters.onnx --atol=1e-4 --for-ort --task causal-lm-with-past --model EleutherAI/gpt-j-6B gptj_onnx/ to export GPT-J to ONNX. To use the model, I used the following code, and got the output shown below:
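A minimal sketch of what loading and running the exported model along these lines can look like, following the GPU usage guide linked earlier in the thread (the prompt and generation length are made up, and the exact from_pretrained arguments depend on the Optimum version):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Load the exported GPT-J decoder with the TensorRT execution provider in fp16,
# using the merged decoder (use_merged=True) mentioned below.
model = ORTModelForCausalLM.from_pretrained(
    "gptj_onnx/",
    use_merged=True,
    provider="TensorrtExecutionProvider",
    provider_options={"trt_fp16_enable": True},
)

tokenizer = AutoTokenizer.from_pretrained("gptj_onnx/")
# Note: the device these inputs live on, relative to the model's device,
# is what the generate() warning quoted below refers to.
inputs = tokenizer("Hello, my name is", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```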
I'm facing several issues:
- There is the warning: You are calling .generate() with the input_ids being on a device type different than your model's device.
- With provider_options=dict(trt_fp16_enable=True) and the new use_merged=True, the model takes around 40 GB of CPU memory. Is this expected?
- How can I check that the model is actually running on the GPU with TensorrtExecutionProvider?

Expected behavior
Fast inference of GPT-J on the GPU with TensorrtExecutionProvider.