huggingface / optimum

πŸš€ Accelerate training and inference of πŸ€— Transformers and πŸ€— Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Slow generation and high memory usage for GPT-J #792

Open Eichhof opened 1 year ago

Eichhof commented 1 year ago

System Info

optimum @ git+https://github.com/huggingface/optimum.git@e156282ffd8587df8422bdf53295880ae881b353
Python: 3.10.4
Platform: Windows 10
Cuda: 11.6
cuDNN: 8.7.0.

Who can help?

@JingyaHuang @echarlaix

Information

Tasks

Reproduction

I installed optimum with `python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime-gpu]`. Then I ran `python -m optimum.exporters.onnx --atol=1e-4 --for-ort --task causal-lm-with-past --model EleutherAI/gpt-j-6B gptj_onnx/` to export GPT-J to ONNX. To use the model, I used the following code:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# GPT-J uses the GPT-2 end-of-text token; it is reused here as the pad token.
gpt_eos = "<|endoftext|>"
max_length = 2048  # maximum prompt length in tokens (placeholder for self.max_length)

tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/gpt-j-6B",
    pad_token=gpt_eos,
    eos_token=gpt_eos,
    truncation_side='left',
)
model = ORTModelForCausalLM.from_pretrained(
    "C:/Users/myUsername/Desktop/gptj_onnx",
    provider="TensorrtExecutionProvider",
    provider_options=dict(trt_fp16_enable=True),
    use_merged=True,
)

text = "Hello, my name is"  # placeholder prompt

prompt = tokenizer(text, return_tensors='pt', truncation="only_first", max_length=max_length)
prompt = {key: value for key, value in prompt.items()}  # plain dict of input tensors
out = model.generate(
    **prompt,
    min_length=16,
    max_new_tokens=40,
    do_sample=True,
    top_k=35,
    top_p=0.6,
    temperature=1,
    no_repeat_ngram_size=4,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
)
res = tokenizer.decode(out[0])

The output is then the following:

2023-02-18 12:24:44.4692281 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:24:44 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2023-02-18 12:24:44.4859866 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:24:44 WARNING] hDebInfo\_deps\onnx_tensorrt-src\onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2023-02-18 12:24:58.6815842 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:24:58 WARNING] hDebInfo\_deps\onnx_tensorrt-src\onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
2023-02-18 12:30:09.2244893 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:30:09 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2023-02-18 12:30:09.7332640 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:30:09 WARNING] hDebInfo\_deps\onnx_tensorrt-src\onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
2023-02-18 12:33:47.0862867 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:33:47 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2023-02-18 12:33:54.0530889 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:33:54 WARNING] hDebInfo\_deps\onnx_tensorrt-src\onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
2023-02-18 12:39:41.4764061 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:39:41 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
2023-02-18 12:39:47.7026424 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:39:47 WARNING] hDebInfo\_deps\onnx_tensorrt-src\onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\transformers\generation\utils.py:1359: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
2023-02-18 12:56:27.6892311 [W:onnxruntime:Default, tensorrt_execution_provider.h:63 onnxruntime::TensorrtLogger::log] [2023-02-18 11:56:27 WARNING] TensorRT was linked against cuDNN 8.6.0 but loaded cuDNN 8.3.2

I'm facing several issues:

  1. Without ONNX it takes around 1s to generate 40 tokens on GPU. With the above code, it takes much longer. Is there an error in my code? I think it might be related to the warning "You are calling .generate() with the `input_ids` being on a device type different than your model's device." (see the sketch after this list).
  2. Although I'm using provider_options=dict(trt_fp16_enable=True) and the new use_merged=True, the model takes around 40 GB of CPU memory. Is this expected?
  3. Why is there a warning that CUDA lazy loading is not enabled when I'm using TensorrtExecutionProvider? (Also covered by the sketch below.)
  4. I'm using CUDA 11.6 with cuDNN 8.7.0. Why is there a warning that TensorRT was linked against cuDNN 8.6.0 but loaded cuDNN 8.3.2? How can I let it use cuDNN 8.7.0?
  5. It takes around 10 min to load the model. I assume this is because the model is optimized while loading (and cast to fp16). Is it possible to speed up model loading? Without ONNX it takes only around 10 seconds to load the model.
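
A minimal sketch of what points 1 and 3 seem to call for, assuming the exported model is in gptj_onnx/ and the installed CUDA version supports lazy loading (the warning itself points to the CUDA_MODULE_LOADING environment variable):

import os

# Point 3: enable CUDA lazy loading via the environment variable the warning
# links to; it has to be set before CUDA is initialized in the process.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = ORTModelForCausalLM.from_pretrained(
    "gptj_onnx",
    provider="TensorrtExecutionProvider",
    provider_options=dict(trt_fp16_enable=True),
    use_merged=True,
)

# Point 1: move the tokenized inputs to the model's device before generate(),
# as the transformers warning suggests.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0]))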

Expected behavior

  1. Faster generation (<1s for 40 tokens).
  2. Less CPU memory usage (< 40 GB).
  3. No warning that CUDA lazy loading is not enabled when I'm using TensorrtExecutionProvider.
  4. Usage of cuDNN 8.7.0 and no warning.
  5. Faster model loading.

Eichhof commented 1 year ago

Only about 2 GB of GPU memory is used, so I assume it is only using the CPU and not the GPU. CUDAExecutionProvider does use the GPU, but I can't use it because I run out of memory with fp32 (I have a 24 GB GPU and need to use fp16). How can I check whether TensorrtExecutionProvider is using the GPU? I was following the guidelines at https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu
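
(For reference, one way to check which execution providers are actually in use is to ask ONNX Runtime directly; a minimal sketch, assuming the merged decoder was exported as decoder_model_merged.onnx inside gptj_onnx/:)

import onnxruntime as ort

# Providers that this onnxruntime build can use at all.
print(ort.get_available_providers())

sess = ort.InferenceSession(
    "gptj_onnx/decoder_model_merged.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Providers the session actually registered; if TensorrtExecutionProvider is
# missing here, ONNX Runtime silently fell back to CUDA or CPU.
print(sess.get_providers())

Watching nvidia-smi during generation is another quick check, since ONNX Runtime's allocations do not show up in torch's memory counters.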

fxmarty commented 1 year ago

Thank you, it is unfortunate that you ran into these issues. I will have a look this week and extend the documentation with benchmarks. Something I want to do is provide a reproducible benchmark script to track memory usage and speed. I agree that releasing new features without battle-testing them is not good.
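
Roughly what I have in mind, as a sketch (not the final script; using psutil for process memory is just an assumption about how we would track it):

import time

import psutil
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = ORTModelForCausalLM.from_pretrained(
    "gptj_onnx",                       # path to the exported model (placeholder)
    provider="CUDAExecutionProvider",  # swap in TensorrtExecutionProvider to compare
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Warmup: the first call can include provider / engine setup.
model.generate(**inputs, max_new_tokens=40)

timings = []
for _ in range(5):
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=40)
    timings.append(time.perf_counter() - start)

print(f"mean latency: {sum(timings) / len(timings):.2f} s")
print(f"process RSS:  {psutil.Process().memory_info().rss / 1e9:.1f} GB")
# GPU memory is best read from nvidia-smi, since ONNX Runtime allocates
# outside of torch's memory tracking.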

I'm using Cuda 11.6 with cuDNN 8.7.0. Why is there a warning that TensorRT was linked against cuDNN 8.6.0 but loaded cuDNN 8.3.2? How can I let it use cuDNN 8.7.0?

I am not sure. I would recommend using Docker; for example, I use nvcr.io/nvidia/tensorrt:22.08-py3 from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt .

It takes around 10 min to load the model. I assume this is because the model is optimized while loading (and cast to fp16). Is it possible to speed up model loading?

Do you mean just creating the InferenceSession? Have you tried running trtexec to see if it takes as long? I will look into this as well.
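
In the meantime, one thing that might be worth trying (a sketch I have not benchmarked): the TensorRT execution provider can cache built engines on disk, so the expensive engine build should only happen on the first session creation:

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "gptj_onnx",
    provider="TensorrtExecutionProvider",
    provider_options={
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,       # reuse built engines across runs
        "trt_engine_cache_path": "trt_cache",  # example cache directory
    },
    use_merged=True,
)

Note that TensorRT can still rebuild engines when input shapes fall outside the cached profiles, which might also be why the warnings above repeat during generation.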

Eichhof commented 1 year ago

I am not sure. I would recommend you to use docker, for example I use nvcr.io/nvidia/tensorrt:22.08-py3 from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt .

I have never used Docker before. Does it work on Windows? How can I run my script in this Docker container? In addition, my script requires a lot of Python packages. Can I install them in the container somehow?

Do you mean just creating the InferenceSession? Have you tried running trtexec to see if it takes as long? I will look into this as well.

Creating the InferenceSession took that long for TensorrtExecutionProvider. On the other hand, for CUDAExecutionProvider it is very fast. I think the problem is described here.

fxmarty commented 1 year ago

@Eichhof I recommend using Docker because getting matching CUDA + cuDNN + TensorRT versions is kind of painful, and NVIDIA provides Docker images that are straightforward to use. I think it should work on Windows as well!