Closed Eichhof closed 1 year ago
Hi @Eichhof,
Does it work with the CUDAExecutionProvider?
Hi @Eichhof, can you also check your TensorRT installation against the steps in our doc and tell us which version you are using? Thanks.
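As a rough aid for the two checks suggested above, here is a minimal sketch of how one might see which execution providers the installed ONNX Runtime build reports and pick one to request. The helper names `available_providers` and `pick_provider` are made up for illustration. Note that a provider being listed only means the build includes it; it does not prove that the TensorRT or CUDA libraries can actually be loaded.

```python
# Sketch: choose an execution provider from what the installed
# ONNX Runtime build reports. Helper names are hypothetical.

PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def available_providers():
    """Return the providers reported by onnxruntime, or CPU only if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return ["CPUExecutionProvider"]
    return ort.get_available_providers()


def pick_provider(available, preferred=PREFERRED):
    """Return the first preferred provider that is present in `available`."""
    for name in preferred:
        if name in available:
            return name
    return "CPUExecutionProvider"


print(pick_provider(available_providers()))
```

Once a session has been created, calling `get_providers()` on the `InferenceSession` should show which providers it actually ended up with, which is the more reliable check when a silent fallback is suspected.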
Hi @Eichhof, I just saw this one in the ONNX Runtime issues and wonder whether it could be related: https://github.com/microsoft/onnxruntime/issues/14063
Thank you very much for the hints. I will test it in the next two days and let you know.
Hi @michaelbenayoun and @JingyaHuang
Thank you very much for your help with my problem. I tried CUDAExecutionProvider and the error no longer appears, so I have to check my TensorRT installation. But when using CUDAExecutionProvider, I'm getting the out-of-memory error shown below. I think it is due to the use of fp32. Is it possible to use fp16?
2022-12-29 13:37:33.3532630 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2022-12-29 13:37:33.3593281 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2022-12-29 13:38:39.0337837 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2022-12-29 13:38:39.0417407 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2022-12-29 13:38:39.5541455 [E:onnxruntime:, inference_session.cc:1500 onnxruntime::InferenceSession::Initialize::<lambda_d67cde18891e9d311739162a2b4aba6d>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\framework\bfc_arena.cc:342 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 67108864
Traceback (most recent call last):
File "C:\Users\myUsername\PycharmProjects\chatbot\server\server.py", line 242, in <module>
model = Model_init()
File "C:\Users\myUsername\PycharmProjects\chatbot\server\server.py", line 48, in Model_init
model = Model(gradient_checkpointing=False, start_prompt=start_prompt)
File "C:\Users\myUsername\PycharmProjects\chatbot\server\../../chatbot\gpt_j\model.py", line 61, in __init__
self.model = ORTModelForCausalLM.from_pretrained(
File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 552, in from_pretrained
return super().from_pretrained(
File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\modeling_base.py", line 325, in from_pretrained
return from_pretrained_method(
File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_decoder.py", line 565, in _from_pretrained
model = cls.load_model(
File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_decoder.py", line 449, in load_model
decoder_with_past_session = onnxruntime.InferenceSession(
File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 347, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 395, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: D:\a\_work\1\s\onnxruntime\core\framework\bfc_arena.cc:342 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 67108864
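To put the fp32 versus fp16 question in numbers, here is a back-of-the-envelope sketch. It assumes roughly 6e9 parameters for GPT-J-6B and counts weights only, ignoring activations, the key/value cache and allocator overhead. The traceback shows the failure while creating decoder_with_past_session, so the estimate is also printed for two sessions, under the unverified assumption that each session holds its own copy of the weights.

```python
# Rough estimate of weight memory for GPT-J.
# Assumption: about 6e9 parameters; everything except weights is ignored.

N_PARAMS = 6_000_000_000
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_gib(n_params, dtype, n_sessions=1):
    """Memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] * n_sessions / 2**30


# The allocation that failed in the log above, converted to MiB.
failed_request_mib = 67108864 / 2**20

for dtype in ("fp32", "fp16"):
    for n_sessions in (1, 2):
        gib = round(weights_gib(N_PARAMS, dtype, n_sessions), 1)
        print(f"{dtype}, {n_sessions} session(s): {gib} GiB")
print("failed request (MiB):", failed_request_mib)
```

The failed request itself is small, which suggests the device was already nearly full when it was made. Halving the bytes per parameter halves the weight footprint, but whether that is enough depends on the GPU in use.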
Now TensorrtExecutionProvider works with the correct installation, but it fails when I try to pass provider_options=dict(trt_fp16_enable=1) to enable FP16. Why?
In addition, I'm still getting the same out-of-memory error as above. FP16 would probably solve this problem.
Finally, I'm also getting the warning "CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars",
and I'm getting tons of the following warnings:
2022-12-29 15:05:24.0568129 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-29 15:05:24.7410563 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:395: One or more weights outside the range of INT32 was clamped
@Eichhof Sorry you encountered all those issues. I hope we can really improve the support for TensorRT in the coming days/weeks.
Do you get a
EP Error using ['TensorrtExecutionProvider', 'CUDAExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
when passing provider_options=dict(trt_fp16_enable=1) to from_pretrained()? I at least do, and I submitted a fix in https://github.com/huggingface/optimum/pull/653
You can safely ignore the warnings:
2022-12-29 15:05:24.0568129 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-29 15:05:24.7410563 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:395: One or more weights outside the range of INT32 was clamped
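One plausible reason these warnings are harmless, offered as an assumption about this model rather than something verified here: exporters often store very large INT64 constants, such as INT64_MAX, as a sentinel meaning "slice to the end". Clamping such a sentinel to INT32_MAX does not change the result as long as the real dimensions are far below 2**31. A minimal sketch with hypothetical helper names:

```python
# Sketch: why clamping an INT64 "to the end" sentinel to INT32 can be harmless.
# Helper names are hypothetical; this does not call TensorRT.

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MAX = 2**63 - 1


def clamp_to_int32(value):
    """Clamp an integer into the INT32 range."""
    return max(INT32_MIN, min(INT32_MAX, value))


def slice_end(length, end):
    """Effective end index of a slice over a sequence of the given length."""
    return min(end, length)


seq_len = 2048  # illustrative sequence length
before = slice_end(seq_len, INT64_MAX)
after = slice_end(seq_len, clamp_to_int32(INT64_MAX))
print(before, after)
```

A clamped weight that carried a real value above INT32_MAX would be a different matter, so this only covers the sentinel case.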
If you are using gpt2, gpt-j, or similar models, I recommend reading the issue https://github.com/huggingface/optimum/issues/636. It's an issue in transformers, and I'll fix it ASAP as well.
@Eichhof did you manage to solve the issue "LoadLibrary failed with error 126"? If so, I would prefer to close this issue and open another one.
@fxmarty Yes, I'm getting exactly this warning when passing provider_options=dict(trt_fp16_enable=1). When will the fix be included in a new release?
In Transformers, I'm using low_cpu_mem_usage. Is this also available here?
Do you recommend CUDA lazy loading?
Yes, the error "LoadLibrary failed with error 126" is solved. The problem was that TensorRT was not correctly installed.
Hi, the PR is ready and should be merged into main soon.
Unfortunately, low_cpu_mem_usage is not available when using Optimum/ONNX Runtime.
For CUDA lazy loading, I'm not sure. Given that you get the warning I mentioned above, it's likely that CUDAExecutionProvider is actually being used rather than TensorrtExecutionProvider.
I'll close this issue for now then. Feel free to open a new one for the CUDA lazy loading warning message!
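For anyone who wants to try lazy loading anyway, the warning points at the CUDA_MODULE_LOADING environment variable. A minimal sketch, assuming it is set in the same Python process before any CUDA-using library is imported; whether it has an effect depends on the installed CUDA version, since, as far as I know, the variable is only honored from CUDA 11.7 onward:

```python
import os

# Set the variable before the first CUDA context is created, i.e. before
# importing onnxruntime or optimum in this process.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

# import onnxruntime  # would go here, after the variable is set

print(os.environ["CUDA_MODULE_LOADING"])
```

Setting the variable in the shell or in the IDE run configuration before starting Python should work equally well.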
@fxmarty I'm still waiting for the PR to be merged. Do you have any update on when that will happen?
Hi @Eichhof, it is merged: https://github.com/huggingface/optimum/pull/653, and you should now be able to pass provider_options=dict(trt_fp16_enable=1). You will need to use the version from main for this to work, though, since there hasn't been a release yet.
If you encounter any other problem, feel free to open an issue. It's helpful for us to improve the library and keep track!
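Putting the pieces of this thread together, a minimal sketch of what the FP16 TensorRT load might look like once a version containing the fix is installed (for example from main with pip install git+https://github.com/huggingface/optimum.git). The helper names `trt_fp16_kwargs` and `load_model` are hypothetical, and the directory is the export directory from the reproduction steps.

```python
# Sketch: request TensorRT with FP16 when loading an exported model.
# Helper names are hypothetical; loading needs a GPU, TensorRT and the ONNX files.


def trt_fp16_kwargs():
    """Keyword arguments for from_pretrained() discussed in this thread."""
    return {
        "provider": "TensorrtExecutionProvider",
        "provider_options": dict(trt_fp16_enable=1),
    }


def load_model(model_dir):
    """Load the exported model; optimum is imported lazily so the sketch imports cleanly without it."""
    from optimum.onnxruntime import ORTModelForCausalLM

    return ORTModelForCausalLM.from_pretrained(model_dir, **trt_fp16_kwargs())


# Example call (not executed here):
# model = load_model("gptj_onnx/")
print(trt_fp16_kwargs())
```

If the "EP Error ... Falling back to" message quoted earlier still appears, the session has fallen back to CUDAExecutionProvider and the FP16 option is not in effect.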
System Info
Who can help?
@JingyaHuang @echarlaix
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I installed optimum with
pip install optimum[onnxruntime-gpu]
Then I ran
python -m optimum.exporters.onnx --task causal-lm-with-past --model EleutherAI/gpt-j-6B gptj_onnx/
to convert GPT-J to ONNX. To use the model, I used the following lines:
When running these lines of code, I'm getting the following error:
I have installed CUDA 11.6 and cuDNN 8.7.0.
Expected behavior
The model should load correctly without an error.