huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

(ONNXRuntimeError) LoadLibrary failed with error 126 #618

Closed Eichhof closed 1 year ago

Eichhof commented 1 year ago

System Info

Optimum: 1.5.1
Python: 3.10.4
Platform: Windows 10
Cuda: 11.6

Who can help?

@JingyaHuang @echarlaix

Reproduction

I installed Optimum with pip install optimum[onnxruntime-gpu], then ran python -m optimum.exporters.onnx --task causal-lm-with-past --model EleutherAI/gpt-j-6B gptj_onnx/ to export GPT-J to ONNX. To use the model, I used the following lines:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch

# Defined elsewhere in the original script; shown here so the snippet is self-contained.
gpt_eos = "<|endoftext|>"  # GPT-J end-of-sequence token
gradient_checkpointing = False

tokenizer = AutoTokenizer.from_pretrained(
    "C:/Users/myUsername/Desktop/gptj_onnx",
    pad_token=gpt_eos,
    eos_token=gpt_eos,
    truncation_side='left',
)
model = ORTModelForCausalLM.from_pretrained(
    "C:/Users/myUsername/Desktop/gptj_onnx",
    provider="TensorrtExecutionProvider",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_cache=True,
    gradient_checkpointing=gradient_checkpointing,
)

When running these lines of code, I'm getting the following error:

Traceback (most recent call last):
  File "C:\Users\myUsername\PycharmProjects\chatbot\server\server.py", line 349, in <module>
    model = Model_init()
  File "C:\Users\myUsername\PycharmProjects\chatbot\server\server.py", line 166, in Model_init
    model = Model(gradient_checkpointing=False, start_prompt=start_prompt)
  File "C:\Users\myUsername\PycharmProjects\chatbot\server\../../chatbot\gpt_j\model.py", line 58, in __init__
    self.model = ORTModelForCausalLM.from_pretrained(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 269, in from_pretrained
    return super().from_pretrained(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\modeling_base.py", line 266, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 324, in _from_pretrained
    model = ORTModel.load_model(os.path.join(model_id, subfolder, model_file_name), **kwargs)
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 216, in load_model
    return ort.InferenceSession(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1069 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_providers_tensorrt.dll"

I have installed CUDA 11.6 and cuDNN 8.7.0.

Expected behavior

The model should load correctly without an error.

michaelbenayoun commented 1 year ago

Hi @Eichhof, does it work with the CUDAExecutionProvider?
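
For reference, a minimal sketch of what switching providers could look like, reusing the local export path from the report above:

from optimum.onnxruntime import ORTModelForCausalLM

# Load the exported model with the CUDA execution provider instead of TensorRT.
model = ORTModelForCausalLM.from_pretrained(
    "C:/Users/myUsername/Desktop/gptj_onnx",
    provider="CUDAExecutionProvider",
)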

JingyaHuang commented 1 year ago

Hi @Eichhof, can you also check your TensorRT installation by following the steps in our documentation, and give us the version you are using? Thanks.
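
As a quick sanity check, here is a small sketch showing which execution providers the installed onnxruntime build exposes (TensorrtExecutionProvider must appear in the list for the TensorRT provider to load) and, assuming the tensorrt Python bindings are installed, which TensorRT version is present:

import onnxruntime

# The TensorRT and CUDA providers should both appear in this list.
print(onnxruntime.get_available_providers())

# Only works if the tensorrt Python bindings are installed alongside the C++ libraries.
import tensorrt
print(tensorrt.__version__)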

fxmarty commented 1 year ago

Hi @Eichhof, I just saw this one in the ONNX Runtime issues and I'm wondering if it could be related: https://github.com/microsoft/onnxruntime/issues/14063

Eichhof commented 1 year ago

Thank you very much for the hints. I will test it in the next two days and let you know.

Eichhof commented 1 year ago

Hi @michaelbenayoun and @JingyaHuang, thank you very much for your help with my problem. I tried CUDAExecutionProvider and the error no longer appears, so I need to check my TensorRT installation. However, when using CUDAExecutionProvider, I'm getting the out-of-memory error shown below. I think it is due to the model running in fp32. Is it possible to use fp16?

2022-12-29 13:37:33.3532630 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2022-12-29 13:37:33.3593281 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2022-12-29 13:38:39.0337837 [W:onnxruntime:, session_state.cc:1030 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2022-12-29 13:38:39.0417407 [W:onnxruntime:, session_state.cc:1032 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2022-12-29 13:38:39.5541455 [E:onnxruntime:, inference_session.cc:1500 onnxruntime::InferenceSession::Initialize::<lambda_d67cde18891e9d311739162a2b4aba6d>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\framework\bfc_arena.cc:342 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 67108864

Traceback (most recent call last):
  File "C:\Users\myUsername\PycharmProjects\chatbot\server\server.py", line 242, in <module>
    model = Model_init()
  File "C:\Users\myUsername\PycharmProjects\chatbot\server\server.py", line 48, in Model_init
    model = Model(gradient_checkpointing=False, start_prompt=start_prompt)
  File "C:\Users\myUsername\PycharmProjects\chatbot\server\../../chatbot\gpt_j\model.py", line 61, in __init__
    self.model = ORTModelForCausalLM.from_pretrained(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_ort.py", line 552, in from_pretrained
    return super().from_pretrained(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\modeling_base.py", line 325, in from_pretrained
    return from_pretrained_method(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_decoder.py", line 565, in _from_pretrained
    model = cls.load_model(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\optimum\onnxruntime\modeling_decoder.py", line 449, in load_model
    decoder_with_past_session = onnxruntime.InferenceSession(
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\myUsername\Anaconda3\envs\huggingface\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: D:\a\_work\1\s\onnxruntime\core\framework\bfc_arena.cc:342 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 67108864

Eichhof commented 1 year ago

Now that TensorRT is installed correctly, TensorrtExecutionProvider works, but it fails when I try to pass provider_options=dict(trt_fp16_enable=1) to enable FP16. Why?

In addition, I'm getting the same out-of-memory error as above. FP16 would probably solve this problem.

Finally, I'm also getting the warning CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars, and I'm getting tons of the following warnings:

2022-12-29 15:05:24.0568129 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-29 15:05:24.7410563 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:395: One or more weights outside the range of INT32 was clamped

fxmarty commented 1 year ago

@Eichhof Sorry you are running into all these issues. I hope we can really improve the support for TensorRT in the coming days/weeks.

Do you get a

EP Error using ['TensorrtExecutionProvider', 'CUDAExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.

when passing provider_options=dict(trt_fp16_enable=1) to from_pretrained()? I do at least, and I submitted a fix in https://github.com/huggingface/optimum/pull/653.
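
With that fix, passing the option should look roughly like the sketch below (the local path is the one from the report above, and this assumes the patched version of Optimum):

from optimum.onnxruntime import ORTModelForCausalLM

# Request FP16 engines from the TensorRT execution provider.
model = ORTModelForCausalLM.from_pretrained(
    "C:/Users/myUsername/Desktop/gptj_onnx",
    provider="TensorrtExecutionProvider",
    provider_options=dict(trt_fp16_enable=1),
)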

You can safely ignore the warnings:

2022-12-29 15:05:24.0568129 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
2022-12-29 15:05:24.7410563 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 onnxruntime::TensorrtLogger::log] [2022-12-29 14:05:24 WARNING] external\onnx-tensorrt\onnx2trt_utils.cpp:395: One or more weights outside the range of INT32 was clamped

I recommend reading issue https://github.com/huggingface/optimum/issues/636 if you are using gpt2/gpt-j or similar; it's an issue in transformers that I'll fix ASAP as well.

fxmarty commented 1 year ago

@Eichhof did you manage to solve the issue LoadLibrary failed with error 126? If so, I would prefer to close this issue and open another one.

Eichhof commented 1 year ago

@fxmarty Yes, I'm getting exactly this warning when passing provider_options=dict(trt_fp16_enable=1). When will the fix be incorporated into a new release?

In Transformers, I'm using low_cpu_mem_usage. Is this also available here?

Do you recommend CUDA lazy loading?

Yes, the error LoadLibrary failed with error 126 is solved. The problem was that TensorRT was not correctly installed.
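
For readers hitting the same error: error 126 usually means that a DLL the TensorRT provider depends on (for example nvinfer.dll) could not be found. A minimal sketch of one common workaround on Windows, assuming a hypothetical TensorRT install path, is to put the TensorRT lib directory on PATH before onnxruntime loads the provider:

import os

# Hypothetical TensorRT install location; adjust to the actual one.
os.environ["PATH"] = r"C:\TensorRT-8.5.1.7\lib" + os.pathsep + os.environ["PATH"]

import onnxruntime  # the TensorRT provider DLL should now be able to resolve its dependencies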

fxmarty commented 1 year ago

Hi, the PR is ready and should be merged into main soon.

Unfortunately, low_cpu_mem_usage is not available when using Optimum/ONNX Runtime.

For CUDA lazy loading, I'm not sure. Given that you get the warning I mentioned above, it's likely that CUDAExecutionProvider is actually being used.
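
For completeness, CUDA lazy loading is controlled through an environment variable that must be set before the CUDA runtime is initialized in the process, i.e. before the ONNX Runtime session is created. A minimal sketch, assuming a CUDA toolkit recent enough to support it:

import os

# Must be set before any library initializes CUDA in this process.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

from optimum.onnxruntime import ORTModelForCausalLM  # imported after setting the variable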

I'll close this issue for now, then; feel free to open another one for the CUDA lazy loading warning message!

Eichhof commented 1 year ago

@fxmarty I'm still waiting for the merge of the PR. Do you have any updates when this will be the case?

fxmarty commented 1 year ago

Hi @Eichhof, it is merged: https://github.com/huggingface/optimum/pull/653, and you should be able to pass provider_options=dict(trt_fp16_enable=1). But you will need to use the version from main for this to work, as there hasn't been a release yet.

If you encounter any other problem, feel free to open an issue; it helps us improve the library and keep track!