pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] error: backend='torch_tensorrt' raised: TypeError: pybind11::init(): factory function returned nullptr #2827

Open geraldstanje opened 4 months ago

geraldstanje commented 4 months ago

## Bug Description

Hi, I see the following error. `torch.compile` itself appears to work fine, but the first prediction I invoke afterwards fails:

[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[W] Unable to determine GPU memory usage
[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[TRT] [W] Unable to determine GPU memory usage
[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1104, GPU 0 (MiB)
[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[TRT] [W] CUDA initialization failure with error: 35. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
predict_fn error: backend='torch_tensorrt' raised: TypeError: pybind11::init(): factory function returned nullptr

Does Torch-TensorRT work on a g4dn.xlarge? Why do I get `CUDA initialization failure with error: 35`?

full log: tensorrt_torch_error.txt

## To Reproduce

Steps to reproduce the behavior:

  1. Build a container with TensorRT:

        # use the SageMaker DLC
        FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker

        # Install additional dependencies
        RUN python -m pip install torch torch-tensorrt tensorrt --extra-index-url https://download.pytorch.org/whl/cu118


How the model was compiled:

    model.model_body[0].auto_model = torch.compile(
        model.model_body[0].auto_model,
        backend="torch_tensorrt",
        dynamic=False,
        options={
            "truncate_long_and_double": True,
            "precision": torch.half,
            "debug": True,
            "min_block_size": 1,
            "optimization_level": 4,
            "use_python_runtime": False,
        },
    )


To rule out that the issue is somewhere else, I tested with the following `torch.compile` call, which works fine:

    model.model_body[0].auto_model = torch.compile(model.model_body[0].auto_model, mode="reduce-overhead")
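Since `torch.compile` defers the actual backend compilation to the first call, a failing backend only shows up at prediction time, which matches the behavior described above. The following is a minimal sketch of how a warm-up call could be used to detect that early and fall back to another configuration. The helper name `compile_with_fallback` and its parameters are hypothetical (not part of PyTorch or Torch-TensorRT), and `compile_fn` is injectable only so the logic can be exercised without a GPU.

```python
from typing import Any, Callable, Dict, Optional, Sequence, Tuple


def compile_with_fallback(
    module: Callable[..., Any],
    example_args: Sequence[Any],
    attempts: Sequence[Dict[str, Any]],
    compile_fn: Optional[Callable[..., Callable[..., Any]]] = None,
) -> Tuple[Callable[..., Any], Optional[Dict[str, Any]]]:
    """Hypothetical helper: try each torch.compile configuration in order.

    Returns (callable, kwargs_used). If every attempt fails, the original
    uncompiled module is returned together with None.
    """
    if compile_fn is None:
        # Imported lazily so the helper itself does not require torch.
        import torch

        compile_fn = torch.compile

    for kwargs in attempts:
        try:
            compiled = compile_fn(module, **kwargs)
            # Warm-up call: compilation happens here, not in compile_fn,
            # so backend errors are raised at this point.
            compiled(*example_args)
            return compiled, kwargs
        except Exception as exc:  # broad on purpose: backend errors vary
            print(f"compile attempt {kwargs!r} failed: {exc}")
    return module, None


# Possible usage for the setup in this issue (assumes `auto_model` and
# `example_inputs` exist in the caller's code):
#
# attempts = [
#     {"backend": "torch_tensorrt", "dynamic": False},
#     {"mode": "reduce-overhead"},
# ]
# auto_model, used = compile_with_fallback(auto_model, example_inputs, attempts)
```

This does not fix the underlying CUDA initialization failure; it only keeps the serving path alive while the driver or installation problem is investigated.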



Should I try some other settings for `torch.compile(model.model_body[0].auto_model, backend="torch_tensorrt")`?

Could the error be related to https://github.com/NVIDIA/TensorRT/issues/308?


## Expected behavior

no error


## Environment

> Build information about Torch-TensorRT can be found by turning on debug messages

 - Torch-TensorRT Version (e.g. 1.0.0):
 - PyTorch Version (e.g. 1.0): 2.1
 - CPU Architecture: g4dn.xlarge
 - OS (e.g., Linux):
 - How you installed PyTorch (`conda`, `pip`, `libtorch`, source):
 - Build command you used (if compiling from source):
 - Are you using local sources or building from archives:
 - Python version:
 - CUDA version:
 - GPU models and configuration:
 - Any other relevant information:

## Additional context

narendasan commented 4 months ago

Can you share something like the `nvidia-smi` output, so we can see the driver version and status?

geraldstanje commented 4 months ago

@narendasan sure. In the meantime, where can I check compatibility between the CUDA driver, the PyTorch version, the Torch-TensorRT version, etc.?

narendasan commented 4 months ago

For PyTorch vs Torch-TensorRT compatibility, the versions are aligned, so PyTorch v2.2.0 <-> Torch-TensorRT v2.2.0 (prior to PyTorch 2.0, it would be something like PyTorch 1.13 <-> Torch-TensorRT 1.3.0). Driver compatibility is determined by CUDA: https://docs.nvidia.com/deploy/cuda-compatibility/index.html. So if your PyTorch build targets CUDA 11.8, you need a driver >= 450.80.02. If you are using a CUDA 12.1 PyTorch build, you need >= 525.60.13. `nvidia-smi` can help you determine whether your CUDA version and CUDA driver are aligned.
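As a rough illustration of the driver check described above, here is a minimal sketch that compares a driver version against the two minimums quoted in this comment. The function names are hypothetical, and keying the table on the CUDA major version (rather than on 11.8 and 12.1 specifically) is an assumption based on CUDA minor-version compatibility. For context, error 35 in the CUDA runtime API is `cudaErrorInsufficientDriver`, assuming the code in the log is a CUDA runtime error code.

```python
import subprocess

# Minimum Linux driver per CUDA major version, taken from the comment above.
# Assumption: the same minimum applies to every minor release of that major.
MIN_DRIVER = {11: (450, 80, 2), 12: (525, 60, 13)}


def parse_version(text):
    """Turn a dotted version such as '470.103.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(cuda_version, driver_version):
    """Return True if driver_version meets the minimum for cuda_version."""
    cuda_major = parse_version(cuda_version)[0]
    if cuda_major not in MIN_DRIVER:
        raise ValueError(f"no minimum driver recorded for CUDA {cuda_major}.x")
    return parse_version(driver_version) >= MIN_DRIVER[cuda_major]


def installed_driver_version():
    """Ask nvidia-smi for the driver version (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()


# Possible usage on a machine with a GPU (torch.version.cuda is the CUDA
# version the installed PyTorch wheel was built against):
#
# import torch
# print(driver_is_sufficient(torch.version.cuda, installed_driver_version()))
```

If `nvidia-smi` itself fails inside the container, the GPU or driver is probably not exposed to the container at all, which would need to be resolved before any version comparison is meaningful.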

geraldstanje commented 3 months ago

@narendasan i tried it with:

and get same error - is that expected?

tanzelin430 commented 2 months ago

@geraldstanje I tried the ResNet example in https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/torch_compile_resnet_example.html with the following setup. `nvidia-smi` reports:

    NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.8

The GPU is an NVIDIA A100 80G, and `nvcc --version` prints:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

and `pip list` shows:

Package                  Version
------------------------ ------------
certifi                  2024.6.2
charset-normalizer       3.3.2
filelock                 3.15.4
fsspec                   2024.6.1
huggingface-hub          0.23.4
idna                     3.7
Jinja2                   3.1.4
joblib                   1.4.2
MarkupSafe               2.1.5
mpmath                   1.3.0
networkx                 3.2.1
numpy                    1.25.2
nvidia-cublas-cu11       11.11.3.6
nvidia-cublas-cu12       12.5.3.2
nvidia-cuda-cupti-cu11   11.8.87
nvidia-cuda-nvrtc-cu11   11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.5.82
nvidia-cudnn-cu11        8.7.0.84
nvidia-cudnn-cu12        9.1.1.17
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.3.0.86
nvidia-cusolver-cu11     11.4.1.48
nvidia-cusparse-cu11     11.7.5.86
nvidia-nccl-cu11         2.19.3
nvidia-nvtx-cu11         11.8.86
onnx                     1.16.1
packaging                24.1
pillow                   10.3.0
pip                      24.0
protobuf                 5.27.2
PyYAML                   6.0.1
regex                    2024.5.15
requests                 2.32.3
safetensors              0.4.3
scikit-learn             1.5.0
scipy                    1.13.1
sentence-transformers    3.0.1
setuptools               69.5.1
sympy                    1.12.1
tensorrt                 8.6.1.post1
tensorrt-bindings        8.6.1
tensorrt-libs            8.6.1
threadpoolctl            3.5.0
tokenizers               0.19.1
torch                    2.2.0+cu118
torch-tensorrt           2.2.0+cu118
torchvision              0.17.0+cu118
tqdm                     4.66.4
transformers             4.42.3
triton                   2.2.0
typing_extensions        4.12.2
urllib3                  2.2.2
wheel                    0.43.0

Have you or anyone else fixed this bug? Please let me know, thank you very much!
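One detail in the `pip list` output above is that both `-cu11` and `-cu12` NVIDIA wheels are installed next to a `+cu118` PyTorch build. Whether that is related to the error has not been established in this thread, but it is easy to check for. A minimal sketch, with hypothetical helper names, that groups installed NVIDIA wheels by their CUDA major suffix:

```python
import re
from importlib import metadata

# Matches names such as 'nvidia-cublas-cu11' and captures the CUDA major.
_CUDA_SUFFIX = re.compile(r"^nvidia-.+-cu(\d+)$")


def cuda_majors(package_names):
    """Group NVIDIA wheel names by CUDA major suffix, e.g. {'11': [...]}."""
    groups = {}
    for name in package_names:
        match = _CUDA_SUFFIX.match(name.lower())
        if match:
            groups.setdefault(match.group(1), []).append(name)
    return {major: sorted(names) for major, names in groups.items()}


def installed_package_names():
    """List the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    groups = cuda_majors(installed_package_names())
    if len(groups) > 1:
        print("NVIDIA wheels for more than one CUDA major are installed:")
    for major, names in sorted(groups.items()):
        print(f"  cu{major}: {', '.join(names)}")
```

If more than one CUDA major shows up, reinstalling into a clean environment with a single index URL (for example the cu118 one used earlier in this issue) is one way to rule it out.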