Open geraldstanje opened 4 months ago
Can you share something like the nvidia-smi printout, which shows the driver version and status?
@narendasan Sure. In the meantime, where can I check the compatibility of the CUDA driver, PyTorch version, pytorch/TensorRT version, etc.?
For PyTorch vs Torch-TensorRT compatibility, the versions are aligned, so PyTorch v2.2.0 <-> Torch-TensorRT v2.2.0 (prior to PyTorch 2.0, it would be something like PyTorch 1.13 <-> Torch-TensorRT 1.3.0). Driver compatibility is determined by CUDA, see https://docs.nvidia.com/deploy/cuda-compatibility/index.html . If your PyTorch build targets CUDA 11.8, you need driver >= 450.80.02. If you are using a CUDA 12.1 PyTorch build, you need driver >= 525.60.13. nvidia-smi can help you determine whether your CUDA version and CUDA driver are aligned.
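The driver check described above can be sketched in a few lines of Python. This is only an illustration: the names `MIN_DRIVER`, `version_tuple`, `parse_driver_version`, and `driver_is_sufficient` are made up for this example, the table holds only the two minimums quoted above, and the sample header line is an assumed nvidia-smi output rather than a required format.

```python
import re

# Minimum driver per CUDA version, limited to the two values quoted above.
# See the CUDA compatibility page for the complete, authoritative table.
MIN_DRIVER = {
    "11.8": "450.80.02",
    "12.1": "525.60.13",
}


def version_tuple(version):
    """Turn '470.182.03' into (470, 182, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def parse_driver_version(smi_header):
    """Extract the driver version from an nvidia-smi header line."""
    match = re.search(r"Driver Version:\s*([\d.]+)", smi_header)
    if match is None:
        raise ValueError("no driver version found in nvidia-smi output")
    return match.group(1)


def driver_is_sufficient(driver, cuda):
    """True if `driver` meets the minimum listed for the CUDA version `cuda`."""
    return version_tuple(driver) >= version_tuple(MIN_DRIVER[cuda])


# Assumed sample header line, in the shape nvidia-smi usually prints.
header = "| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.8"
driver = parse_driver_version(header)
print(driver, driver_is_sufficient(driver, "11.8"), driver_is_sufficient(driver, "12.1"))
```

With a 470-series driver this reports that the CUDA 11.8 minimum is met while the CUDA 12.1 minimum is not.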
@narendasan I tried it with:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
nvcc -V:
nvcc_output: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
GPU: Nvidia Tesla T4
Torch v2.2.0
Torch-TensorRT v2.2.0
pip list output:
Package Version
--------------------------- --------------
aiohttp 3.9.5
aiosignal 1.3.1
aniso8601 9.0.1
ansi2html 1.9.1
archspec 0.2.2
arrow 1.3.0
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
awscli 1.32.108
blinker 1.8.2
boltons 23.1.1
boto3 1.34.108
botocore 1.34.108
Brotli 1.1.0
cached-property 1.5.2
captum 0.6.0
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
conda 23.11.0
conda-content-trust 0.2.0
conda-libmamba-solver 23.12.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
contourpy 1.2.1
cryptography 42.0.7
cycler 0.12.1
Cython 3.0.10
datasets 2.19.1
decorator 5.1.1
dill 0.3.8
distro 1.8.0
docutils 0.16
enum-compat 0.0.3
evaluate 0.4.2
exceptiongroup 1.2.1
executing 2.0.1
filelock 3.14.0
Flask 3.0.3
Flask-RESTful 0.3.10
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.3.1
h5py 3.11.0
huggingface-hub 0.23.1
idna 3.7
ipython 8.18.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 2.4
kiwisolver 1.4.5
libmambapy 1.5.5
mamba 1.5.5
MarkupSafe 2.1.5
matplotlib 3.9.0
matplotlib-inline 0.1.7
menuinst 2.0.1
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
nvgpu 0.10.0
nvidia-cublas-cu11 11.11.3.6
nvidia-cublas-cu12 12.5.2.13
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.5.39
nvidia-cudnn-cu11 8.7.0.84
nvidia-cudnn-cu12 9.1.1.17
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.3.0.86
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusparse-cu11 11.7.5.86
nvidia-nccl-cu11 2.19.3
nvidia-nvtx-cu11 11.8.86
opencv-python 4.9.0.80
packaging 23.2
pandas 2.2.2
parso 0.8.4
pexpect 4.9.0
pillow 10.3.0
pip 24.0
platformdirs 4.1.0
pluggy 1.3.0
prompt-toolkit 3.0.38
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 15.0.2
pyarrow-hotfix 0.6
pyasn1 0.6.0
pycosat 0.6.6
pycparser 2.21
Pygments 2.18.0
pynvml 11.5.0
pyOpenSSL 24.1.0
pyparsing 3.1.2
PySocks 1.7.1
python-dateutil 2.9.0
pytz 2024.1
PyYAML 6.0
regex 2024.5.15
requests 2.31.0
retrying 1.3.4
rsa 4.7.2
ruamel.yaml 0.18.5
ruamel.yaml.clib 0.2.7
s3transfer 0.10.1
safetensors 0.4.3
sagemaker-inference 1.10.1
sagemaker-pytorch-inference 2.0.23
scikit-learn 1.4.2
scipy 1.13.0
sentence-transformers 2.7.0
setfit 1.0.1
setuptools 68.2.2
six 1.16.0
stack-data 0.6.3
sympy 1.12
tabulate 0.9.0
tensorrt 8.6.1.post1
tensorrt-bindings 8.6.1
tensorrt-libs 8.6.1
termcolor 2.4.0
threadpoolctl 3.5.0
tokenizers 0.15.2
torch 2.2.0+cu118
torch-model-archiver 0.11.0
torch-tensorrt 2.2.0+cu118
torchaudio 2.2.0+cu118
torchdata 0.7.1+5e6f7b7
torchserve 0.11.0
torchtext 0.17.0+cu118
torchvision 0.17.0+cu118
tqdm 4.66.4
traitlets 5.14.3
transformers 4.37.2
triton 2.2.0
truststore 0.8.0
types-python-dateutil 2.9.0.20240316
typing_extensions 4.11.0
tzdata 2024.1
urllib3 1.26.18
wcwidth 0.2.13
Werkzeug 3.0.3
wheel 0.42.0
xxhash 3.4.1
yarl 1.9.4
zstandard 0.22.0
and I get the same error. Is that expected?
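One thing visible in the pip list above is that both -cu11 and -cu12 NVIDIA wheels are installed next to a +cu118 torch build. Whether that is related to the error is not established in this thread, but such a mix is easy to flag. A minimal sketch, assuming package names end in -cuNN as they do in the list above; the helper name `cuda_majors` is hypothetical.

```python
import re
from collections import defaultdict


def cuda_majors(package_names):
    """Group package names by the CUDA major version in their -cuNN suffix."""
    groups = defaultdict(list)
    for name in package_names:
        match = re.search(r"-cu(\d+)$", name)
        if match:
            groups[int(match.group(1))].append(name)
    return dict(groups)


# A few names copied from the pip list above.
installed = [
    "nvidia-cublas-cu11",
    "nvidia-cublas-cu12",
    "nvidia-cuda-runtime-cu11",
    "nvidia-cuda-runtime-cu12",
    "nvidia-cudnn-cu11",
    "nvidia-cudnn-cu12",
    "torch",
    "tensorrt",
]

groups = cuda_majors(installed)
if len(groups) > 1:
    print("mixed CUDA majors installed:", sorted(groups))
```

To run this against a live environment instead of a hard-coded list, the names could be collected with the standard library's `importlib.metadata.distributions()`.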
@geraldstanje I tried the resnet example in https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/torch_compile_resnet_example.html with:
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.8 |
The GPU is an NVIDIA A100 80GB,
and running nvcc --version gives:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
and pip list shows:
Package Version
------------------------ ------------
certifi 2024.6.2
charset-normalizer 3.3.2
filelock 3.15.4
fsspec 2024.6.1
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.4
joblib 1.4.2
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
numpy 1.25.2
nvidia-cublas-cu11 11.11.3.6
nvidia-cublas-cu12 12.5.3.2
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.5.82
nvidia-cudnn-cu11 8.7.0.84
nvidia-cudnn-cu12 9.1.1.17
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.3.0.86
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusparse-cu11 11.7.5.86
nvidia-nccl-cu11 2.19.3
nvidia-nvtx-cu11 11.8.86
onnx 1.16.1
packaging 24.1
pillow 10.3.0
pip 24.0
protobuf 5.27.2
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
scikit-learn 1.5.0
scipy 1.13.1
sentence-transformers 3.0.1
setuptools 69.5.1
sympy 1.12.1
tensorrt 8.6.1.post1
tensorrt-bindings 8.6.1
tensorrt-libs 8.6.1
threadpoolctl 3.5.0
tokenizers 0.19.1
torch 2.2.0+cu118
torch-tensorrt 2.2.0+cu118
torchvision 0.17.0+cu118
tqdm 4.66.4
transformers 4.42.3
triton 2.2.0
typing_extensions 4.12.2
urllib3 2.2.2
wheel 0.43.0
Have you or anyone else fixed this bug? Please let me know, thank you very much!
Bug Description
Hi, I see the error below. It looks like torch.compile worked fine, but when I invoke the prediction after that, it errors out.
Does pytorch-tensorrt work with a g4dn.xlarge? Why do I get this error:
CUDA initialization failure with error: 35
Full log: tensorrt_torch_error.txt
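For reference, assuming the number in the message is a CUDA runtime error code, value 35 in the cudaError_t enum is cudaErrorInsufficientDriver, which means the installed driver is older than what the CUDA runtime in use requires. A small lookup sketch; the table is deliberately partial and the helper name `describe_cuda_error` is hypothetical.

```python
# A few cudaError_t values from the CUDA runtime API (not the full enum).
CUDA_ERRORS = {
    0: "cudaSuccess",
    2: "cudaErrorMemoryAllocation",
    3: "cudaErrorInitializationError",
    35: "cudaErrorInsufficientDriver",
    100: "cudaErrorNoDevice",
}


def describe_cuda_error(message):
    """Pull the trailing numeric code out of a log line and name it if known."""
    code = int(message.rsplit(":", 1)[1])
    return code, CUDA_ERRORS.get(code, "unknown (not in this partial table)")


print(describe_cuda_error("CUDA initialization failure with error: 35"))
```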
To Reproduce
Steps to reproduce the behavior:
Install additional dependencies
RUN python -m pip install torch torch-tensorrt tensorrt --extra-index-url https://download.pytorch.org/whl/cu118
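import torch
import torch_tensorrt  # importing this registers the "torch_tensorrt" backend

# Compile the transformer body with the Torch-TensorRT backend.
model.model_body[0].auto_model = torch.compile(
    model.model_body[0].auto_model,
    backend="torch_tensorrt",
    dynamic=False,
    options={
        "truncate_long_and_double": True,
        "precision": torch.half,
        "debug": True,
        "min_block_size": 1,
        "optimization_level": 4,
        "use_python_runtime": False,
    },
)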
model.model_body[0].auto_model = torch.compile(model.model_body[0].auto_model, mode="reduce-overhead")