pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] Errors building Torch-TensorRT on Jetson #2623

Open airalcorn2 opened 10 months ago

airalcorn2 commented 10 months ago

Bug Description

When trying to build Torch-TensorRT on a Jetson following the instructions here, I get errors that seem to be related to changes made to WORKSPACE.jp50 in this commit. The first error I get is:

ERROR: Traceback (most recent call last):
    File "/TensorRT/WORKSPACE", line 6, column 10, in <toplevel>
        workspace(name = "Torch-TensorRT")
Error in workspace: workspace() function should be used only at the top of the WORKSPACE file

which is caused by this line. When I remove that line, I get the following error:

ERROR: /TensorRT/WORKSPACE:95:1: name 'pip_install' is not defined

You can see in the commit that:

load("@rules_python//python:pip.bzl", "pip_install")

was removed, but pip_install is still used in the current WORKSPACE.jp50 (here). In contrast, WORKSPACE uses pip_parse (here). At this point, I just replaced WORKSPACE.jp50 with the older version, which got me much further, but I eventually hit a different error:

/usr/bin/ld: cannot find -ltorchtrt
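Regarding the `pip_install is not defined` error above: a hedged sketch of what a pip_parse-based replacement in a Bazel WORKSPACE generally looks like (the repository name and requirements file below are illustrative assumptions, not taken from the actual Torch-TensorRT WORKSPACE):

```python
# Illustrative pip_parse setup for a WORKSPACE file; "pip_deps" and the
# requirements path are hypothetical placeholders.
load("@rules_python//python:pip.bzl", "pip_parse")

pip_parse(
    name = "pip_deps",                          # hypothetical repo name
    requirements_lock = "//:requirements.txt",  # hypothetical lock file
)

# pip_parse generates an install_deps macro that must be loaded and called:
load("@pip_deps//:requirements.bzl", "install_deps")

install_deps()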

To Reproduce

Follow the Torch-TensorRT building instructions under "Building Natively on aarch64 (Jetson)" here.

Expected behavior

No errors related to the WORKSPACE.jp50 file.

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

A Jetson Xavier AGX and the dustynv/ros:iron-pytorch-l4t-r35.3.1 container base image.

Additional context

airalcorn2 commented 10 months ago

Re: this error:

/usr/bin/ld: cannot find -ltorchtrt

The failing command was:

aarch64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/register_tensorrt_classes.o /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/tensorrt_backend.o /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/tensorrt_classes.o /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/torch_tensorrt_py.o -L/TensorRT/py/torch_tensorrt/lib/ -L/opt/conda/lib/python3.6/config-3.6m-x86_64-linux-gnu -L/usr/local/lib/python3.8/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib -ltorchtrt -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-38/torch_tensorrt/_C.cpython-38-aarch64-linux-gnu.so -Wno-deprecated -Wno-deprecated-declarations -Wl,--no-as-needed -ltorchtrt -Wl,-rpath,$ORIGIN/lib -lpthread -ldl -lutil -lrt -lm -Xlinker -export-dynamic -D_GLIBCXX_USE_CXX11_ABI=1

I did:

cp bazel-bin/libtorchtrt.tar.gz .
tar -xzvf libtorchtrt.tar.gz
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/TensorRT/torch_tensorrt/lib

and then ran:

aarch64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/register_tensorrt_classes.o /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/tensorrt_backend.o /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/tensorrt_classes.o /TensorRT/build/temp.linux-aarch64-cpython-38/py/torch_tensorrt/csrc/torch_tensorrt_py.o -L/TensorRT/py/torch_tensorrt/lib/ -L/opt/conda/lib/python3.6/config-3.6m-x86_64-linux-gnu -L/usr/local/lib/python3.8/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib -L/TensorRT/torch_tensorrt/lib -ltorchtrt -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-38/torch_tensorrt/_C.cpython-38-aarch64-linux-gnu.so -Wno-deprecated -Wno-deprecated-declarations -Wl,--no-as-needed -ltorchtrt -Wl,-rpath,$ORIGIN/lib -lpthread -ldl -lutil -lrt -lm -Xlinker -export-dynamic -D_GLIBCXX_USE_CXX11_ABI=1

and that command could run (notice the addition of -L/TensorRT/torch_tensorrt/lib), but I don't know where to go next.
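The added `-L` flag works because of how the linker resolves `-l` options: for `-ltorchtrt`, `ld` searches each `-L` directory (plus the default search paths) for `libtorchtrt.so` or `libtorchtrt.a`. The mechanism can be reproduced with a throwaway library (all paths and names below are illustrative, not from the build):

```shell
# Build a throwaway shared library, then link against it with and
# without an -L path, mirroring the -ltorchtrt failure above.
mkdir -p /tmp/ldemo && cd /tmp/ldemo
echo 'int answer(void) { return 42; }' > answer.c
gcc -shared -fPIC answer.c -o libanswer.so

echo 'int answer(void); int main(void) { return answer() == 42 ? 0 : 1; }' > main.c

# Without -L, the linker cannot find libanswer.so:
gcc main.c -lanswer -o main 2>/dev/null && echo "linked" || echo "cannot find -lanswer"

# With -L pointing at the directory (and an rpath so the runtime loader
# can also find it, like the -Wl,-rpath in the command above):
gcc main.c -L/tmp/ldemo -lanswer -Wl,-rpath,/tmp/ldemo -o main && echo "linked with -L"
```

Note that `-L` only fixes link time; at run time the dynamic loader does its own search, which is why the `LD_LIBRARY_PATH` export (or an `-Wl,-rpath` entry) is also needed.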

airalcorn2 commented 10 months ago

I went back to using the current version of WORKSPACE.jp50 and made a couple of changes, including commenting out the pip_install step as suggested in the comment above it. Using the attached Dockerfile and WORKSPACE.jp50 files, I was able to successfully build torch_tensorrt. However, when trying to import torch_tensorrt I received a new error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/site-packages/torch_tensorrt/__init__.py", line 84, in <module>
    from torch_tensorrt._compile import *  # noqa: F403
  File "/usr/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 24, in <module>
    from torch._export import ExportedProgram
ImportError: cannot import name 'ExportedProgram' from 'torch._export' (/usr/local/lib/python3.8/dist-packages/torch/_export/__init__.py)

so it appears my hack in the Dockerfile of:

RUN sed -i 's/2.1.dev/2.1/g' py/torch_tensorrt/__init__.py

was ill-advised. Is there a way to make Torch-TensorRT work with PyTorch 2.1.0? PyTorch 2.2 is currently only available for JetPack 6.0.

WORKSPACE.jp50 Dockerfile

airalcorn2 commented 10 months ago

I followed the instructions under "Build from Source" here and that seemed to work. I used the attached Dockerfile and WORKSPACE. Once torch_tensorrt is installed, you have to use:

export PYTHONPATH=${PYTHONPATH}:/usr/lib/python3.8/site-packages
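A quick way to confirm the export took effect (a sketch; entries from `PYTHONPATH` are placed on `sys.path` at interpreter startup, whether or not the directory exists yet):

```shell
# After the export, the directory should appear on sys.path, so the
# interpreter can import packages installed there.
export PYTHONPATH=${PYTHONPATH}:/usr/lib/python3.8/site-packages
python3 -c "import sys; print('/usr/lib/python3.8/site-packages' in sys.path)"
# prints: True
```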

to be able to import it. I'm leaving this issue open because of the WORKSPACE.jp50 bug.

Dockerfile WORKSPACE