Thanks for reporting this. We are aware of the eval_util issue; its fix is already included in NGC containers, which use snapshots of PyTorch master. The fix for our master branch is staged here: https://github.com/NVIDIA/Torch-TensorRT/pull/736/files and will be merged when we update to PyTorch 1.11. @andi4191 Have you hit the linear decomposition issue in the NGC branches?
For your bazel issue, I am a bit confused. Did you use the TensorRT tarball (which should be distro agnostic) directly, i.e., reference the unpacked tarball in your WORKSPACE (the http_archive entries)? Or did you install or unpack the tarball (or maybe used an rpm or something) on your system and then use the local install path? I think {prefix}/include/{file} might be a good add for the local BUILD files for TRT and cuDNN.
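For reference, the two setups look roughly like this in the WORKSPACE. This is only a sketch: the URL, sha256, and BUILD file labels are placeholders rather than our exact entries, and you would use one entry or the other, not both:

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# (a) Distro-agnostic: point bazel at the tarball itself.
http_archive(
    name = "tensorrt",
    urls = ["https://developer.nvidia.com/..."],  # placeholder tarball URL
    sha256 = "<sha256 of the tarball>",
    strip_prefix = "TensorRT-8.2.3.0",
    build_file = "@//third_party/tensorrt/archive:BUILD",
)

# (b) Local install: unpack the tarball (or install the rpm) somewhere
# permanent and point bazel at that path instead.
new_local_repository(
    name = "tensorrt",
    path = "/opt/tensorrt",  # wherever it lives on your system
    build_file = "@//third_party/tensorrt/local:BUILD",
)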
Also as an aside, if you are using a newer-than-released version of PyTorch, I think the release/ngc/* branches will help you get going faster. We cut a branch each month which is intended to be part of that month's NGC container, and it should have updates for newer PyTorch support. (We don't merge these into master since PyTorch may change an API multiple times before release, but relevant changes get cherry-picked through PRs like the one I linked and staged until the next PyTorch comes out.)
I've hit the linear decomposition issue in the containers before. It's due to simple API changes. I've added support for pyt_nightly (based on Nov) which has both of these fixes: https://github.com/NVIDIA/Torch-TensorRT/commit/38f968ac68001a00a138923466f645c7c8de0ae2. We have it available in the release/ngc/22.02 branch.
@peri044 Can you stage a PR which cherry-picks this change for master and tag it with the upstreaming tag?
Hey sorry for the slow response! Thanks for following up, glad to see the fixes are already in. Thanks for explaining the staging structure, I should have searched different PRs and branches better! AFAICT this issue can be closed then :slightly_smiling_face:
RE tarball archive format: I just re-downloaded them and am pasting the structure to hopefully help save time. If there's a way to enable both paths in the bazel setup, I think that would be ideal. But again, it's not particularly problematic, because it's expected that users who have local installs are going to manipulate bazel no matter what, and the error messages about missing headers or libs are fairly straightforward.
Just include/ and lib/; the install instructions encourage you to copy it into your CUDA install, but it's also possible to just update CPATH and LD_LIBRARY_PATH and keep it in, e.g., /opt/cudnn or something.
$ tree -d cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive
cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive
├── include
└── lib
2 directories
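E.g., something along these lines (untested; paths assume the /opt/cudnn example above):

$ export CPATH=/opt/cudnn/include:$CPATH
$ export LD_LIBRARY_PATH=/opt/cudnn/lib:$LD_LIBRARY_PATH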
It includes a targets infrastructure, as well as symlinks to top-level bin, include, and lib. The extracted folder for install should just get moved somewhere permanent, e.g., /opt/tensorrt or what have you. I deleted the doc and samples subtrees from the output since they are not relevant here.
$ tree -d TensorRT-8.2.3.0
TensorRT-8.2.3.0
├── bin -> targets/x86_64-linux-gnu/bin
├── data
│   ├── char-rnn
│   │   └── model
│   ├── faster-rcnn
│   ├── googlenet
│   ├── int8_api
│   ├── mnist
│   ├── resnet50
│   └── ssd
│       └── batches
├── doc
...
├── graphsurgeon
├── include
├── lib -> targets/x86_64-linux-gnu/lib
├── onnx_graphsurgeon
├── python
├── samples
...
├── targets
│   └── x86_64-linux-gnu
│       ├── bin
│       ├── include -> ../../include
│       ├── lib
│       │   └── stubs
│       └── samples -> ../../samples
└── uff
104 directories
Bug Description
Hit two compilation bugs with an unreleased version of pytorch (the commit I compiled against is linked below) on an almost certainly "unsupported" configuration. Figured I'd share the changes that got me to succeed locally (though I'm not sure if they are valid; I haven't figured out how to use my compile success yet :slightly_smiling_face:):
1. eval_util.cpp: it was failing on invalid initialization of reference of type 'const c10::Type&' from expression of type 'std::shared_ptr<c10::Type>'. Could be gcc being picky?
2. linear_to_addmm.cpp: it was saying torch::jit::Function has no member graph. I just snagged this function https://github.com/pytorch/pytorch/blob/c5fe70021cc3498a1b0a2e7ed44e724cd6b1e4e7/torch/csrc/jit/api/function_impl.h#L164 but probably some kind of error check is needed. (Rough sketches of both changes below.)
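To show the shape of the two changes, here are minimal sketches; these are illustrative only, with made-up function names (checkType, caller, get_graph) rather than the actual Torch-TensorRT call sites. For the first, newer PyTorch seems to have changed some signatures from taking a c10::TypePtr (a std::shared_ptr<c10::Type>) to taking a const c10::Type&, so the call site has to dereference the pointer:

#include <ATen/core/jit_type.h>

void checkType(const c10::Type& t);  // hypothetical callee with the new signature

void caller(const c10::TypePtr& tp) {
  // checkType(tp);  // fails: "invalid initialization of reference of type
  //                 // 'const c10::Type&' from ... 'std::shared_ptr<c10::Type>'"
  checkType(*tp);    // dereference the shared_ptr instead
}

For the second, torch::jit::Function no longer exposes graph(); it lives on GraphFunction, which the helpers in the linked function_impl.h reach via toGraphFunction:

#include <memory>
#include <torch/csrc/jit/api/function_impl.h>

std::shared_ptr<torch::jit::Graph> get_graph(torch::jit::Function& fn) {
  // return fn.graph();  // no longer compiles on newer PyTorch
  // toGraphFunction throws if fn is not actually a GraphFunction;
  // tryToGraphFunction returns nullptr instead, which may be the
  // "error check" alluded to above.
  return torch::jit::toGraphFunction(fn).graph();
}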
To Reproduce
Try to build against a newer version of pytorch; see below for the pytorch commit I was using. I don't believe the use of cuda 11.5 or any locally installed cudnn / tensorrt is relevant here; these just seem like api changes?
Expected behavior
Things compile without error. But again, this is all unreleased stuff...
Environment
How you installed PyTorch (conda, pip, libtorch, source): from source; used spack to install it, no record of the exact command
Additional context
The fixes seem reasonable, but I'm not sure why they were needed since I don't know the api well enough. If the api has changed, I'm not really sure when you'd want to land these changes, so feel free to close this issue.
Unrelated, but worth mentioning when finessing bazel: not sure if the tarball behaves differently on different linux platforms, but at least with TensorRT 8.2.2.1 there was no x86_64-linux-gnu prefix (same for cudnn, but I did a tarball install to /usr/local/cuda-11.5). I think the ubuntu apt installers do that. I suppose it could be documented, but realistically the inability to find {prefix}/x86_64-linux-gnu/NvInfer.h is pretty self explanatory, and the docs already make it clear you're going to have to change a bunch for local installs. I bring it up because if bazel can be allowed to search for both {prefix}/include/{file} as well as {prefix}/x86_64-linux-gnu/{thing}, then that would be nice to add -- I don't know bazel well enough (rough sketch below). Affects tensorrt and cudnn for me locally.
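Something like this might work in the local BUILD files; purely a guess on my part since I don't know bazel well, and the target name and glob patterns here are illustrative, not the shipped BUILD contents:

# Hypothetical cc_library that accepts either header layout.
cc_library(
    name = "nvinfer_headers",
    hdrs = glob(
        [
            "include/NvInfer*.h",           # tarball layout: {prefix}/include/{file}
            "x86_64-linux-gnu/NvInfer*.h",  # apt layout: {prefix}/x86_64-linux-gnu/{file}
        ],
        allow_empty = True,  # tolerate whichever layout is absent
    ),
    includes = [
        "include/",
        "x86_64-linux-gnu/",
    ],
)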