🐛 [Bug] Possible inbound libtorch compilation failures

svenevs commented 2 years ago

Bug Description

Hit two compilation bugs with an unreleased version of pytorch (commit compiled linked below) on an almost certainly "unsupported" configuration. Figured I'd share the changes that got me to succeed locally (though I'm not sure if they are valid, haven't figured out how to used my compile success yet :slightly_smiling_face:):

diff --git a/core/conversion/evaluators/eval_util.cpp b/core/conversion/evaluators/eval_util.cpp
index dc050388..85bf6b73 100644
--- a/core/conversion/evaluators/eval_util.cpp
+++ b/core/conversion/evaluators/eval_util.cpp
@@ -238,7 +238,8 @@ at::Tensor createTensorFromList(
   /// Gets shape of tensor to be created
   auto sizes = compute_sizes(data);
   checkListInputType(elem_type, sizes.size() == 1 && sizes[0] == 0);
-  at::ScalarType initial_scalar_type = c10::scalarTypeFromJitType(elem_type);
+  at::ScalarType initial_scalar_type = c10::scalarTypeFromJitType(*elem_type);
   if (initial_scalar_type == at::ScalarType::Double) {
     initial_scalar_type = at::typeMetaToScalarType(c10::get_default_dtype());
   }
diff --git a/core/lowering/passes/linear_to_addmm.cpp b/core/lowering/passes/linear_to_addmm.cpp
index 15a035f5..372efa2f 100644
--- a/core/lowering/passes/linear_to_addmm.cpp
+++ b/core/lowering/passes/linear_to_addmm.cpp
@@ -34,7 +34,8 @@ void replaceLinearWithBiasNonePattern(std::shared_ptr<torch::jit::Graph> graph)
         continue;
       } else {
         torch::jit::WithInsertPoint guard(*it);
-        std::shared_ptr<torch::jit::Graph> d_graph = decompose_funcs.get_function("linear").graph();
+        std::shared_ptr<torch::jit::Graph> d_graph =
+          torch::jit::toGraphFunction(decompose_funcs.get_function("linear")).graph();
         torch::jit::Value* new_output = insertGraph(*it->owningGraph(), *d_graph, it->inputs()).at(0);
         new_output->setType(it->output()->type());
         it->output()->replaceAllUsesWith(new_output);

For eval_util.cpp it was failing on invalid initialization of reference of type 'const c10::Type&' from expression of type 'std::shared_ptr<c10::Type>'. Could be gcc being picky?
For linear_to_addmm.cpp, it was saying torch::jit::Function has no member graph. I just snagged this function https://github.com/pytorch/pytorch/blob/c5fe70021cc3498a1b0a2e7ed44e724cd6b1e4e7/torch/csrc/jit/api/function_impl.h#L164 but probably some kind of error check is needed.

To Reproduce

Try and build a newer version of pytorch. See below for pytorch commit I was using. I don't believe the use of cuda 11.5 or any local installed cudnn / tensorrt are relevant here, they just seem like api changes?

Expected behavior

Things compile without error. But again, this is all unreleased stuff...

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

Torch-TensorRT Version (e.g. 1.0.0): https://github.com/NVIDIA/Torch-TensorRT/commit/a465cc0fe622cffa5206447c6db50f48ba9e8728 (just building locally)
PyTorch Version (e.g. 1.0): https://github.com/pytorch/pytorch/tree/94abf120c82b613bcac1570b8910f8b7898885b6 (from source, cuda 11.5 support works there or likely thereafter).
CPU Architecture: x86_64
OS (e.g., Linux): Fedora 34
How you installed PyTorch (conda, pip, libtorch, source): from source
Build command you used (if compiling from source): used spack to install it, no record of the exact command
Are you using local sources or building from archives: local sources
Python version: 3.8.12
CUDA version: 11.5
Any other relevant information:
- gcc 11.0.1

Additional context

The fixes seem reasonable, not sure why they were needed since I don't know the api well enough. If the api has changed, not really sure when you'd want to land these changes so feel free to close this issue.

Unrelated, but worth mentioning when finessing bazel, not sure if the tarball behaves differently on different linux platforms, but at least with TensoRT 8.2.2.1 there was no x86_64-linux-gnu prefix, same for cudnn [but I did a tarball install to /usr/local/cuda-11.5]). I think the ubuntu apt installers do that. I suppose it could be documented but realistically the inability to find {prefix}/x86_64-linux-gnu/NvInfer.h is pretty self explanatory and the docs already make it clear you're going to have to change a bunch for local installs. I bring it up because if bazel can be allowed to search for both {prefix}/include/{file} as well as {prefix}/x86_64-linux-gnu/{thing} then that would be nice to add -- I don't know bazel well enough. Affects tensorrt and cudnn for me locally.

narendasan commented 2 years ago

Thanks for reporting this. We are aware of the eval_util issue, its fix is already included in NGC containers which use snapshots of PyTorch master. The fix for our master branch is staged here https://github.com/NVIDIA/Torch-TensorRT/pull/736/files and will be merged when we update to PyTorch 1.11. @andi4191 Have you hit the linear decomposition issue in the NGC branches?

For your bazel issue. I am a bit confused. Did you use the TensorRT tarball (which should be distro agnostic) directly, like reference the unpacked tarball directly in your WORKSPACE (i.e. the http_archive entries) or did you install or unpack the tarball (or maybe used an rpm or something) in your system and then use the local install path? I think {prefix}/include/{file} might be a good add for the local BUILD files for TRT and cuDNN.

narendasan commented 2 years ago

Also as an aside, if you are using a newer than released version of PyTorch, I think the release/ngc/* branches will help you get going faster. We cut a branch each month which is intended to be part of that month's NGC container and it should have updates for newer PyTorch support. (We don't merge these into master since PyTorch may change an API multiple times before release but relevant changes get cherrypicked through PRs like the one I linked and staged until the next PyTorch comes out).

peri044 commented 2 years ago

Thanks for reporting this. We are aware of the eval_util issue, its fix is already included in NGC containers which use snapshots of PyTorch master. The fix for our master branch is staged here https://github.com/NVIDIA/Torch-TensorRT/pull/736/files and will be merged when we update to PyTorch 1.11. @andi4191 Have you hit the linear decomposition issue in the NGC branches?

For your bazel issue. I am a bit confused. Did you use the TensorRT tarball (which should be distro agnostic) directly, like reference the unpacked tarball directly in your WORKSPACE (i.e. the http_archive entries) or did you install or unpack the tarball (or maybe used an rpm or something) in your system and then use the local install path? I think {prefix}/include/{file} might be a good add for the local BUILD files for TRT and cuDNN.

I've hit the linear decomposition issue in the containers before. It's due to simple API changes. I've added support for pyt_nightly (based on Nov) which has both of these fixes. https://github.com/NVIDIA/Torch-TensorRT/commit/38f968ac68001a00a138923466f645c7c8de0ae2. We have it available in release/ngc/22.02 branch

narendasan commented 2 years ago

@peri044 Can you stage a PR which cherry picks this change for master and tag it with the upstreaming tag?

svenevs commented 2 years ago

Hey sorry for the slow response! Thanks for following up, glad to see the fixes are already in. Thanks for explaining the staging structure, I should have searched different PRs and branches better! AFAICT this issue can be closed then :slightly_smiling_face:

RE tarball archive format, just re-downloaded them and am pasting the structure to hopefully help save time. If there's a way to enable both paths in the bazel setup I think that would be ideal. But again it's not particularly problematic because it's expected that users who have local installs are going to manipulate bazel no matter what, and the error messages about missing header or libs are fairly straightforward.

cudnn

Just include/ and lib, the install instructions encourage you to copy it to your CUDA install but it's also possible to just update CPATH and LD_LIBRARY_PATH and keep it in e.g., /opt/cudnn or something.

$ tree -d cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive
cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive
├── include
└── lib

2 directories

tensorrt

It includes a targets infrastructure, as well as symlyinks to top-level bin, include, and lib. The extracted folder for install should just get moved to somewhere permanent, e.g., /opt/tensorrt or what have you. Deleted doc and samples from output since they are not relevant here.

$ tree -d TensorRT-8.2.3.0
TensorRT-8.2.3.0
├── bin -> targets/x86_64-linux-gnu/bin
├── data
│   ├── char-rnn
│   │   └── model
│   ├── faster-rcnn
│   ├── googlenet
│   ├── int8_api
│   ├── mnist
│   ├── resnet50
│   └── ssd
│       └── batches
├── doc
 ...
├── graphsurgeon
├── include
├── lib -> targets/x86_64-linux-gnu/lib
├── onnx_graphsurgeon
├── python
├── samples
...
├── targets
│   └── x86_64-linux-gnu
│       ├── bin
│       ├── include -> ../../include
│       ├── lib
│       │   └── stubs
│       └── samples -> ../../samples
└── uff

104 directories

pytorch / TensorRT