pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

❓ [Question] How do you solve the error: Expected Tensor but got Uninitialized? #1282

Closed Mark-M2L closed 2 years ago

Mark-M2L commented 2 years ago

❓ Question

Currently, I am compiling a custom segmentation model with torch_tensorrt.compile(), using a TorchScript module obtained via torch.jit. The compilation code is as follows:

scripted_model = torch.jit.freeze(torch.jit.script(model))

inputs = [torch_tensorrt.Input(
            min_shape=[2, 3, 600, 400],
            opt_shape=[2, 3, 600, 400],
            max_shape=[2, 3, 600, 400],
            dtype=torch.float,
        )]
enabled_precisions = {torch.float, torch.half}

with torch_tensorrt.logging.debug():
    trt_ts_module = torch_tensorrt.compile(scripted_model, inputs=inputs, enabled_precisions=enabled_precisions)

The code fails to compile at the following step:

        a = self.compression(torch.cat(x_list, 1))
        b = self.shortcut(x)

        c = a + b

        return c

This throws the following error:

Traceback (most recent call last):
  File "test.py", line 118, in <module>
    trt_ts_module = torch_tensorrt.compile(scripted_model, inputs=inputs, enabled_precisions=enabled_precisions)
  File "/home/oem/.pyenv/versions/ddrnet/lib/python3.8/site-packages/torch_tensorrt/_compile.py", line 115, in compile
    return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
  File "/home/oem/.pyenv/versions/ddrnet/lib/python3.8/site-packages/torch_tensorrt/ts/_compiler.py", line 113, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: Expected Tensor but got Uninitialized

It seems that some variable is uninitialized. Strangely, however, replacing the previous code with either of the following snippets compiles fine:

        a = self.compression(torch.cat(x_list, 1))

        return a

and

        b = self.shortcut(x)

        return b

So taking the sum of these two tensors somehow causes compilation to fail. Do you have any suggestions I could try so that this step compiles as well?

What you have already tried

I also tried adding the torch_executed_ops and min_block_size parameters to the compilation call, in the following combinations:

trt_ts_module = torch_tensorrt.compile(scripted_model, inputs=inputs, enabled_precisions=enabled_precisions, torch_executed_ops=["prim::ListConstruct"], min_block_size=1)
trt_ts_module = torch_tensorrt.compile(scripted_model, inputs=inputs, enabled_precisions=enabled_precisions, torch_executed_ops=["prim::ListConstruct"])
trt_ts_module = torch_tensorrt.compile(scripted_model, inputs=inputs, enabled_precisions=enabled_precisions, min_block_size=1)

These resulted in different errors, so I decided not to use these parameters for now.

Environment

Looking forward to your answer, thanks in advance.

bowang007 commented 2 years ago

@Mark-M2L could you please check whether this works? https://github.com/pytorch/TensorRT/issues/983

Mark-M2L commented 2 years ago

Thanks a lot for your reply. I have looked into it and have a question about the implementation. I can think of two ways to implement the solution you suggest. The first is to create a scripted model using torch.jit.script(model) and feed it to the pass in remove_exceptions.cpp to generate a new graph. But in that case, would I then need to construct a scripted model again from the new graph, so that it can be fed back into PyTorch and compiled with torch_tensorrt?

The second way I'm thinking of is to port the code in remove_exceptions.cpp to Python, so that we can remove the exceptions directly in Python and then feed the modified scripted model to the torch_tensorrt compiler.
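For concreteness, a rough and untested sketch of the Python side (it only locates the exception nodes; actually rewriting the surrounding prim::If blocks is the hard part that remove_exceptions.cpp handles in C++):

import torch

# Sketch only: find the exception nodes in the scripted graph from Python.
# `model` is assumed to be the segmentation model from the original post.
scripted = torch.jit.script(model)

# torch._C.Graph.findAllNodes searches the graph recursively by node kind.
for node in scripted.graph.findAllNodes("prim::RaiseException"):
    print(node)  # each hit sits inside a prim::If branch that only raises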

Would you suggest either of these approaches, or something different?

bowang007 commented 2 years ago

Hey @Mark-M2L, sorry I didn't make it clear. I think what you can do is simply add a call to torch::jit::exception_elimination right after this line: https://github.com/pytorch/TensorRT/blob/5d1acbacb3928c7d5b1f125cf8fe98c9bbaffbeb/core/lowering/lowering.cpp#L35. Then recompile Torch-TensorRT; that should fix it.
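Roughly, the edit would look like the sketch below (the pass lives in PyTorch's remove_exceptions.cpp; the exact function name, header path, and graph variable should be verified against your checkout):

// core/lowering/lowering.cpp -- sketch of the suggested one-line addition.
#include "torch/csrc/jit/passes/remove_exceptions.h"

// ... inside the lowering pipeline, right after the linked line, where `g`
// is the std::shared_ptr<torch::jit::Graph> being lowered:
torch::jit::EliminateExceptions(g);  // drops branches that only raise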

Mark-M2L commented 2 years ago

Thanks for the clarification. I recompiled Torch-TensorRT with your suggestion (I had previously built it via Python, as explained in https://github.com/pytorch/TensorRT/issues/1026#issuecomment-1119561746). The rebuild went fine, but even with your change I still get the same error (Expected Tensor but got Uninitialized). Which versions of CUDA, cuDNN, TensorRT, and torch_tensorrt did you use to get the model to compile? Perhaps I am using the wrong versions.

bowang007 commented 2 years ago

Hey @Mark-M2L, do you have a small repro so I can also run and test locally? By the way, we have seen this error before with this kind of operation:

if a:
    do_something()
else:
    raise SomeException

This happens because, when we have an If node like this, the node runs in fallback and there is no value corresponding to the raise-exception branch, as explained in #983.

Do you have detailed logs of the part that fails (the graph, etc.)? I suspect that when you do c = a + b there is a shape check, and if the shapes don't match it raises an exception.
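To illustrate with a hypothetical module (not your model): scripting a guarded add like the one below produces exactly this pattern, a prim::If whose raising branch yields no tensor value for the converter.

import torch

class AddWithCheck(torch.nn.Module):
    # Hypothetical example of the pattern: the shape guard scripts to a
    # prim::If, and the branch that raises produces no tensor value.
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        if a.shape != b.shape:
            raise RuntimeError("shape mismatch")  # -> prim::RaiseException
        return a + b

# Printing the graph shows the prim::If / prim::RaiseException structure.
print(torch.jit.script(AddWithCheck()).graph)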

Mark-M2L commented 2 years ago

Hi @bowang007, thank you very much for your help. I created a small reproducible repo at Reproduced repo. It is basically a copy of DDRNet.pytorch, modified such that it (eventually) compiles with torch_tensorrt.

For this repo, I use Python 3.8.13. The packages that are installed are:

[screenshot: list of installed package versions]

Regarding the logs where it fails: the RuntimeError seems to be thrown abruptly. I have tried logging at both the Debug and Graph levels (Error-level logging prints nothing besides the RuntimeError itself). Some things that I see are:

[screenshot: debug log excerpt]

This one could indicate that a shape is wrong; however, it also seems to happen in layer1, at line 325 of ddrnet_23_slim.py. The part where we apply c = a + b occurs later, right before the final layer; that code is at line 202, in the DAPPM class. So I doubt this is what leads to the final error being thrown.

Other suspicious debug lines are the following:

[screenshots: further debug log excerpts]

However, they do not seem to lead to the RuntimeError and do not seem to interrupt the program.

Does this give you enough information? Of course, I can provide you with more information if requested. Thanks a lot for your help.

Mark-M2L commented 2 years ago

Hi @bowang007, did you perhaps have time to test the sample code? It would really help us if you could take a look.

bowang007 commented 2 years ago

Hey @Mark-M2L, I'm going to test it this week; I was stuck on something else last week. I will update you soon.

bowang007 commented 2 years ago

Hey @Mark-M2L, I ran your model locally and hit this bug: https://github.com/pytorch/TensorRT/issues/1336. It seems to be because I'm using the latest Torch-TensorRT version while you are using 1.10, and there have been some changes since 1.10.

I'm now working on supporting your model together with this PR: https://github.com/pytorch/TensorRT/pull/1263. Hopefully this can be completed this week; I will reply to you once your model is supported.

Mark-M2L commented 2 years ago

Hi @bowang007, thank you very much for taking the time to support the model. I really appreciate it. Looking forward to your update :)

bowang007 commented 2 years ago

Hey @Mark-M2L, sorry, I forgot to reply earlier. Could you please try these two PRs: #1263 and #1345? I tested locally, and with these two PRs your model is supported; the results also look good. Please post an update and close this issue if it works for you.

Mark-M2L commented 2 years ago

Hi @bowang007, thanks a lot! So the two PRs together worked for you? Then I will start testing them as soon as I have time. If it succeeds, I will post here and close the issue.

bowang007 commented 2 years ago

@Mark-M2L Both PRs mentioned above have been merged into the master branch, so your model should now be supported on master. I'm closing this issue.