pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] Cannot compile SwinIR model (shape_analysis.cpp: Expected ivalues_maps.count(input) to be true but got false) #1684

Open arition opened 1 year ago

arition commented 1 year ago

Bug Description

Cannot compile the SwinIR model.

Error message:

Traceback (most recent call last):
  File "main.py", line 61, in <module>
    compile_tensorrt_model(torch.float)
  File "main.py", line 56, in compile_tensorrt_model
    compiled_model = torch_tensorrt.compile(traced_model, inputs=inputs, enabled_precisions=enabled_precisions,
  File "/usr/local/lib/python3.8/dist-packages/torch_tensorrt/_compile.py", line 125, in compile
    return torch_tensorrt.ts.compile(
  File "/usr/local/lib/python3.8/dist-packages/torch_tensorrt/ts/_compiler.py", line 136, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:167] Expected ivalues_maps.count(input) to be true but got false
Could not find torch::jit::Value* 71852 produced from %71852 : Tensor = aten::add(%71851, %71850, %71848) in lowering graph for mini graph input.

To Reproduce

The original code is not properly typed, so I modified it a bit. Repo: https://github.com/arition/SwinIR-TensorRT

What I changed compared to the original code:

To reproduce, just download pretrained weight (link in code) and run main.py.
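
For context, a minimal sketch of the compile call implied by the traceback above; the SwinIR import path, constructor arguments, and input shape are assumptions rather than the exact contents of main.py:

import torch
import torch_tensorrt

# Import path and constructor arguments are assumptions based on the upstream
# SwinIR code; the real values come from main.py in the linked repo.
from models.network_swinir import SwinIR

model = SwinIR(upscale=4, img_size=64, window_size=8).eval().cuda()
example = torch.randn(1, 3, 64, 64, device="cuda")
traced_model = torch.jit.trace(model, example)

compiled_model = torch_tensorrt.compile(
    traced_model,
    inputs=[example],
    enabled_precisions={torch.float},
)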

Expected behavior

The model compiles without problems.

Environment

I use the PyTorch container 23.01-py3 from NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

arition commented 1 year ago

Any updates on this issue?

github-actions[bot] commented 1 year ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented 11 months ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

arition commented 11 months ago

Still no updates?

willianck commented 7 months ago

Did you have any luck finding the issue, @arition? I am using the PyTorch implementation of the Swin Transformer and am running into the same problem. It is not able to handle the aten::add op when calculating the shifted window attention and throws the same error. I was able to get the model to compile using TorchScript alone, but was unsuccessful when combining it with torch_tensorrt.
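
For reference, here is a minimal sketch of the setup I described; torchvision's swin_t and the fixed input shape are assumptions, not my actual script:

import torch
import torch_tensorrt as trt
from torchvision.models import swin_t

# torchvision's Swin Transformer: scripting alone succeeds, but the
# TorchScript path of torch_tensorrt raises the ivalues_maps.count(input) error.
model = swin_t().eval().cuda()
inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

scripted = torch.jit.script(model)  # works on its own
optimized = trt.compile(
    scripted,
    ir="torchscript",
    inputs=inputs,
    enabled_precisions={torch.float},
)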

willianck commented 7 months ago

Would appreciate it if a maintainer of this repo could point me in the right direction. @narendasan

bowang007 commented 7 months ago

Hi @willianck could you please share the logs? Thanks!

willianck commented 7 months ago

Sure, here it is.

Error message:

Traceback (most recent call last):
  File "/home/manifold12/Software/benchmark/test_inference/real_time_inference.py", line 228, in <module>
    main()
  File "/home/manifold12/Software/benchmark/test_inference/real_time_inference.py", line 220, in main
    batch_inference(args.model,
  File "/home/manifold12/Software/benchmark/test_inference/real_time_inference.py", line 152, in batch_inference
    optimized_model = trt.compile(
  File "/home/manifold12/blend/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 185, in compile
    compiled_ts_module: torch.jit.ScriptModule = torchscript_compile(
  File "/home/manifold12/blend/lib/python3.10/site-packages/torch_tensorrt/ts/_compiler.py", line 151, in compile
    compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:183] Expected ivalues_maps.count(input) to be true but got false
Could not find torch::jit::Value* attn.21 produced from %attn.21 : Tensor = aten::add(%attn.9, %36217, %46) # /home/manifold12/blend/lib/python3.10/site-packages/torchvision/models/swin_transformer.py:192:11 in lowering graph for mini graph input.

bowang007 commented 7 months ago

@willianck could you please share the full log? It looks like in some blocks of the IR the operations cannot capture the outer variables defined above them. I might have a fix for that. The full log would be very helpful. Thanks!

willianck commented 7 months ago

Got it, here is the full log using the debug log level from torch_tensorrt. I placed it in a file for you to view since it is quite a lot of lines. Let me know if that is okay; I can alternatively paste the logs here. output.txt

willianck commented 7 months ago

I was able to get it to work by using dynamo or torch_compile instead of torchscript in the ir argument of torch_tensorrt.compile(), as shown below.

import torch_tensorrt as trt

optimized_model = trt.compile(
    model,
    ir='dynamo',
    inputs=inputs,
    enabled_precisions=enabled_precisions,
)
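
Hypothetical usage of the compiled module afterwards, assuming inputs is a list of CUDA tensors:

import torch

# optimized_model and inputs are the names from the snippet above.
with torch.no_grad():
    outputs = optimized_model(*inputs)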

With this new implementation I am now experiencing a separate issue. When testing at different batch sizes, the memory consumption is abnormally high and leads to an OOM error at batch sizes I was able to run inference on before (no torch.compile() call). This is similar to issue #1854, I suppose. Something else I noticed is that for certain batch sizes it would not throw an OOM error but would instead throw the error shown in this snippet of the full log:


INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.090343
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:01.634456
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 271888896 bytes of Memory
DEBUG: [Torch-TensorRT] - Serialized Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserializing Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserialized Device Info: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Target Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Setting Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU) as active device
INFO: [Torch-TensorRT] - The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
INFO: [Torch-TensorRT] - Loaded engine size: 20 MiB
DEBUG: [Torch-TensorRT] - Deserialization required 4252 microseconds.
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +20, now: CPU 0, GPU 4910 (MiB)
DEBUG: [Torch-TensorRT] - Total per-runner device persistent memory is 0
DEBUG: [Torch-TensorRT] - Total per-runner host persistent memory is 64
DEBUG: [Torch-TensorRT] - Allocated activation device memory of size 271888896
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +260, now: CPU 0, GPU 5170 (MiB)
DEBUG: [Torch-TensorRT] - CUDA lazy loading is enabled.
DEBUG: [Torch-TensorRT] - Input binding name: roll_13 has TensorRT binding index: 0, Torch binding index: 0
DEBUG: [Torch-TensorRT] - Input binding name: add_57 has TensorRT binding index: 1, Torch binding index: 1
DEBUG: [Torch-TensorRT] - Output binding name: output0 has TensorRT binding index: 3, Torch binding index: 2
DEBUG: [Torch-TensorRT] - Output binding name: output1 has TensorRT binding index: 2, Torch binding index: 3
DEBUG: [Torch-TensorRT] - Torch-TensorRT TensorRT Engine:
  Name: _run_on_acc_28_engine
  Inputs: [
    id: 0
      name: roll_13
      shape: [8, 35, 35, 512]
      dtype: Float
    id: 1
      name: add_57
      shape: [8, 32, 32, 512]
      dtype: Float
  ]
  Outputs: [
    id: 0
      name: output0
      shape: [8, 35, 35, 512]
      dtype: Float
    id: 1
      name: output1
      shape: [8, 32, 32, 512]
      dtype: Float
  }
  Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
  Hardware Compatibility: Disabled

INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.021885
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:00.661630
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 151155200 bytes of Memory
DEBUG: [Torch-TensorRT] - Serialized Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserializing Device Info: 0%8%9%0%NVIDIA GeForce RTX 4090
DEBUG: [Torch-TensorRT] - Deserialized Device Info: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Target Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
DEBUG: [Torch-TensorRT] - Setting Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU) as active device
INFO: [Torch-TensorRT] - The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
INFO: [Torch-TensorRT] - Loaded engine size: 4 MiB
DEBUG: [Torch-TensorRT] - Deserialization required 1199 microseconds.
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +4, now: CPU 0, GPU 5161 (MiB)
DEBUG: [Torch-TensorRT] - Total per-runner device persistent memory is 0
DEBUG: [Torch-TensorRT] - Total per-runner host persistent memory is 32
DEBUG: [Torch-TensorRT] - Allocated activation device memory of size 151155200
INFO: [Torch-TensorRT] - [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +144, now: CPU 0, GPU 5305 (MiB)
DEBUG: [Torch-TensorRT] - CUDA lazy loading is enabled.
DEBUG: [Torch-TensorRT] - Input binding name: roll_14 has TensorRT binding index: 0, Torch binding index: 0
DEBUG: [Torch-TensorRT] - Output binding name: output0 has TensorRT binding index: 1, Torch binding index: 1
DEBUG: [Torch-TensorRT] - Torch-TensorRT TensorRT Engine:
  Name: _run_on_acc_30_engine
  Inputs: [
    id: 0
      name: roll_14
      shape: [8, 35, 35, 512]
      dtype: Float
  ]
  Outputs: [
    id: 0
      name: output0
      shape: [8, 35, 35, 512]
      dtype: Float
  }
  Device: Device(ID: 0, Name: NVIDIA GeForce RTX 4090, SM Capability: 8.9, Type: GPU)
  Hardware Compatibility: Disabled

INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.087953
[01/26/2024-19:39:56] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[[SLICE]-[unknown_ir_ops.slice.Tensor]-[/features/5/__11/attn/slice_301]...[ELEMENTWISE]-[aten_ops.native_layer_norm.default]-[/features/5/__12/norm1/native_layer_norm_35_add_beta]]}.
[01/26/2024-19:39:56] [TRT] [E] 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[[SLICE]-[unknown_ir_ops.slice.Tensor]-[/features/5/__11/attn/slice_301]...[ELEMENTWISE]-[aten_ops.native_layer_norm.default]-[/features/5/__12/norm1/native_layer_norm_35_add_beta]]}.)
Traceback (most recent call last):
  File "/root/test_inference/real_time_inference.py", line 238, in <module>
    main()
  File "/root/test_inference/real_time_inference.py", line 230, in main
    batch_inference(args.model,
  File "/root/test_inference/real_time_inference.py", line 161, in batch_inference
    optimized_model = trt.compile(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/_compile.py", line 228, in compile
    trt_graph_module = dynamo_compile(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/_compiler.py", line 245, in compile
    return compile_module(gm, inputs, settings)
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/_compiler.py", line 415, in compile_module
    trt_module = convert_module(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/_conversion.py", line 75, in convert_module
    interpreter_result = interpret_module_to_result(module, inputs, settings)
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/_conversion.py", line 56, in interpret_module_to_result
    interpreter_result = interpreter.run()
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py", line 256, in run
    assert engine
AssertionError

I would also like to point out that when I ran these tests with the same batch sizes on a GPU with more memory (an RTX A6000), the memory consumption was still very high, but it did not run into any OOM error or the error shown above.
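
Should it help, here is a rough sketch of how I compare peak GPU memory around a single forward pass (the batch size and input shape are placeholders, not from my script):

import torch

# Measure peak GPU memory for one forward pass of the compiled model;
# optimized_model is the module compiled above.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = optimized_model(torch.randn(8, 3, 224, 224, device="cuda"))
torch.cuda.synchronize()
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")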

Environment

I am running all these tests in a container built on top of the PyTorch Torch-TensorRT Docker image as the base image: https://github.com/pytorch/TensorRT

bowang007 commented 7 months ago

Looks like an operator conversion issue. @gs-olive @zewenli98 Do we support this aten_ops.native_layer_norm.default operation?

gs-olive commented 7 months ago

@bowang007 - yes, there is support for that operator; see: https://github.com/pytorch/TensorRT/blob/cf3a6887626c648e5747fdbfa5bc62b361a82b02/py/torch_tensorrt/dynamo/conversion/aten_ops_converters.py#L123-L152

willianck commented 7 months ago

Any updates on this issue? @bowang007

arition commented 6 months ago

Any updates? @bowang007

bowang007 commented 6 months ago

After going through the logs:

  1. For the TorchScript path, it looks like there are some bugs in our partitioning workflow. The TorchScript partitioning workflow was developed several years ago and used a fairly naive greedy algorithm for graph segmentation. Since the TorchScript path is being deprecated, we don't have a plan to fix it.
  2. For the dynamo path, I suspect there is a bug when converting layer_norm to TensorRT layers (a slice layer is introduced). Let me check with our dev team and run this model if possible (a quick isolation sketch follows below).

Thanks!
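
As a side note, a minimal sketch that could exercise the layer_norm conversion in isolation through the dynamo path (the shapes and settings are placeholders, not taken from the model above):

import torch
import torch_tensorrt as trt

# Standalone LayerNorm through the dynamo path, to check the converter on its own.
norm = torch.nn.LayerNorm(512).eval().cuda()
inputs = [torch.randn(8, 35, 35, 512, device="cuda")]

compiled = trt.compile(norm, ir="dynamo", inputs=inputs, enabled_precisions={torch.float})
print(compiled(*inputs).shape)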

arition commented 6 months ago

@bowang007 Thanks for your analysis! Looking forward to finding the actual cause and fixing the bug!