pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] TRT Error when compiling ViT with Dynamic Shape #3016

Closed · Hukongtao closed this 1 month ago

Hukongtao commented 1 month ago

Bug Description

To Reproduce

Minimal reproducible code:

import torch
import torch_tensorrt
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model = model.eval().cuda()

inputs = [
    torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],
        opt_shape=[4, 3, 224, 224],
        max_shape=[16, 3, 224, 224],
        dtype=torch.float32
    )
]
# inputs = torch_tensorrt.Input(shape=[2, 3, 224, 224], dtype=torch.float32)
trt_gm = torch_tensorrt.compile(model, "dynamo", inputs)
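
For context, the goal after a successful compile is to run trt_gm at any batch size inside the declared [1, 16] range. A minimal sketch of that follow-up step (not part of the original repro; output handling is assumed):

# Sketch: exercise the dynamic batch dimension once compilation succeeds.
# Any batch size within [min_shape[0], max_shape[0]] should be accepted.
with torch.no_grad():
    for bs in (1, 4, 16):
        x = torch.randn(bs, 3, 224, 224, device="cuda")
        out = trt_gm(x)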

Expected behavior

The model should compile with dynamic shapes, but instead I got this error:

WARNING:torch_tensorrt.dynamo._compiler:Node scaled_dot_product_attention of op type call_function does not have metadata. This could sometimes lead to undefined behavior.
WARNING:torch_tensorrt.dynamo._compiler:Some nodes do not have metadata (shape and dtype information). This could lead to problems sometimes if the graph has PyTorch and TensorRT segments.
INFO:torch_tensorrt.dynamo._compiler:Partitioning the graph via the fast partitioner
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] Init CUDA: CPU +489, GPU +0, now: CPU 6268, GPU 2121 (MiB)
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageChange] Init builder kernel library: CPU +1906, GPU +354, now: CPU 8327, GPU 2475 (MiB)
WARNING:torch_tensorrt [TensorRT Conversion Context]:CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
WARNING:torch_tensorrt.dynamo.conversion.converter_utils:Detected unparsable type in node formatting: <class 'torch.SymInt'>
WARNING:torch_tensorrt.dynamo.conversion.converter_utils:Detected unparsable type in node formatting: <class 'torch.SymInt'>
WARNING:torch_tensorrt.dynamo.conversion.converter_utils:Detected unparsable type in node formatting: <class 'torch.SymInt'>
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.058611
INFO:torch_tensorrt [TensorRT Conversion Context]:Global timing cache in use. Profiling results in this builder pass will be stored.
ERROR:torch_tensorrt [TensorRT Conversion Context]:IBuilder::buildSerializedNetwork: Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: [SLICE]-[aten_ops.expand.default]-[/vit_embeddings/expand]: ISliceLayer has out of bounds access on axis 0 Condition '<' violated: 3 >= 1.)
Traceback (most recent call last):
  File "/mnt/bn/hukongtao-infer-speed/mlx/users/kongtao.hu/codebase/EasyGuard_0617/speed_vit_test.py", line 27, in <module>
    trt_gm = torch_tensorrt.compile(model, "dynamo", inputs)
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/_compile.py", line 250, in compile
    trt_graph_module = dynamo_compile(
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/_compiler.py", line 243, in compile
    trt_gm = compile_module(gm, inputs, settings)
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/_compiler.py", line 431, in compile_module
    trt_module = convert_module(
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/conversion/_conversion.py", line 107, in convert_module
    interpreter_result = interpret_module_to_result(module, inputs, settings)
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/conversion/_conversion.py", line 88, in interpret_module_to_result
    interpreter_result = interpreter.run()
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py", line 350, in run
    assert serialized_engine
AssertionError

Environment

[screenshot of environment details]

Additional context

For reference, the official documentation on dynamic shapes:
https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html
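
That guide also describes an equivalent torch.export-based route to dynamic shapes. A minimal sketch under assumptions (the pixel_values argument name comes from the HF ViT forward signature, and the Dim bounds mirror the Input ranges above; this is not code copied from the guide):

import torch
import torch_tensorrt
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224").eval().cuda()
example = torch.randn(4, 3, 224, 224, device="cuda")

# Mark the batch dimension as dynamic, then compile the exported program.
batch = torch.export.Dim("batch", min=1, max=16)
exported = torch.export.export(model, (example,), dynamic_shapes={"pixel_values": {0: batch}})
trt_gm = torch_tensorrt.dynamo.compile(exported, inputs=[example])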

peri044 commented 1 month ago

Thanks for the repro. I've fixed this bug in this PR: https://github.com/pytorch/TensorRT/pull/3019

Hukongtao commented 1 month ago

Thank you for your reply! I used the latest version and modified the code according to your PR, but I got another error:

WARNING:torch_tensorrt.dynamo.conversion.converter_utils:Detected unparsable type in node formatting: <class 'torch.SymInt'>
WARNING:torch_tensorrt.dynamo.conversion.converter_utils:Detected unparsable type in node formatting: <class 'torch.SymInt'>
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.039768
INFO:torch_tensorrt [TensorRT Conversion Context]:Global timing cache in use. Profiling results in this builder pass will be stored.
INFO:torch_tensorrt [TensorRT Conversion Context]:Detected 1 inputs and 6 output network tensors.
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Host Persistent Memory: 5552
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Device Persistent Memory: 0
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Scratch Memory: 48365568
INFO:torch_tensorrt [TensorRT Conversion Context]:[BlockAssignment] Started assigning block shifts. This will take 4 steps to complete.
INFO:torch_tensorrt [TensorRT Conversion Context]:[BlockAssignment] Algorithm ShiftNTopDown took 0.031924ms to assign 2 blocks to 4 nodes requiring 61210624 bytes.
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Activation Memory: 61210624
INFO:torch_tensorrt [TensorRT Conversion Context]:Total Weights Memory: 10853632
INFO:torch_tensorrt [TensorRT Conversion Context]:Engine generation completed in 0.123574 seconds.
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 3 MiB, GPU 100 MiB
INFO:torch_tensorrt [TensorRT Conversion Context]:[MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 9363 MiB
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:00.135388
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 11179132 bytes of Memory
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 1496 bytes of code generator cache.
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 157388 bytes of compilation cache.
INFO:torch_tensorrt [TensorRT Conversion Context]:Serialized 16 timing cache entries
WARNING: [Torch-TensorRT] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
Traceback (most recent call last):
  File "/mnt/bn/hukongtao-infer-speed/mlx/users/kongtao.hu/codebase/EasyGuard_0617/speed_vit_test.py", line 17, in <module>
    trt_gm = torch_tensorrt.compile(model, "dynamo", inputs)
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/_compile.py", line 249, in compile
    trt_graph_module = dynamo_compile(
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/_compiler.py", line 243, in compile
    trt_gm = compile_module(gm, inputs, settings)
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/_compiler.py", line 383, in compile_module
    submodule_inputs = partitioning.construct_submodule_inputs(submodule)
  File "/usr/local/lib/python3.9/dist-packages/torch_tensorrt/dynamo/partitioning/common.py", line 124, in construct_submodule_inputs
    raise AssertionError(
AssertionError: Input scaled_dot_product_attention does not contain metadata. Please ensure you have exported the graph correctly

[screenshot of the error]

Hukongtao commented 1 month ago

Looking forward to your reply.

peri044 commented 1 month ago

@Hukongtao This error is because our lowering pass was not copying over the metadata of the attention op to its replaced variant. I've pushed a fix to the same PR: https://github.com/pytorch/TensorRT/pull/3019. Can you give it a try?
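
For anyone hitting the same assertion: the partitioner reads shape/dtype information from each FX node's meta dict, so when a lowering pass swaps a node (here the SDPA op) for a replacement, the replacement has to inherit that metadata. A rough illustration of the idea (generic torch.fx code, not the actual PR; old_target and replacement_op are stand-ins):

# Illustrative sketch: replace an FX node while preserving its metadata.
import torch

def replace_with_metadata(gm: torch.fx.GraphModule, old_target, replacement_op):
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is old_target:
            with gm.graph.inserting_after(node):
                new_node = gm.graph.call_function(replacement_op, node.args, node.kwargs)
            new_node.meta = node.meta.copy()  # keep shape/dtype so the partitioner can see it
            node.replace_all_uses_with(new_node)
            gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm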

Hukongtao commented 1 month ago

> @Hukongtao This error is because our lowering pass was not copying over the metadata of the attention op to its replaced variant. I've pushed a fix to the same PR: #3019. Can you give it a try?

LGTM