pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] compiled model gives different outputs from torch model (used to work on torch_tensorrt 2.2.0) #2989

Open orioninthesky98 opened 2 months ago

orioninthesky98 commented 2 months ago

Bug Description

My model outputs a tuple of mu and logvar. For mu, there are 4 columns (features): 3 features of type A and 1 feature of type B. You can see the FinalEncoder.forward() code in the gist below for the details.

As seen below, for the 3 features of type A, only the first feature matches the PyTorch model; the 2nd and 3rd features are total garbage. The type B feature matches the PyTorch model.
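
For context, here's a minimal sketch of the output layout (names and shapes are made up; the real forward() is in the gist linked below):

import torch

# Hypothetical sketch of the encoder's output layout: it returns
# (mu, logvar), and mu has 4 columns - 3 "type A" features followed
# by 1 "type B" feature.
mu = torch.randn(4, 4)   # (batch, num_features)
type_a = mu[:, :3]       # columns 0-2: type A (only column 0 comes out right)
type_b = mu[:, 3:]       # column 3: type B (matches the PyTorch model)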

This used to work perfectly fine on the previous version of Torch-TensorRT (2.2.0), before I updated to 2.3.0. In fact, if you look at the model code, I had to write the trt_compat_mode path specially for 2.3.0. When I was using 2.2.0, the original PyTorch forward() compiled fine and gave the expected speedups (4 to 5x).

torch mu

tensor([[ 0.1179,  0.2490,  0.0227,  0.7348],
        [ 0.1885,  0.3117, -0.0790, -0.6819],
        [ 0.2545, -0.2422,  0.1816,  1.1018],
        [-0.2488,  0.2577, -0.0928,  0.4927]],

TensorRT mu; the 2nd & 3rd columns are wrong

tensor([[ 0.1182, -0.0108, -0.0108,  0.7333],
        [ 0.1887, -0.0108, -0.0108, -0.6839],
        [ 0.2548, -0.0108, -0.0108,  1.1000],
        [-0.2486, -0.0108, -0.0108,  0.4902]],

To Reproduce

Steps to reproduce the behavior:

  1. Initialize the PyTorch model.
  2. Compile it to TensorRT.
  3. Run inference and compare the outputs against the PyTorch model.

This is the model code: https://gist.github.com/orioninthesky98/d0a987197950bc0b945d28b240d5bc53#file-model-py-L327-L352. The problematic part is highlighted in the gist. You can see the for-loop there, and somehow only the 1st feature (inv_mu / inv_logvar) is correct, but the remaining 2 are garbage.

I've tried unrolling the loop myself (hardcoding the indices passed to torch.index_select()), just in case there was something wrong when tracing the for-loop. It still didn't fix the issue.
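
Roughly, the unrolled version looked like this (a sketch with made-up shapes; the real loop is in the gist):

import torch

x = torch.randn(1024, 3, 1, 40)  # (batch, num_inv_feats, ...) - shapes made up

# Loop form: iterate over the feature axis.
feats = [torch.index_select(x, 1, torch.tensor([i], device=x.device))
         for i in range(3)]

# Unrolled form: hardcode each index, in case tracing the loop itself
# was the problem.
feat0 = torch.index_select(x, 1, torch.tensor([0], device=x.device))
feat1 = torch.index_select(x, 1, torch.tensor([1], device=x.device))
feat2 = torch.index_select(x, 1, torch.tensor([2], device=x.device))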

I tried torch._constrain_as_size() on bs and num_inv_feats, but didn't find success: torch complained that those are not of type SymInt (which would make sense if they are plain Python ints when compiling with static shapes).

I have also tried changing all the .view() calls to .reshape(), but that didn't change anything. I tried adding .clone() and .contiguous(), and that didn't help either.

Also, something weird: I was forced to use torch.index_select(). Previously, in torch_tensorrt 2.2.0, plain slice-indexing compiled just fine, something like curr_input = masked_input[:, i, ...].
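
In eager PyTorch the two forms agree up to the kept dimension, so they should be interchangeable (quick sanity check, assuming dim 1 is the feature axis):

import torch

masked_input = torch.randn(8, 3, 1, 40)
i = 1

sliced = masked_input[:, i, ...]              # what compiled on 2.2.0
gathered = torch.index_select(                # what 2.3.0 forced me to use
    masked_input, 1, torch.tensor([i])
).squeeze(1)                                  # index_select keeps dim 1, so squeeze it

assert torch.equal(sliced, gathered)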

I tried reverting to torch_tensorrt 2.2.0, but very strangely, it rejects the use of torch.index_select()! With 2.2.0 I have to set trt_compat_mode=False, and then it compiles fine AND gives the correct outputs:

pytorch_mu: tensor([[-1.3618e+05,  3.9028e+07,  1.6671e+07, -2.7819e+08],
        [ 1.2645e+07,  2.5498e+07, -2.1328e+07, -3.2754e+08],
        [-1.0710e+07, -1.4777e+07,  5.7531e+06, -2.5132e+08],
        [ 1.6348e+07,  5.0527e+07,  7.3478e+05, -3.3687e+08]], device='cuda:0')
tensorrt_mu: tensor([[-6.5385e+04,  3.9133e+07,  1.6772e+07, -2.7830e+08],
        [ 1.2640e+07,  2.5586e+07, -2.1301e+07, -3.2748e+08],
        [-1.0643e+07, -1.4718e+07,  5.8226e+06, -2.5134e+08],
        [ 1.6426e+07,  5.0602e+07,  7.4901e+05, -3.3704e+08]], device='cuda:0')

For the compilation I am using this code:

import torch
import torch_tensorrt as trt  # trt.compile below is torch_tensorrt.compile

device = torch.device("cuda")

minibatch_size = 1024
net_input_shape = (1, 1, 1, 40)
x_rand = torch.rand((minibatch_size,) + tuple(net_input_shape))
x_rand = x_rand.to(device)
trt_model = trt.compile(
    encoder,                             # the FinalEncoder instance from the gist
    inputs=[x_rand],
    enabled_precisions={torch.float32},  # keep everything in FP32
    optimization_level=5,
    use_fast_partitioner=True,
    dynamic=False,                       # static input shapes
    disable_tf32=True,                   # rule out TF32 rounding differences
)

Expected behavior

The compiled model's outputs need to match the torch model's outputs, at least approximately.
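
For example, continuing from the compile snippet above, a check along these lines should pass (a sketch using torch.testing; the tolerances are a loose guess for FP32 with TF32 disabled):

import torch

# Compare the compiled model against the eager model on the same input
# (encoder, x_rand, trt_model are from the compile snippet above).
with torch.no_grad():
    torch_mu, torch_logvar = encoder(x_rand)
    trt_mu, trt_logvar = trt_model(x_rand)

torch.testing.assert_close(trt_mu, torch_mu, rtol=1e-3, atol=1e-3)
torch.testing.assert_close(trt_logvar, torch_logvar, rtol=1e-3, atol=1e-3)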

Environment

Build information about Torch-TensorRT can be found by turning on debug messages
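
For example (a sketch; as far as I know, torch_tensorrt.logging exposes debug/info/warnings/graphs context managers, and the dynamo path also accepts debug=True):

import torch
import torch_tensorrt

# Wrap compilation in the debug-logging context manager to dump build
# information (encoder and x_rand as in the compile snippet above).
with torch_tensorrt.logging.debug():
    trt_model = torch_tensorrt.compile(
        encoder,
        inputs=[x_rand],
        enabled_precisions={torch.float32},
    )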

Additional context

zewenli98 commented 2 months ago

Hi @orioninthesky98, thanks for the details. I'm able to get the same results from the torch_tensorrt and pytorch models using the repro you gave (with small changes). [image: matching torch_tensorrt and pytorch outputs]

Here's what I did:

  1. Uncommented this line (otherwise there's a type error): https://gist.github.com/orioninthesky98/d0a987197950bc0b945d28b240d5bc53#file-model-py-L342. I didn't touch any other code.
  2. Ran the inference code:

import torch
import torch_tensorrt

from model import FinalEncoder  # the encoder from the gist (model.py)

encoder = FinalEncoder().to("cuda")
encoder.eval()
minibatch_size = 1024
net_input_shape = (1, 1, 1, 40)

x_rand = torch.rand((minibatch_size,) + tuple(net_input_shape))
x_rand = x_rand.to("cuda")
trt_model = torch_tensorrt.compile(
    encoder,
    inputs=[x_rand],
    enabled_precisions={torch.float32},
    optimization_level=5,
    use_fast_partitioner=True,
    dynamic=False,
    disable_tf32=True,
)
print("==================== trt_model mu ====================")
print(trt_model(x_rand)[0])
print("==================== torch_model mu ====================")
print(encoder(x_rand)[0])

Then I can get the same results.

For your reference, here's my env:

tensorrt                      10.0.1
torch                         2.5.0.dev20240703+cu121
torch_tensorrt                2.5.0.dev0+feb4d84ff  (main branch as of today)
torchvision                   0.20.0.dev20240703+cu121

I recommend testing again with the latest Torch-TRT main branch or a recent nightly build. Please let me know if you still get the same issue.
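
For example, something along these lines pulls the nightly builds (the index URL here assumes CUDA 12.1; adjust for your setup):

pip install --pre torch torchvision torch-tensorrt --extra-index-url https://download.pytorch.org/whl/nightly/cu121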