pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

❓ [Question] Running the same TorchScript module with the same input produces different results #862

Closed SeTriones closed 2 years ago

SeTriones commented 2 years ago

❓ Question

I'm trying to run a pretrained resnet50 model from torchvision.models with enabled_precisions set to torch.half. Each time I load the same resnet50 TorchScript module and use the same input (all zeros, created with np.zeros). But after running several times I've found that the output is not stable.

What you have already tried

I've tried two ways:

  1. Load the same resnet50 TorchScript module and compile it, then do the inference. The output is not stable.
  2. Save the compiled module, then load it each time and do the inference. The output is stable (see the sketch below).

I wonder whether there is some random behavior in torch_tensorrt.compile() when enabled_precisions is set to torch.half.
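For reference, a minimal sketch of the second approach (the file paths and the static-shape Input form are illustrative, not from the original report):

import torch
import torch_tensorrt

# Compile once; the result is a regular TorchScript module with the
# TensorRT engine embedded, so torch.jit.save/load can persist it.
trt_ts_module = torch_tensorrt.compile(
    torch.jit.load('torch_script_module.ts'),
    inputs=[torch_tensorrt.Input(shape=[1, 3, 224, 224], dtype=torch.float32)],
    enabled_precisions={torch.half},
)
torch.jit.save(trt_ts_module, 'trt_ts_module.ts')

# Later runs reload the already-built engine, so kernel selection is
# fixed and the outputs are reproducible.
trt_ts_module = torch.jit.load('trt_ts_module.ts').cuda()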

Environment

Additional context

The python code producing unstable result is as below:

from torchvision import models
import numpy as np
import torch
import torch_tensorrt
import time

input = np.zeros((1, 3, 224, 224)).astype(np.float32)
input = torch.from_numpy(input).cuda()

torch_script_module = torch.jit.load('torch_script_module.ts')

trt_ts_module = torch_tensorrt.compile(torch_script_module,
    inputs=[
        torch_tensorrt.Input(  # Specify input object with shape and dtype
            min_shape=[1, 3, 224, 224],
            opt_shape=[1, 3, 224, 224],
            max_shape=[1, 3, 224, 224],
            # For static size: shape=[1, 3, 224, 224]
            dtype=torch.float32)  # Datatype of input tensor. Allowed options: torch.(float|half|int8|int32|bool)
    ],
    enabled_precisions={torch.half})  # Run with FP16

result = trt_ts_module(input)  # warm-up inference

torch.cuda.synchronize()  # make sure pending GPU work is done before timing
t1 = time.time()
for i in range(1000):
    result = trt_ts_module(input)  # run inference
torch.cuda.synchronize()  # wait for the last launch to finish
t2 = time.time()
print('result', result[0][0])
print("Cost: ", round(t2 - t1, 4))

Two runs of the script produce different outputs. Iteration 1:

WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected invalid timing cache, setup a local cache instead
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
result tensor(-0.4390, device='cuda:0')
Cost:  1.3429

Iteration 2:

WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected invalid timing cache, setup a local cache instead
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
result tensor(-0.4463, device='cuda:0')
Cost:  1.3206

ncomly-nvidia commented 2 years ago

TensorRT performs "kernel auto-tuning" which essentially selects the fastest kernels for your models on your specific device. There can be a small amount of jitter in this step, for a variety of reasons, leading to different kernels being selected & thus different perf.

You can check that the selected kernels are in fact different to confirm this.

Also, this looks like ~1.5% perf jitter for your model. Is this an issue in your application, or is this just out of curiosity? Have you seen larger variance between runs?
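One way to confirm the variance comes from compilation rather than inference (a sketch, reusing the module path from the report above) is to build two engines in the same process and compare their outputs on identical input:

import torch
import torch_tensorrt

def build():
    # Each call re-runs kernel auto-tuning, so the two engines may
    # select different tactics.
    return torch_tensorrt.compile(
        torch.jit.load('torch_script_module.ts'),
        inputs=[torch_tensorrt.Input(shape=[1, 3, 224, 224], dtype=torch.float32)],
        enabled_precisions={torch.half},
    )

x = torch.zeros(1, 3, 224, 224, device='cuda')
out_a = build()(x)
out_b = build()(x)
# Differing FP16 tactics typically show up as small absolute deviations.
print('max abs diff:', (out_a - out_b).abs().max().item())

If your torch_tensorrt build exposes the logging helpers, raising the log level during compilation also shows which tactics were selected.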

SeTriones commented 2 years ago

@ncomly-nvidia this is just out of curiosity. I'm doing more experiments on the following model architectures:

efficientnet-b2, vit, yolov5s (v6.0), yolov5m (v6.0), yolov5x (v6.0), tsm (batch 16), SwinTransformer3D, bert, transformer

github-actions[bot] commented 2 years ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

ncomly-nvidia commented 2 years ago

Hi @SeTriones, how have your other experiments gone? Are there other discrepancies in results or performance in the models you listed above that concern you?

github-actions[bot] commented 2 years ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.