pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] Poor quantization accuracy of resnet50 model INT8 using the Torch-TRT PTQ method #1229

Closed lixiaolx closed 1 year ago

lixiaolx commented 2 years ago

Bug Description

Following the reference PTQ test demo (https://github.com/pytorch/TensorRT/tree/master/tests/py/ptq), I performed INT8 quantization on resnet50 and compared the inference results against the original FP32 model; the accuracy differs significantly.

Test code to reproduce:

import os
import torch_tensorrt as torchtrt
from torch_tensorrt.logging import *
import torch
import tensorrt as trt
import torch.nn as nn
from torch.nn import functional as F
import torchvision
import torchvision.transforms as transforms
import timm

torchtrt.logging.set_reportable_log_level(torchtrt.logging.Level.Graph)

net = timm.create_model('resnet50', pretrained=False)

# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py

checkpoint_path = './resnet50_a1_0-14fe96d1.pth'
weights = torch.load(checkpoint_path)
net.load_state_dict(weights, strict=True)

net = net.cuda()
model = torch.jit.script(net).eval()
dummy_min = torch.rand(1, 3, 224, 224).cuda()

class TRTEntropyCalibrator(trt.IInt8EntropyCalibrator2):

    def __init__(self, dataloader, **kwargs):
        trt.IInt8EntropyCalibrator2.__init__(self)

        self.cache_file = kwargs.get("cache_file", None)
        self.use_cache = kwargs.get("use_cache", False)
        self.device = kwargs.get("device", torch.device("cuda:0"))

        self.dataloader = dataloader
        self.dataset_iterator = iter(dataloader)
        self.batch_size = dataloader.batch_size
        self.current_batch_idx = 0

    def get_batch_size(self):
        return self.batch_size

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_batch_idx + self.batch_size > self.dataloader.dataset.data.shape[0]:
            return None

        batch = next(self.dataset_iterator)
        self.current_batch_idx += self.batch_size
        # Treat the first element as input and others as targets.
        if isinstance(batch, list):
            batch = batch[0].to(self.device)
        return [batch.data_ptr()]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if self.use_cache:
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        if self.cache_file:
            with open(self.cache_file, "wb") as f:
                f.write(cache)

testing_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]))

testing_dataloader = torch.utils.data.DataLoader(testing_dataset, batch_size=1, shuffle=False, num_workers=1)
calibrator = TRTEntropyCalibrator(testing_dataloader)

compile_spec = {
    "inputs": [dummy_min],
    "enabled_precisions": {torch.float, torch.int8},
    "calibrator": calibrator,
    "truncate_long_and_double": True,
    "device": {
        "device_type": torchtrt.DeviceType.GPU,
        "gpu_id": 0,
        "dla_core": 0,
        "allow_gpu_fallback": False,
    },
}

qua_net = torchtrt.ts.compile(model, **compile_spec)

native_net = model.cuda()
dummy_min = torch.rand(1, 3, 224, 224)
native_out = native_net(dummy_min.cuda())
print("native_out")
print(native_out)
aiak_out = qua_net(dummy_min.cuda())
print("qua_net")
print(aiak_out)

peri044 commented 2 years ago

1. The resnet model from timm is trained on the ImageNet dataset with an image size of 224x224.
2. You are using CIFAR-10 for calibration, which has an image size of 32x32. Ideally the calibration dataset should be a subset of the ImageNet validation set; can you verify that you are using representative data? The reference scripts use VGG16 trained on CIFAR-10 itself. See the sketch below for one way to build an ImageNet-based calibration loader.
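A minimal sketch of a representative calibration loader, assuming an ImageNet-style validation directory at ./val; the path, subset size, batch size, and preprocessing values below are illustrative and not part of the reference demo:

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset

# Standard ImageNet-style preprocessing so calibration sees the same 224x224 inputs as inference.
calib_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_dataset = torchvision.datasets.ImageFolder('./val', transform=calib_transforms)

# A few hundred representative images are usually enough for PTQ calibration.
calib_indices = torch.randperm(len(val_dataset))[:512].tolist()
calib_dataset = Subset(val_dataset, calib_indices)
calib_dataloader = DataLoader(calib_dataset, batch_size=32, shuffle=False, num_workers=4)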

lixiaolx commented 2 years ago
  1. The resnet model from timm is trained on the ImageNet dataset with an image size of 224x224.
  2. You are using CIFAR-10 for calibration, which has an image size of 32x32. Ideally the calibration dataset should be a subset of the ImageNet validation set; can you verify that you are using representative data? The reference scripts use VGG16 trained on CIFAR-10 itself.

I tried quantizing resnet50 with the ImageNet dataset and found that the results still differ significantly from the FP32 results before quantization. Please help follow up and improve the quantization accuracy of Torch-TensorRT.

Here is my experiment:
Dataset: I tested both the full ImageNet validation set and random subsets of it (500 and 1000 images).
Calibration methods tried: 1. dataloader_calibrator 2. trt_calibrator (see the sketch below).
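For reference, a minimal sketch of the dataloader_calibrator variant, assuming the torch_tensorrt.ptq.DataLoaderCalibrator API from the Torch-TensorRT PTQ documentation and reusing model and testing_dataloader from the script further below; the cache file name is illustrative:

import torch
import torch_tensorrt as torchtrt

# Build a calibrator directly from a PyTorch DataLoader instead of subclassing
# trt.IInt8EntropyCalibrator2 by hand.
calibrator = torchtrt.ptq.DataLoaderCalibrator(
    testing_dataloader,
    cache_file="./resnet50_calibration.cache",  # illustrative file name
    use_cache=False,
    algo_type=torchtrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

compile_spec = {
    "inputs": [torchtrt.Input((64, 3, 224, 224))],
    "enabled_precisions": {torch.float, torch.int8},
    "calibrator": calibrator,
    "truncate_long_and_double": True,
}
qua_net = torchtrt.ts.compile(model, **compile_spec)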

Two questions: after running Torch-TRT PTQ, what test-accuracy loss do you measure? Is the precision loss after quantization within expectations?

peri044 commented 2 years ago

After running Torch-TRT PTQ, what test-accuracy loss do you measure? Is the precision loss after quantization within expectations?

We didn't run an RN50 PTQ test on ImageNet in Torch-TRT, but in ONNX-TRT the INT8 accuracy of ResNet PTQ should be very close to FP32. Here is the reference: https://github.com/NVIDIA/TensorRT/tree/main/tools/tensorflow-quantization/examples/resnet#resnet50-v2

Can you provide an end-to-end script of your INT8 PTQ test, including the ImageNet data loading and processing, so that we can reproduce this bug?

What's the accuracy difference between FP32 and INT8 that you observe?
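For reference, a minimal sketch of measuring that gap by evaluating both modules on the same validation loader; evaluate_top1, val_loader, fp32_model, and int8_model are illustrative names, not from this thread:

import torch

def evaluate_top1(module, loader, device="cuda:0"):
    # Count top-1 hits over the whole validation loader.
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = module(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total

# fp32_acc = evaluate_top1(fp32_model, val_loader)   # TorchScript FP32 baseline
# int8_acc = evaluate_top1(int8_model, val_loader)   # Torch-TRT INT8 module
# print(f"FP32 top-1: {fp32_acc:.2f}%  INT8 top-1: {int8_acc:.2f}%  drop: {fp32_acc - int8_acc:.2f}")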

lixiaolx commented 2 years ago

@peri044 Hello, my test results and test code are as follows:

What's the accuracy difference between FP32 and INT8 that you observe?

I compared FP32 and INT8 with batch_size=64 on the ImageNet-1k validation set (50k images). The results are as follows:

Native FP32, bs=64: Acc@1 71.341 (28.659) Acc@5 88.570 (11.430)
INT8, bs=64: Acc@1 2.153 (97.847) Acc@5 4.580 (95.421)

Comparison of a single inference with batch_size=64:

native_out
tensor([[ -9.1032,  -7.7718,  -6.5369,  ..., -10.2060,  -8.6407,  -5.4662],
        [ -8.9152,  -8.2441,  -7.0621,  ...,  -9.7499,  -8.9928,  -5.6536],
        [ -9.2188,  -8.0838,  -6.5408,  ..., -10.1868,  -8.7118,  -5.4275],
        ...,
        [ -8.7292,  -7.8087,  -6.5613,  ...,  -9.5876,  -8.5580,  -5.5388],
        [ -9.0750,  -8.2901,  -6.8502,  ...,  -9.8999,  -8.8651,  -5.4163],
        [ -9.1622,  -7.8823,  -6.9106,  ..., -10.2783,  -8.8240,  -5.9388]], device='cuda:0')

qua_net
tensor([[-9.6645, -6.5313, -7.2518,  ..., -9.8984, -8.2882, -4.1683],
        [-9.5294, -6.6310, -7.3841,  ..., -9.7229, -8.3078, -4.2323],
        [-9.4890, -6.5426, -7.2658,  ..., -9.5910, -8.1462, -4.1049],
        ...,
        [-9.5021, -6.8150, -7.2517,  ..., -9.8203, -8.3972, -4.2234],
        [-9.5908, -6.6682, -7.3966,  ..., -9.8232, -8.3988, -4.2732],
        [-9.4246, -6.5108, -7.2799,  ..., -9.6772, -8.1263, -4.1765]], device='cuda:0')

abs_tensor:
tensor([[0.5613, 1.2405, 0.7149,  ..., 0.3075, 0.3525, 1.2979],
        [0.6142, 1.6131, 0.3220,  ..., 0.0269, 0.6850, 1.4213],
        [0.2702, 1.5412, 0.7250,  ..., 0.5958, 0.5657, 1.3226],
        ...,
        [0.7729, 0.9937, 0.6904,  ..., 0.2327, 0.1608, 1.3155],
        [0.5158, 1.6219, 0.5464,  ..., 0.0767, 0.4662, 1.1431],
        [0.2624, 1.3716, 0.3693,  ..., 0.6011, 0.6977, 1.7623]], device='cuda:0')

Below is my test code:

import os
import torch_tensorrt as torchtrt
from torch_tensorrt.logging import *
import torch
import tensorrt as trt
import torch.nn as nn
from torch.nn import functional as F
import torchvision
import torchvision.transforms as transforms
import timm
from torchvision.datasets import ImageFolder
import time
from timm.utils import accuracy, AverageMeter
from torch.utils.data import DataLoader
from tqdm import tqdm
from collections import OrderedDict

torchtrt.logging.set_reportable_log_level(torchtrt.logging.Level.Graph)

net = timm.create_model('resnet50', pretrained=False)

# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py

checkpoint_path = './resnet50_a1_0-14fe96d1.pth'
weights = torch.load(checkpoint_path)
net.load_state_dict(weights, strict=True)
net = net.cuda()
model = torch.jit.script(net).eval()
dummy_min = torch.rand(64, 3, 224, 224).cuda()

class TRTEntropyCalibrator(trt.IInt8EntropyCalibrator2):

    def __init__(self, dataloader, **kwargs):
        trt.IInt8EntropyCalibrator2.__init__(self)

        self.cache_file = kwargs.get("cache_file", None)
        self.use_cache = kwargs.get("use_cache", False)
        self.device = kwargs.get("device", torch.device("cuda:0"))

        self.dataloader = dataloader
        self.dataset_iterator = iter(dataloader)
        self.batch_size = dataloader.batch_size
        self.current_batch_idx = 0

    def get_batch_size(self):
        return self.batch_size

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_batch_idx + self.batch_size > len(self.dataloader.dataset):
            return None

        batch = next(self.dataset_iterator)
        self.current_batch_idx += self.batch_size
        # Treat the first element as input and others as targets.
        if isinstance(batch, list):
            batch = batch[0].to(self.device)
        return [batch.data_ptr()]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if self.use_cache:
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        if self.cache_file:
            with open(self.cache_file, "wb") as f:
                f.write(cache)

val_transforms = torchvision.transforms.Compose([
    torchvision.transforms.Resize(224),
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

root = './val'
testing_dataset = ImageFolder(root, transform=val_transforms)
testing_dataloader = torch.utils.data.DataLoader(testing_dataset, batch_size=64, shuffle=False, num_workers=8)
calibrator = TRTEntropyCalibrator(testing_dataloader)

compile_spec = {
    "inputs": [dummy_min],
    "enabled_precisions": {torch.float, torch.int8},
    "calibrator": calibrator,
    "truncate_long_and_double": True,
    "device": {
        "device_type": torchtrt.DeviceType.GPU,
        "gpu_id": 0,
        "dla_core": 0,
        "allow_gpu_fallback": False,
    },
}

qua_net = torchtrt.ts.compile(model, **compile_spec)

native_net = model.cuda()
dummy_min = torch.rand(64, 3, 224, 224)
with torch.no_grad():
    native_out = native_net(dummy_min.cuda())
print("native_out")
print(native_out)

with torch.no_grad():
    qua_out = qua_net(dummy_min.cuda())
print("qua_net")
print(qua_out)

sub_result = torch.sub(native_out, qua_out)
abs_tensor = torch.abs(sub_result)
print("abs_tensor:\n")
print(abs_tensor)

drop_last = True
num_workers = 8
batch_size = 64

batch_time = AverageMeter()
losses = AverageMeter()
top1 = AverageMeter()
top5 = AverageMeter()
dataset = ImageFolder(root, transform=val_transforms)
loader = DataLoader(dataset, batch_size, drop_last=drop_last, num_workers=num_workers)

for batch_idx, batch in enumerate(tqdm(loader)):
    img, label = batch
    end = time.time()
    img, label = img.cuda(), label.cuda()
    with torch.no_grad():
        outputs = qua_net(img)
    acc1, acc5 = accuracy(outputs.detach(), label, topk=(1, 5))
    losses.update(0., outputs.size(0))
    top1.update(acc1.item(), outputs.size(0))
    top5.update(acc5.item(), outputs.size(0))
    batch_time.update(time.time() - end)

top1a, top5a = top1.avg, top5.avg
results = OrderedDict(
    model="UTF-8",
    top1=round(top1a, 4), top1_err=round(100 - top1a, 4),
    top5=round(top5a, 4), top5_err=round(100 - top5a, 4),
    img_size=batch_size)

print("qua-benchmark result:")
line = 'batch_size:{} * Acc@1 {:.3f} ({:.3f})\t Acc@5 {:.3f} ({:.3f})\t time@avg{:.4f}'.format(
    batch_size, results['top1'], results['top1_err'],
    results['top5'], results['top5_err'], batch_time.avg)
print("imageNet1k, validation 5w result:")
print(line)

lixiaolx commented 2 years ago

@peri044 Hello, do you have any test scripts for PTQ INT8 quantization of resnet50? Could you post the relevant code?

ncomly-nvidia commented 2 years ago

CC: @tanayvarshney

github-actions[bot] commented 2 years ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented 1 year ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.

peri044 commented 1 year ago

@lixiaolx Do you still see the same error with the latest main?

github-actions[bot] commented 1 year ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.