- The ResNet model from timm is trained on the ImageNet dataset with an image size of 224x224.
- You are using CIFAR-10 for calibration, which has an image size of 32x32. Ideally the calibration dataset should be a subset of the ImageNet validation set (see the sketch after this list). Can you verify that you are using representative data? The reference scripts use a VGG16 trained on CIFAR-10 itself.
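For illustration, a minimal sketch of building such a calibration loader from a small random subset of the ImageNet validation set (the `./val` path, subset size, and batch size are placeholders, not values from this issue):

```python
import torch
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder

# Preprocessing matching what the ImageNet-trained model expects at inference time.
calib_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet validation directory (placeholder path).
val_dataset = ImageFolder("./val", transform=calib_transforms)

# Randomly sample ~500 images so calibration stays cheap but still representative.
generator = torch.Generator().manual_seed(0)
indices = torch.randperm(len(val_dataset), generator=generator)[:500].tolist()
calib_subset = torch.utils.data.Subset(val_dataset, indices)

calib_loader = torch.utils.data.DataLoader(calib_subset, batch_size=32,
                                           shuffle=False, num_workers=4)
```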
I tried to quantize ResNet50 with the ImageNet dataset and found that the quantized results still differ significantly from the FP32 results before quantization. Please help follow up and improve the quantization accuracy of Torch-TensorRT.
Here is my experiment:
- Dataset: the full ImageNet validation set, as well as random subsets of 500 and 1000 images.
- Verification methods: 1. dataloader_calibrator, 2. trt_calibrator (see the sketch of the dataloader-calibrator path below).
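For context, a minimal sketch of the dataloader-calibrator path, assuming the `torch_tensorrt.ptq.DataLoaderCalibrator` API described in the Torch-TensorRT PTQ documentation (`testing_dataloader`, `model`, and the cache-file path are placeholders defined elsewhere):

```python
import torch
import torch_tensorrt as torchtrt

# Build the INT8 calibrator directly from a PyTorch DataLoader instead of
# hand-writing a trt.IInt8EntropyCalibrator2 subclass.
calibrator = torchtrt.ptq.DataLoaderCalibrator(
    testing_dataloader,                    # DataLoader yielding (image, label) batches
    cache_file="./calibration.cache",      # placeholder path for the calibration cache
    use_cache=False,
    algo_type=torchtrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

compile_spec = {
    "inputs": [torchtrt.Input((64, 3, 224, 224))],
    "enabled_precisions": {torch.float, torch.int8},
    "calibrator": calibrator,
}
trt_int8_model = torchtrt.ts.compile(model, **compile_spec)  # `model` is the scripted FP32 network
```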
I would also like to ask two questions: what loss of test accuracy do you see after converting with Torch-TRT, and is the loss of precision after quantization as expected?
We didn't run an RN50 PTQ test on ImageNet in Torch-TRT, but based on ONNX-TRT results, the INT8 accuracy of ResNet PTQ should be very close to FP32. Here is the reference: https://github.com/NVIDIA/TensorRT/tree/main/tools/tensorflow-quantization/examples/resnet#resnet50-v2
Can you provide an end-to-end script of your INT8 PTQ test, including the ImageNet data loading and preprocessing, so that we can reproduce this bug?
What's the accuracy difference between FP32 and INT8 that you observe?
@peri044 Hello, my test results and test code are as follows:
> What's the accuracy difference between FP32 and INT8 that you observe?
I compared FP32 and INT8 with batch_size=64 on the full ImageNet-1k validation set (50k images); the numbers in parentheses are the corresponding error rates:

- Native FP32, bs=64: Acc@1 71.341 (28.659), Acc@5 88.570 (11.430)
- INT8, bs=64: Acc@1 2.153 (97.847), Acc@5 4.580 (95.421)
Comparison of a single inference with batch_size=64:

```
native_out
tensor([[ -9.1032,  -7.7718,  -6.5369,  ..., -10.2060,  -8.6407,  -5.4662],
        [ -8.9152,  -8.2441,  -7.0621,  ...,  -9.7499,  -8.9928,  -5.6536],
        [ -9.2188,  -8.0838,  -6.5408,  ..., -10.1868,  -8.7118,  -5.4275],
        ...,
        [ -8.7292,  -7.8087,  -6.5613,  ...,  -9.5876,  -8.5580,  -5.5388],
        [ -9.0750,  -8.2901,  -6.8502,  ...,  -9.8999,  -8.8651,  -5.4163],
        [ -9.1622,  -7.8823,  -6.9106,  ..., -10.2783,  -8.8240,  -5.9388]], device='cuda:0')
qua_net
tensor([[-9.6645, -6.5313, -7.2518,  ..., -9.8984, -8.2882, -4.1683],
        [-9.5294, -6.6310, -7.3841,  ..., -9.7229, -8.3078, -4.2323],
        [-9.4890, -6.5426, -7.2658,  ..., -9.5910, -8.1462, -4.1049],
        ...,
        [-9.5021, -6.8150, -7.2517,  ..., -9.8203, -8.3972, -4.2234],
        [-9.5908, -6.6682, -7.3966,  ..., -9.8232, -8.3988, -4.2732],
        [-9.4246, -6.5108, -7.2799,  ..., -9.6772, -8.1263, -4.1765]], device='cuda:0')
abs_tensor:
tensor([[0.5613, 1.2405, 0.7149,  ..., 0.3075, 0.3525, 1.2979],
        [0.6142, 1.6131, 0.3220,  ..., 0.0269, 0.6850, 1.4213],
        [0.2702, 1.5412, 0.7250,  ..., 0.5958, 0.5657, 1.3226],
        ...,
        [0.7729, 0.9937, 0.6904,  ..., 0.2327, 0.1608, 1.3155],
        [0.5158, 1.6219, 0.5464,  ..., 0.0767, 0.4662, 1.1431],
        [0.2624, 1.3716, 0.3693,  ..., 0.6011, 0.6977, 1.7623]], device='cuda:0')
```
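Since raw tensor dumps are hard to compare directly, here is a small illustrative sketch of summarizing the gap between the two outputs (`native_out` and `qua_out` are the tensors produced by the script below; the helper itself is hypothetical, not part of the original script):

```python
import torch

def summarize_gap(native_out: torch.Tensor, qua_out: torch.Tensor) -> None:
    """Print scalar statistics describing how far the INT8 output is from FP32."""
    diff = (native_out - qua_out).abs()
    print(f"mean abs diff : {diff.mean().item():.4f}")
    print(f"max abs diff  : {diff.max().item():.4f}")
    # How often the two models agree on the argmax class within this batch.
    agreement = (native_out.argmax(dim=1) == qua_out.argmax(dim=1)).float().mean()
    print(f"top-1 agreement: {agreement.item() * 100:.1f}%")
```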
Below is my test code:

```python
import os
import time
from collections import OrderedDict

import torch
import torch.nn as nn
from torch.nn import functional as F
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

import tensorrt as trt
import torch_tensorrt as torchtrt
from torch_tensorrt.logging import *

import timm
from timm.utils import accuracy, AverageMeter
from tqdm import tqdm

# Load the pretrained ResNet50 weights from timm and script the model.
net = timm.create_model('resnet50', pretrained=False)
checkpoint_path = './resnet50_a1_0-14fe96d1.pth'
weights = torch.load(checkpoint_path)
net.load_state_dict(weights, strict=True)
net = net.cuda()
model = torch.jit.script(net).eval()
dummy_min = torch.rand(64, 3, 224, 224).cuda()


class TRTEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, dataloader, **kwargs):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = kwargs.get("cache_file", None)
        self.use_cache = kwargs.get("use_cache", False)
        self.device = kwargs.get("device", torch.device("cuda:0"))
        self.dataloader = dataloader
        self.dataset_iterator = iter(dataloader)
        self.batch_size = dataloader.batch_size
        self.current_batch_idx = 0

    def get_batch_size(self):
        return self.batch_size

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_batch_idx + self.batch_size > len(self.dataloader.dataset):
            return None
        batch = next(self.dataset_iterator)
        self.current_batch_idx += self.batch_size
        # Treat the first element as input and others as targets.
        if isinstance(batch, list):
            batch = batch[0].to(self.device)
        return [batch.data_ptr()]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if self.use_cache:
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        if self.cache_file:
            with open(self.cache_file, "wb") as f:
                f.write(cache)


# ImageNet validation data used both for calibration and for evaluation.
val_transforms = torchvision.transforms.Compose([
    torchvision.transforms.Resize(224),
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
])
root = './val'
testing_dataset = ImageFolder(root, transform=val_transforms)
testing_dataloader = torch.utils.data.DataLoader(testing_dataset, batch_size=64,
                                                 shuffle=False, num_workers=8)
calibrator = TRTEntropyCalibrator(testing_dataloader)

# Compile the scripted model with Torch-TensorRT in INT8.
compile_spec = {
    "inputs": [dummy_min],
    "enabled_precisions": {torch.float, torch.int8},
    "calibrator": calibrator,
    "truncate_long_and_double": True,
    "device": {
        "device_type": torchtrt.DeviceType.GPU,
        "gpu_id": 0,
        "dla_core": 0,
        "allow_gpu_fallback": False,
    },
}
qua_net = torchtrt.ts.compile(model, **compile_spec)

# Compare a single forward pass of the native FP32 model and the INT8 engine.
native_net = model.cuda()
dummy_min = torch.rand(64, 3, 224, 224)
with torch.no_grad():
    native_out = native_net(dummy_min.cuda())
print("native_out")
print(native_out)
with torch.no_grad():
    qua_out = qua_net(dummy_min.cuda())
print("qua_net")
print(qua_out)

sub_result = torch.sub(native_out, qua_out)
abs_tensor = torch.abs(sub_result)
print("abs_tensor:\n")
print(abs_tensor)

# Evaluate top-1 / top-5 accuracy of the INT8 engine on the ImageNet validation set.
drop_last = True
num_workers = 8
batch_size = 64

batch_time = AverageMeter()
losses = AverageMeter()
top1 = AverageMeter()
top5 = AverageMeter()
dataset = ImageFolder(root, transform=val_transforms)
loader = DataLoader(dataset, batch_size, drop_last=drop_last, num_workers=num_workers)

for batch_idx, batch in enumerate(tqdm(loader)):
    img, label = batch
    end = time.time()
    img, label = img.cuda(), label.cuda()
    with torch.no_grad():
        outputs = qua_net(img)
    acc1, acc5 = accuracy(outputs.detach(), label, topk=(1, 5))
    losses.update(0., outputs.size(0))
    top1.update(acc1.item(), outputs.size(0))
    top5.update(acc5.item(), outputs.size(0))
    batch_time.update(time.time() - end)
top1a, top5a = top1.avg, top5.avg

results = OrderedDict(
    model="UTF-8",
    top1=round(top1a, 4), top1_err=round(100 - top1a, 4),
    top5=round(top5a, 4), top5_err=round(100 - top5a, 4),
    img_size=batch_size)
print("qua-benchmark result:")
line = 'batch_size:{} * Acc@1 {:.3f} ({:.3f})\t Acc@5 {:.3f} ({:.3f})\t time@avg{:.4f}'.format(
    batch_size, results['top1'], results['top1_err'],
    results['top5'], results['top5_err'], batch_time.avg)
print("imageNet1k, validation 5w result:")
print(line)
```
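For comparison, the preprocessing commonly used when reporting ResNet50 top-1 accuracy on the ImageNet validation set is deterministic (resize plus center crop) rather than `RandomResizedCrop`; a minimal sketch of that conventional torchvision recipe (not taken from the script above):

```python
import torchvision.transforms as transforms

# Deterministic ImageNet evaluation preprocessing: resize the short side to 256,
# take a 224x224 center crop, then normalize with ImageNet statistics.
eval_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```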
@peri044 Hello, do you have any test scripts for PTQ INT8 quantization of ResNet50? Could you post the relevant code?
CC: @tanayvarshney
This issue has not seen activity for 90 days, Remove stale label or comment or this will be closed in 10 days
@lixiaolx Do you still see the same error with the latest main?
This issue has not seen activity for 90 days, Remove stale label or comment or this will be closed in 10 days
Bug Description
Following the reference test demo (https://github.com/pytorch/TensorRT/tree/master/tests/py/ptq), I performed INT8 quantization on ResNet50 and compared the inference results with the original FP32 model; the accuracy differs significantly.
Test code:
```python
import os

import torch
import torch.nn as nn
from torch.nn import functional as F
import torchvision
import torchvision.transforms as transforms

import tensorrt as trt
import torch_tensorrt as torchtrt
from torch_tensorrt.logging import *

import timm

torchtrt.logging.set_reportable_log_level(torchtrt.logging.Level.Graph)

# ResNet50 from timm:
# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py
net = timm.create_model('resnet50', pretrained=False)
checkpoint_path = './resnet50_a1_0-14fe96d1.pth'
weights = torch.load(checkpoint_path)
net.load_state_dict(weights, strict=True)

net = net.cuda()
model = torch.jit.script(net).eval()
dummy_min = torch.rand(1, 3, 224, 224).cuda()


class TRTEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    # (Class body omitted in the original report; the full implementation is the
    # same TRTEntropyCalibrator shown in the comment earlier in this thread.)
    ...


# CIFAR-10 test set used for calibration.
testing_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]))
testing_dataloader = torch.utils.data.DataLoader(testing_dataset, batch_size=1,
                                                 shuffle=False, num_workers=1)
calibrator = TRTEntropyCalibrator(testing_dataloader)

# Compile the scripted model with Torch-TensorRT in INT8.
compile_spec = {
    "inputs": [dummy_min],
    "enabled_precisions": {torch.float, torch.int8},
    "calibrator": calibrator,
    "truncate_long_and_double": True,
    "device": {
        "device_type": torchtrt.DeviceType.GPU,
        "gpu_id": 0,
        "dla_core": 0,
        "allow_gpu_fallback": False,
    },
}
qua_net = torchtrt.ts.compile(model, **compile_spec)

# Compare a single forward pass of the FP32 model and the INT8 engine on a random input.
native_net = model.cuda()
dummy_min = torch.rand(1, 3, 224, 224)
native_out = native_net(dummy_min.cuda())
print("native_out")
print(native_out)
aiak_out = qua_net(dummy_min.cuda())
print("qua_net")
print(aiak_out)
```