pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] Can't do quantization aware training for ViT #2101

Closed: proevgenii closed this issue 11 months ago

proevgenii commented 1 year ago

Bug Description

I'm trying to apply the quantization aware training (QAT) procedure to a ViT model, following this example notebook: https://github.com/pytorch/TensorRT/blob/main/notebooks/qat-ptq-workflow.ipynb. I get this error:

ERROR: [Torch-TensorRT TorchScript Conversion Context] - 10: Could not find any implementation for node [Freeze Tensor 0x39ab1d18 ] + (Unnamed Layer* 5) [Quantize] + (Unnamed Layer* 7) [Convolution].
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node [Freeze Tensor 0x39ab1d18 ] + (Unnamed Layer* 5) [Quantize] + (Unnamed Layer* 7) [Convolution].)

To Reproduce

I follow section 4 (Quantization Aware Training) of the example notebook step by step.

  1. So I load my pretrained ViT model from timm:
    model_name = 'vit_base_patch32_224_clip_laion2b'
    q_model = timm.create_model(model_name, pretrained=True, num_classes=num_cls, exportable=True, scriptable=True)
    model_dct = torch.load(checkpoints_path, map_location=device)
    q_model.load_state_dict(model_dct['state_dict_ema'])
    q_model = q_model.eval().to(device)
  2. Run these calibration functions, which produce many warnings (a hedged sketch of the collect_stats/compute_amax helpers follows the output below):
    #Calibrate the model using percentile calibration technique.
    with torch.no_grad():
        collect_stats(q_model, dataloader_train, num_batches=32)
        compute_amax(q_model, method="max")
Truncated output:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:17<00:00,  1.81it/s]
WARNING: Logging before flag parsing goes to stderr.
W0712 09:36:27.717221 140164031010624 tensor_quantizer.py:173] Disable MaxCalibrator
....
W0712 09:36:27.780649 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
W0712 09:36:27.781133 140164031010624 tensor_quantizer.py:238] Call .cuda() if running on GPU after loading calibrated amax.
W0712 09:36:27.781711 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1, 1, 1]).
W0712 09:36:27.782293 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
W0712 09:36:27.782825 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([2304, 1]).
W0712 09:36:27.783268 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
W0712 09:36:27.783745 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([768, 1]).
W0712 09:36:27.784252 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
W0712 09:36:27.784877 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([3072, 1]).
W0712 09:36:27.785372 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([]).
........
W0712 09:36:27.831604 140164031010624 tensor_quantizer.py:237] Load calibrated amax, shape=torch.Size([6, 1]).
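
For reference, collect_stats and compute_amax are helpers defined in the example notebook, not part of torch_tensorrt itself. Below is a hedged sketch of what they do, reconstructed from the notebook and the pytorch_quantization API; exact details (loop bounds, logging) may differ from the original:

    # Hedged reconstruction of the notebook's calibration helpers, built on the
    # pytorch_quantization package shipped in the NGC PyTorch containers.
    from pytorch_quantization import nn as quant_nn
    from pytorch_quantization import calib

    def collect_stats(model, data_loader, num_batches):
        """Run batches through the model while calibrators record activation statistics."""
        # Put every TensorQuantizer into calibration mode (quantization disabled).
        for module in model.modules():
            if isinstance(module, quant_nn.TensorQuantizer):
                if module._calibrator is not None:
                    module.disable_quant()
                    module.enable_calib()
                else:
                    module.disable()

        for i, (image, _) in enumerate(data_loader):
            model(image.cuda())
            if i >= num_batches:
                break

        # Switch back: enable quantization, stop calibrating.
        for module in model.modules():
            if isinstance(module, quant_nn.TensorQuantizer):
                if module._calibrator is not None:
                    module.enable_quant()
                    module.disable_calib()
                else:
                    module.enable()

    def compute_amax(model, **kwargs):
        """Load the calibrated amax (dynamic range) into every quantizer."""
        for module in model.modules():
            if isinstance(module, quant_nn.TensorQuantizer):
                if module._calibrator is not None:
                    if isinstance(module._calibrator, calib.MaxCalibrator):
                        module.load_calib_amax()
                    else:
                        module.load_calib_amax(**kwargs)
        model.cuda()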

  3. Then run training, exactly as in the example notebook.
  4. Get the jit_model with the code below:
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    with torch.no_grad():
        data = iter(dataloader_test)
        images, _ = next(data)
        jit_model = torch.jit.trace(q_model, images.to("cuda"))
        torch.jit.save(jit_model, "model_qat.jit.pt")
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
This also produces warnings (truncated output):
E0712 09:39:10.011697 140164031010624 tensor_quantizer.py:120] Fake quantize mode doesn't use scale explicitly!
......
W0712 09:39:10.715657 140164031010624 tensor_quantizer.py:280] Use Pytorch's native experimental fake quantization.
E0712 09:39:11.219808 140164031010624 tensor_quantizer.py:120] Fake quantize mode doesn't use scale explicitly!
...
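
As a sanity check (not part of the original notebook), a hedged way to confirm the trace actually carries quantization ops that Torch-TensorRT can convert is to count fake-quantize nodes in the traced module's inlined graph; jit_model here is the traced module from the step above:

    # Hedged diagnostic sketch: with use_fb_fake_quant = True the trace should
    # contain aten::fake_quantize_per_tensor_affine / _per_channel_affine nodes,
    # which Torch-TensorRT converts to TensorRT Quantize/Dequantize layers.
    fq_nodes = [n.kind() for n in jit_model.inlined_graph.nodes()
                if "fake_quantize" in n.kind()]
    print(f"{len(fq_nodes)} fake-quantize nodes in the traced graph")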

  5. Compile it into a TensorRT model:
    #Loading the Torchscript model and compiling it into a TensorRT model
    qat_model = torch.jit.load("model_qat.jit.pt").eval()
    compile_spec = {"inputs": [torch_tensorrt.Input([64, 3, 224, 224])],
                "enabled_precisions": torch.int8,
                "truncate_long_and_double": True
               }
    trt_mod = torch_tensorrt.compile(qat_model, **compile_spec,)
This code raises a RuntimeError:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[14], line 7
      2 qat_model = torch.jit.load("model_qat.jit.pt").eval()
      3 compile_spec = {"inputs": [torch_tensorrt.Input([64, 3, 224, 224])],
      4                 "enabled_precisions": torch.int8,
      5                 "truncate_long_and_double": True
      6                }
----> 7 trt_mod = torch_tensorrt.compile(qat_model, **compile_spec,)

File /usr/local/lib/python3.8/dist-packages/torch_tensorrt/_compile.py:125, in compile(module, ir, inputs, enabled_precisions, **kwargs)
    120         logging.log(
    121             logging.Level.Info,
    122             "Module was provided as a torch.nn.Module, trying to script the module with torch.jit.script. In the event of a failure please preconvert your module to TorchScript",
    123         )
    124         ts_mod = torch.jit.script(module)
--> 125     return torch_tensorrt.ts.compile(
    126         ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs
    127     )
    128 elif target_ir == _IRType.fx:
    129     if (
    130         torch.float16 in enabled_precisions
    131         or torch_tensorrt.dtype.half in enabled_precisions
    132     ):

File /usr/local/lib/python3.8/dist-packages/torch_tensorrt/ts/_compiler.py:136, in compile(module, inputs, input_signature, device, disable_tf32, sparse_weights, enabled_precisions, refit, debug, capability, num_avg_timing_iters, workspace_size, dla_sram_size, dla_local_dram_size, dla_global_dram_size, calibrator, truncate_long_and_double, require_full_compilation, min_block_size, torch_executed_ops, torch_executed_modules)
    110     raise ValueError(
    111         f"require_full_compilation is enabled however the list of modules and ops to run in torch is not empty. Found: torch_executed_ops: {torch_executed_ops}, torch_executed_modules: {torch_executed_modules}"
    112     )
    114 spec = {
    115     "inputs": inputs,
    116     "input_signature": input_signature,
   (...)
    133     },
    134 }
--> 136 compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
    137 compiled_module = torch.jit._recursive.wrap_cpp_module(compiled_cpp_mod)
    138 return compiled_module

RuntimeError: [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:169] Building serialized network failed in TensorRT
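
One hedged way to get more detail on where the conversion fails (the logging context manager below exists in torch_tensorrt 1.4.x) is to wrap the compile call in debug logging:

    import torch_tensorrt

    # Hedged sketch: emit layer-level Torch-TensorRT / TensorRT logs around the
    # failing compile call to see which block the Quantize + Convolution node
    # comes from.
    with torch_tensorrt.logging.debug():
        trt_mod = torch_tensorrt.compile(qat_model, **compile_spec)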

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • I'm running this inside Nvidia container: nvcr.io/nvidia/pytorch:23.04-py3
  • Torch-TensorRT Version (e.g. 1.0.0): '1.4.0.dev0'
  • PyTorch Version (e.g. 1.0): '2.1.0a0+fe05266'
  • Python version: Python 3.8.10
  • CUDA version: 12.1
  • GPU models and configuration: Tesla T4
proevgenii commented 1 year ago

Any updates?)

proevgenii commented 1 year ago

@peri044 Any updates?

github-actions[bot] commented 11 months ago

This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.