Open Urania880519 opened 1 month ago
Did you follow this tutorial? https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html
@narendasan I've followed both the tutorial you provided and this one: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html#dynamic-shapes. However, I get an error after calibration finishes (the calibration itself seemed successful and the loss was quite low). This is the code I used:
# Imports assumed from the rest of the script (not shown in the original post):
import torch
import torch_tensorrt as torchtrt
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.utils import export_torch_mode

# Calibrate and quantize the model with ModelOpt
quant_cfg = mtq.INT8_DEFAULT_CFG
mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)

with torch.no_grad():
    with export_torch_mode():
        input_tensor = torch.randn((1, channels, 35, 35), dtype=torch.float32).to('cuda')
        height_dim = torch.export.Dim("height_dim", min=25, max=64)
        width_dim = torch.export.Dim("width_dim", min=25, max=64)
        dynamic_shapes = ({2: height_dim, 3: width_dim},)

        from torch.export._trace import _export
        exp_program = _export(model, (input_tensor,), dynamic_shapes=dynamic_shapes)

        trt_Qmodel = torchtrt.dynamo.compile(
            exp_program,
            inputs=[input_tensor],
            enabled_precisions={torch.int8},
            min_block_size=1,
            debug=False,
            assume_dynamic_shape_support=True,
        )
@lanluo-nvidia or @peri044 can you provide additional guidance here?
@Urania880519
If you could paste the full code, I can try to reproduce it on my side and pin down the exact issue you are facing.
Also, int8 quantization support was introduced in the 2.5.0 release, so please try with PyTorch 2.5.0 and Torch-TensorRT 2.5.0.
Regarding dynamic shape support in Torch-TensorRT: if you have custom dynamic shape constraints, please use the torch.export.export() workflow described in this tutorial: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html
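For illustration, here is a minimal sketch of that workflow, combining torch.export.export() dimension constraints with an explicit min/opt/max input range passed to torch_tensorrt.dynamo.compile. The model, the 11-channel input, and the shape bounds are placeholders taken from the snippets in this thread, and the export of a ModelOpt-quantized model still runs under export_torch_mode(); treat it as a starting point, not a verified reproduction:

import torch
import torch_tensorrt
from modelopt.torch.quantization.utils import export_torch_mode

# Placeholder example input; channel count and H/W bounds follow the snippets above.
example_input = torch.randn(1, 11, 35, 35, device="cuda")

height_dim = torch.export.Dim("height_dim", min=25, max=64)
width_dim = torch.export.Dim("width_dim", min=25, max=64)

with torch.no_grad(), export_torch_mode():
    # Public export API with per-dimension constraints
    exp_program = torch.export.export(
        model,
        (example_input,),
        dynamic_shapes=({2: height_dim, 3: width_dim},),
    )

    trt_model = torch_tensorrt.dynamo.compile(
        exp_program,
        # Describe the full dynamic range to TensorRT instead of a single static tensor
        inputs=[
            torch_tensorrt.Input(
                min_shape=(1, 11, 25, 25),
                opt_shape=(1, 11, 35, 35),
                max_shape=(1, 11, 64, 64),
                dtype=torch.float32,
            )
        ],
        enabled_precisions={torch.int8},
        min_block_size=1,
    )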
❓ Question
I have a PTQ model and a QAT model trained with the official PyTorch API following the quantization tutorial, and I wish to deploy them on TensorRT for inference. The model is metaformer-like, using convolution layers as the token mixer. One part of the quantized model looks like this:
What you have already tried
I have tried different ways to make things work:
# 1) torch2trt with an int8 calibrator
model_trt = torch2trt(
    model_fp32, [torch.randn(1, 11, 64, 64).to('cuda')],
    max_batch_size=batch_size, fp16_mode=False, int8_mode=True,
    calibrator=trainLoader, input_shapes=[(None, 11, None, None)],
)

# 2) torch.compile with the TensorRT backend
trt_gm = torch.compile(model, dynamic=True, backend="tensorrt")

# 3) ONNX export with dynamic axes
torch.onnx.export(
    quantized_model, dummy_input, args.onnx_export_path,
    input_names=["input"], output_names=["output"],
    opset_version=13, export_params=True, keep_initializers_as_inputs=False,
    dynamic_axes={'input': {0: 'batch_size', 2: 'h', 3: 'w'},
                  'output': {0: 'batch_size', 2: 'h', 3: 'w'}},
)
Environment
How you installed PyTorch (conda, pip, libtorch, source): conda

Additional context
Personally, I think the torch.compile() API is the most promising route for converting the quantized model, since it shows no performance drop. Does anyone have relevant experience with handling quantized models?
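In case it helps the discussion, here is a minimal sketch of what that torch.compile() route could look like. Assumptions: model is the ModelOpt int8-quantized module, the 11-channel input sizes are placeholders from the snippets above, and the options dict carries Torch-TensorRT compile settings:

import torch
import torch_tensorrt  # importing this registers the "tensorrt" backend for torch.compile

compiled_model = torch.compile(
    model,  # placeholder: the quantized module from above
    backend="tensorrt",
    dynamic=True,
    options={
        "enabled_precisions": {torch.int8},
        "min_block_size": 1,
    },
)

with torch.no_grad():
    # Compilation happens lazily on the first forward call;
    # the two spatial sizes below exercise the dynamic-shape path.
    out_a = compiled_model(torch.randn(1, 11, 32, 32, device="cuda"))
    out_b = compiled_model(torch.randn(1, 11, 64, 64, device="cuda"))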