grimoire / torch2trt_dynamic

A pytorch to tensorrt convert with dynamic shape support
MIT License
254 stars, 34 forks

Is "inputs" used when quantizing to int8 with provided dataset? #37

Closed lebionick closed 3 months ago

lebionick commented 3 months ago

I'm trying to quantize my model to int8, passing a dataset as int8_calib_dataset. But it looks like calibration ignores the data from the dataset and instead optimizes only for the single batch that I pass as input. I think so because my model performs poorly on data from the dataset and reasonably well on data from that batch. I also tried replacing the batch with random values, and the model performed even worse. I also can't pass a larger batch because it exceeds the shape limits. The torch2trt documentation states that the input is ignored if a dataset is passed, but this does not appear to be the case. Could you help me with this?

grimoire commented 3 months ago

The torch2trt documentation states that the input is ignored if a dataset is passed

https://github.com/grimoire/torch2trt_dynamic/blob/05c5fdce8db9a8ff74ebbecc5ae23c74a07b7016/torch2trt_dynamic/torch2trt_dynamic.py#L523-L531

Yes, this is the designed behavior. You can set a pdb breakpoint there to check whether line 525 is reached. Note that the data in the dataset should have the same shape as opt_shape.
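As a rough sketch of the shape requirement above, the check below walks a calibration dataset and reports items whose shape differs from the opt shape. The names check_calib_dataset and FakeTensor are hypothetical and not part of torch2trt_dynamic; the sketch also assumes each item is a dict keyed by input name, as in the dataset posted later in this thread.

```python
from collections import namedtuple

# Stand-in for a tensor; anything with a .shape attribute works (assumption).
FakeTensor = namedtuple("FakeTensor", ["shape"])


def check_calib_dataset(dataset, opt_shapes):
    """Return a list of human-readable problems found in a calibration dataset.

    dataset:    object with __len__ and __getitem__, each item a dict
                mapping input names to tensors.
    opt_shapes: dict mapping input names to the expected opt shape.
    """
    problems = []
    for idx in range(len(dataset)):
        item = dataset[idx]
        if not isinstance(item, dict):
            problems.append(f"item {idx}: expected dict, got {type(item).__name__}")
            continue
        for name, expected in opt_shapes.items():
            if name not in item:
                problems.append(f"item {idx}: missing input '{name}'")
            elif tuple(item[name].shape) != tuple(expected):
                problems.append(
                    f"item {idx}: '{name}' has shape {tuple(item[name].shape)}, "
                    f"expected {tuple(expected)}"
                )
    return problems


if __name__ == "__main__":
    demo = [
        {"inputs": FakeTensor((4, 3, 512, 512))},  # matches opt shape
        {"inputs": FakeTensor((2, 3, 512, 512))},  # wrong batch size
    ]
    for line in check_calib_dataset(demo, {"inputs": (4, 3, 512, 512)}):
        print(line)
```

Running a check like this before building the engine is cheaper than discovering a shape mismatch during calibration.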

lebionick commented 3 months ago

My custom dataset is indeed used: for example, if I add a print to the __getitem__ function, I can see it being iterated over. The documentation is also incorrect about what a dataset should return: because of bind_inputs, TRT expects a dict rather than a tensor.

However, let me show some outputs of a segmentation model to support my initial statement. I am using two sets of images: the validation set, used only in the calibration dataset, and the test set, used as inputs during quantization and inference.

Here is a result when I pass the first batch from the test set as the "inputs" parameter (alongside the calibration dataset).

A result on the same batch (very good, actually):
[image: actual_images_inputs_same_inference]

A result on another batch from the test set (worse):
[image: actual_images_inputs_different_inference]

And here are the results when I use just a random tensor as the "inputs" parameter:
[image: random_images_inputs_1_inference]
[image: random_images_inputs_2_inference]

Given this, I think there is some kind of bug, or I am misunderstanding the API usage, because the inputs parameter clearly alters the result significantly.

import random
from pathlib import Path

import tensorrt as trt
import torch
from torch2trt_dynamic import BuildEngineConfig, module2trt

# create_model, create_input_pack, load_image and preprocess are project
# helpers that are not shown here.


def convert_model():
    model = create_model()
    # inputs = create_input_pack(4)  # first batch from test set
    inputs = torch.randn(4, 3, 512, 512).cuda()
    config = BuildEngineConfig(
        shape_ranges=dict(
            inputs=dict(
                min=(4, 3, 512, 512),
                opt=(4, 3, 512, 512),
                max=(4, 3, 512, 512),
            )
        ),
        int8=True,
        int8_calib_dataset=CalibDataset(),
        int8_calib_algorithm=trt.CalibrationAlgoType.MINMAX_CALIBRATION,
        int8_batch_size=4,
    )
    model_trt = module2trt(model, args=[inputs], config=config)


class CalibDataset:
    def __init__(self):
        self.paths = list(Path("/path/to/validation/set/").glob("*.png"))
        # Shuffle deterministically and in place; shuffling list(self.paths)
        # would only shuffle a temporary copy.
        random.seed(0)
        random.shuffle(self.paths)
        random.seed(None)
        self.chunked_paths = list(self.chunks(self.paths, 4))

    def __getitem__(self, idx):
        path_chunk = self.chunked_paths[idx]
        images = [load_image(path) for path in path_chunk]
        tensor = preprocess(*images)  # tensor of shape [4, 3, 512, 512]
        return dict(inputs=tensor)

    @staticmethod
    def chunks(lst, n):
        # Yield successive n-sized chunks of lst.
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    def __len__(self):
        return len(self.chunked_paths)

lebionick commented 3 months ago

I ran the code with INFO logging; here is the output:

[05/17/2024-19:28:48] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 183, GPU 1080 (MiB)
[05/17/2024-19:28:53] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +889, GPU +174, now: CPU 1148, GPU 1254 (MiB)
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.reshape
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
unknown interpolate type, use linear instead.
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
unknown interpolate type, use linear instead.
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_cache_enabled
Warning: Encountered known unsupported method torch.is_autocast_enabled
Warning: Encountered known unsupported method torch.get_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.autocast_increment_nesting
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
unknown interpolate type, use linear instead.
Warning: Encountered known unsupported method torch.autocast_decrement_nesting
Warning: Encountered known unsupported method torch.clear_autocast_cache
Warning: Encountered known unsupported method torch.set_autocast_enabled
Warning: Encountered known unsupported method torch.set_autocast_gpu_dtype
Warning: Encountered known unsupported method torch.set_autocast_cache_enabled
[05/17/2024-19:28:54] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[05/17/2024-19:28:54] [TRT] [I] Graph optimization time: 0.00700145 seconds.
[05/17/2024-19:28:54] [TRT] [W] BuilderFlag::kENABLE_TACTIC_HEURISTIC has been ignored in this builder run. This feature is only supported on Ampere and beyond.
[05/17/2024-19:28:54] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[05/17/2024-19:28:57] [TRT] [I] Detected 1 inputs and 1 output network tensors.
[05/17/2024-19:29:00] [TRT] [I] Total Host Persistent Memory: 258976
[05/17/2024-19:29:00] [TRT] [I] Total Device Persistent Memory: 1459200
[05/17/2024-19:29:00] [TRT] [I] Total Scratch Memory: 0
[05/17/2024-19:29:00] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 12 MiB, GPU 555 MiB
[05/17/2024-19:29:00] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 547 steps to complete.
[05/17/2024-19:29:00] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 351.346ms to assign 160 blocks to 547 nodes requiring 618735104 bytes.
[05/17/2024-19:29:00] [TRT] [I] Total Activation Memory: 618735104
[05/17/2024-19:29:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +591, now: CPU 0, GPU 1145 (MiB)
[05/17/2024-19:29:00] [TRT] [I] Starting Calibration.
CalibDataset is in use!
[05/17/2024-19:29:01] [TRT] [I]   Calibrated batch 0 in 0.687136 seconds.
[05/17/2024-19:29:02] [TRT] [I]   Calibrated batch 1 in 0.63095 seconds.
[05/17/2024-19:29:02] [TRT] [I]   Calibrated batch 2 in 0.629724 seconds.
[05/17/2024-19:29:03] [TRT] [I]   Calibrated batch 3 in 0.641961 seconds.
[05/17/2024-19:29:04] [TRT] [I]   Calibrated batch 4 in 0.633817 seconds.
[05/17/2024-19:29:05] [TRT] [I]   Calibrated batch 5 in 0.636657 seconds.
[05/17/2024-19:29:05] [TRT] [I]   Calibrated batch 6 in 0.641569 seconds.
[05/17/2024-19:29:06] [TRT] [I]   Calibrated batch 7 in 0.647993 seconds.
[05/17/2024-19:29:07] [TRT] [I]   Calibrated batch 8 in 0.667747 seconds.
[05/17/2024-19:29:07] [TRT] [I]   Calibrated batch 9 in 0.649291 seconds.
[05/17/2024-19:29:08] [TRT] [I]   Calibrated batch 10 in 0.64004 seconds.
[05/17/2024-19:29:09] [TRT] [I]   Calibrated batch 11 in 0.658339 seconds.
[05/17/2024-19:29:09] [TRT] [I]   Calibrated batch 12 in 0.649366 seconds.
[05/17/2024-19:29:10] [TRT] [I]   Calibrated batch 13 in 0.675017 seconds.
[05/17/2024-19:29:11] [TRT] [I]   Calibrated batch 14 in 0.651902 seconds.
[05/17/2024-19:29:12] [TRT] [I]   Calibrated batch 15 in 0.651965 seconds.
[05/17/2024-19:29:12] [TRT] [I]   Calibrated batch 16 in 0.647481 seconds.
[05/17/2024-19:29:13] [TRT] [I]   Calibrated batch 17 in 0.665798 seconds.
[05/17/2024-19:29:14] [TRT] [I]   Calibrated batch 18 in 0.660431 seconds.
[05/17/2024-19:29:15] [TRT] [I]   Calibrated batch 19 in 0.668582 seconds.
[05/17/2024-19:29:15] [TRT] [I]   Calibrated batch 20 in 0.65133 seconds.
[05/17/2024-19:29:16] [TRT] [I]   Calibrated batch 21 in 0.663916 seconds.
[05/17/2024-19:29:17] [TRT] [I]   Calibrated batch 22 in 0.654861 seconds.
[05/17/2024-19:29:17] [TRT] [I]   Calibrated batch 23 in 0.661559 seconds.
[05/17/2024-19:29:18] [TRT] [I]   Calibrated batch 24 in 0.667941 seconds.
[05/17/2024-19:29:19] [TRT] [I]   Calibrated batch 25 in 0.670967 seconds.
[05/17/2024-19:29:20] [TRT] [I]   Calibrated batch 26 in 0.668568 seconds.
[05/17/2024-19:29:20] [TRT] [I]   Calibrated batch 27 in 0.664546 seconds.
[05/17/2024-19:29:21] [TRT] [I]   Calibrated batch 28 in 0.668346 seconds.
[05/17/2024-19:29:22] [TRT] [I]   Calibrated batch 29 in 0.66329 seconds.
[05/17/2024-19:29:22] [TRT] [I]   Calibrated batch 30 in 0.669448 seconds.
[05/17/2024-19:29:23] [TRT] [I]   Calibrated batch 31 in 0.671419 seconds.
[05/17/2024-19:29:24] [TRT] [I]   Calibrated batch 32 in 0.665009 seconds.
[05/17/2024-19:29:25] [TRT] [I]   Calibrated batch 33 in 0.669304 seconds.
[05/17/2024-19:29:25] [TRT] [I]   Calibrated batch 34 in 0.669966 seconds.
[05/17/2024-19:29:26] [TRT] [I]   Calibrated batch 35 in 0.668251 seconds.
[05/17/2024-19:29:27] [TRT] [I]   Calibrated batch 36 in 0.677547 seconds.
[05/17/2024-19:29:28] [TRT] [I]   Calibrated batch 37 in 0.66991 seconds.
[05/17/2024-19:29:28] [TRT] [I]   Calibrated batch 38 in 0.669118 seconds.
[05/17/2024-19:29:29] [TRT] [I]   Calibrated batch 39 in 0.674888 seconds.
[05/17/2024-19:29:30] [TRT] [I]   Calibrated batch 40 in 0.682753 seconds.
[05/17/2024-19:29:31] [TRT] [I]   Calibrated batch 41 in 0.679356 seconds.
[05/17/2024-19:29:31] [TRT] [I]   Calibrated batch 42 in 0.670545 seconds.
[05/17/2024-19:29:32] [TRT] [I]   Calibrated batch 43 in 0.660989 seconds.
[05/17/2024-19:29:33] [TRT] [I]   Calibrated batch 44 in 0.671494 seconds.
[05/17/2024-19:29:33] [TRT] [I]   Calibrated batch 45 in 0.669586 seconds.
[05/17/2024-19:29:34] [TRT] [I]   Calibrated batch 46 in 0.668317 seconds.
[05/17/2024-19:29:35] [TRT] [I]   Calibrated batch 47 in 0.675252 seconds.
[05/17/2024-19:29:36] [TRT] [I]   Calibrated batch 48 in 0.667394 seconds.
[05/17/2024-19:29:36] [TRT] [I]   Calibrated batch 49 in 0.674156 seconds.
[05/17/2024-19:29:37] [TRT] [I]   Calibrated batch 50 in 0.654933 seconds.
<I removed some lines; 500 batches in total>
[05/17/2024-19:36:12] [TRT] [I]   Post Processing Calibration data in 60.6313 seconds.
[05/17/2024-19:36:12] [TRT] [I] Calibration completed in 438.268 seconds.
[05/17/2024-19:36:12] [TRT] [I] Writing Calibration Cache for calibrator: TRT-8601-EntropyCalibration2
[05/17/2024-19:36:13] [TRT] [I] Graph optimization time: 0.169575 seconds.
[05/17/2024-19:36:13] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[05/17/2024-19:36:13] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/17/2024-19:39:41] [TRT] [I] Detected 1 inputs and 1 output network tensors.
[05/17/2024-19:39:41] [TRT] [I] Total Host Persistent Memory: 302784
[05/17/2024-19:39:41] [TRT] [I] Total Device Persistent Memory: 1561088
[05/17/2024-19:39:41] [TRT] [I] Total Scratch Memory: 0
[05/17/2024-19:39:41] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 118 MiB, GPU 1152 MiB
[05/17/2024-19:39:41] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 95 steps to complete.
[05/17/2024-19:39:41] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 1.89153ms to assign 6 blocks to 95 nodes requiring 89128960 bytes.
[05/17/2024-19:39:41] [TRT] [I] Total Activation Memory: 89128960
[05/17/2024-19:39:41] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +34, GPU +42, now: CPU 34, GPU 42 (MiB)
CREATED HOST_MEM
[05/17/2024-19:39:41] [TRT] [I] Loaded engine size: 42 MiB
[05/17/2024-19:39:41] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +42, now: CPU 0, GPU 42 (MiB)
[05/17/2024-19:39:41] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +86, now: CPU 0, GPU 128 (MiB)

grimoire commented 3 months ago

I have not bound torch.reshape, but that should be OK if your model is static. Alternatively, you can bind it yourself: https://github.com/grimoire/torch2trt_dynamic/blob/05c5fdce8db9a8ff74ebbecc5ae23c74a07b7016/torch2trt_dynamic/converters/view.py#L9-L12 Note that any unsupported method is converted to a const op in the graph.

The calibration looks OK. The result might come from an outlier in the intermediate values. You can locate the outlier by converting part of the model and comparing the outputs. Once the outlier has been located, scaling the values around the op might help.
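To illustrate why a single outlier can hurt, here is a toy, pure-Python simulation of symmetric INT8 quantization with a min-max style range. fake_quant_int8 is a hypothetical helper written for this illustration, not TensorRT's implementation, and the activation values are made up.

```python
def fake_quant_int8(values, amax):
    """Simulate symmetric INT8 quantization over the range [-amax, amax]."""
    step = amax / 127.0  # size of one quantization bucket
    quantized = []
    for v in values:
        q = max(-127, min(127, round(v / step)))  # clamp to the INT8 range
        quantized.append(q * step)                # dequantize back to float
    return quantized


def max_abs_error(reference, test):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - t) for r, t in zip(reference, test))


# Typical small activations, plus one outlier seen during calibration.
activations = [0.01, -0.02, 0.03, 0.015]
with_outlier = activations + [250.0]

# Min-max calibration: the range is the largest magnitude observed.
amax_clean = max(abs(v) for v in activations)
amax_outlier = max(abs(v) for v in with_outlier)

err_clean = max_abs_error(activations, fake_quant_int8(activations, amax_clean))
err_outlier = max_abs_error(activations, fake_quant_int8(activations, amax_outlier))

if __name__ == "__main__":
    print("error without outlier:", err_clean)
    print("error with outlier:   ", err_outlier)
```

With the outlier in the range, every small activation falls into the zero bucket, so the layer's useful signal is lost. This is the situation that rescaling values around the affected op tries to avoid.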

out = out / scale
out = quant_op(out)
out = out * scale

lebionick commented 3 months ago

Thank you, I'll try it

lebionick commented 3 months ago

Indeed, there was a problem with torch.reshape: I replaced it with tensor.reshape inside my model, and now the conversion at least crashes (it seems that initially part of the network was replaced with constant values produced by the first batch).
I tried different calibrators, input values, and environment versions. The error messages differ slightly, but I think they all point to the same problem.

[05/21/2024-18:15:32] [TRT] [E] 2: [weightConvertors.cpp::quantizeBiasCommon::337] Error Code 2: Internal Error (Assertion getter(i) != 0 failed. )
[05/21/2024-18:15:32] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
activation(108): error: identifier "inff" is undefined
      dst0 = tmp * inff;
1 error detected in the compilation of "activation".
[05/21/2024-17:59:28] [TRT] [E] 2: [quantization.cpp::DynamicRange::80] Error Code 2: Internal Error (Assertion min_ <= max_ failed. )
[05/21/2024-17:59:28] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

Then I tried converting to both FP16 and FP32 instead of INT8, and the conversion succeeded, except that the engine produces NaN values all the time. It looks like the network is very sensitive to precision inside some layers and includes some complicated reshapes, for instance: https://github.com/mit-han-lab/efficientvit/blob/master/efficientvit/models/nn/ops.py#L402

What would you recommend to tackle this?

P.S. I also tried converting to ONNX and using polygraphy. According to the logs, it skipped quantization of many layers, which increased inference time :)

grimoire commented 3 months ago

You can add some intermediate outputs to the final outputs. If any of them produces an unexpected tensor, something went wrong in the preceding layers. Repeat this step and you will locate the layer that introduces the error.
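A minimal sketch of that search, assuming the intermediate outputs of the reference (PyTorch) model and of the converted engine have already been collected as flat lists of floats in execution order. The function first_bad_layer and the layer names are hypothetical and only illustrate the idea.

```python
import math


def first_bad_layer(reference, candidate, tol=1e-2):
    """Return the name of the first layer whose output looks wrong, else None.

    reference, candidate: dicts mapping layer name to a flat list of floats,
    inserted in execution order (dicts keep insertion order in Python 3.7+).
    A layer is flagged if its candidate output contains NaN/Inf or differs
    from the reference by more than tol in any element.
    """
    for name, ref_values in reference.items():
        out_values = candidate[name]
        if any(not math.isfinite(v) for v in out_values):
            return name
        diffs = (abs(r - o) for r, o in zip(ref_values, out_values))
        if max(diffs, default=0.0) > tol:
            return name
    return None


if __name__ == "__main__":
    # Made-up values: the second layer produces a NaN in the converted model.
    reference = {"stem": [1.0, 2.0], "attention": [0.5, 0.25], "head": [3.0]}
    candidate = {"stem": [1.001, 2.0], "attention": [float("nan"), 0.25], "head": [3.0]}
    print(first_bad_layer(reference, candidate))
```

The first flagged layer is where to start looking, since errors in later layers may simply be propagated from it.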

lebionick commented 3 months ago

I overcame this limitation of torch2trt by using torch.onnx.export with the correct parameters and then converting with polygraphy and the entropy2 calibrator. However, there are still some layers that do not support translation into INT8; the converter ignores them and uses FP32 instead. I got decent quality after reducing precision, but it is far worse than FP16 and the speedup is quite small. So I am sticking with ONNX + trtexec with FP16.