microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Increasing Memory Usage during INT8 Quantization with ONNX Runtime tools #21979

Open noujaimc opened 1 month ago

noujaimc commented 1 month ago

Describe the issue

Hello,

I'm trying to quantize an ONNX model to INT8 using the ONNX Runtime tools provided here. I have about 1,000 images of size 640x640x3 that I'm using for calibration data. However, when running the following script (run.py):

import os
import argparse
import numpy as np
import onnxruntime
from PIL import Image
from onnxruntime.quantization import CalibrationDataReader, create_calibrator, write_calibration_table, CalibrationMethod

class CalDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder: str, model_path: str, batch_size: int = 1):
        super().__init__()
        self.batch_images = []

        images = []
        for image_name in os.listdir(calibration_image_folder):
            img_path = os.path.join(calibration_image_folder, image_name)
            try:
                images.append(np.array(Image.open(img_path).convert('RGB')).astype(np.float32) / 255.0)
            except Exception as e:
                print(f"Error loading image {img_path}: {e}")

        self.batch_images = [np.stack(images[i:i + batch_size]) for i in range(0, len(images), batch_size)]  
        self.enum = iter(self.batch_images)
        self.input_name = onnxruntime.InferenceSession(model_path, None).get_inputs()[0].name

    def get_next(self):
        next_batch = next(self.enum, None)

        if next_batch is not None:
            return {self.input_name: next_batch}

        return None

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_model", default="./model-infer.onnx", help="input model")
    parser.add_argument("--dataset", default="./cal_images", help="calibration data set")
    args = parser.parse_args()
    return args

def main():
    args = get_args()
    input_model_path = args.input_model
    calibration_dataset_path = args.dataset
    augmented_model_path = input_model_path.replace('.onnx', '.augmented.onnx')

    try: 
        calibrator = create_calibrator(input_model_path, [], augmented_model_path=augmented_model_path, calibrate_method=CalibrationMethod.Entropy)
        calibrator.set_execution_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]) 
        calibrator.collect_data(data_reader=CalDataReader(calibration_dataset_path, input_model_path, 2))

        new_compute_range = {}
        for k, v in calibrator.compute_data().data.items():
            v1, v2 = v.range_value
            new_compute_range[k] = (float(v1.item()), float(v2.item()))

        write_calibration_table(new_compute_range)

        print("Quantized model saved.")
    except Exception as e:
        print("An error occurred:", e)

if __name__ == "__main__":
    main()

I noticed that memory consumption keeps increasing with every batch of image data sent to the calibrator in the collect_data method of the Calibrator class. Memory usage grows until the system can no longer allocate more. It seems that the calibration process retains all intermediate outputs in memory, which doesn't scale well when calibrating with many or large images.
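
For illustration, this is roughly how the growth can be observed around each collect_data call (psutil is assumed to be available; the helper below is just a placeholder, not part of the quantization API):

import os
import psutil  # assumed available; only used here to measure process memory

def log_rss(tag: str) -> None:
    # Print the resident set size of the current process in MB.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    print(f"[{tag}] RSS: {rss_mb:.1f} MB")

# e.g. log_rss("before"); calibrator.collect_data(reader); log_rss("after")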

My goal is to use the quantized model with TensorRT.

Is this the correct approach to quantize the model?

I was able to reproduce the problem using the quantization example provided here. To do so, you simply need to copy the images in the test_images folder multiple times.
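
For example, something along these lines inflates the calibration set enough to trigger the growth (the folder name, extension, and copy count are arbitrary):

import shutil
from pathlib import Path

# Duplicate every image in test_images a number of times so the calibration
# set becomes large enough to reproduce the memory growth.
src = Path("test_images")
copies = 20
for img in src.glob("*.jpg"):
    for i in range(copies):
        shutil.copy(img, src / f"{img.stem}_copy{i}{img.suffix}")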

Thank you

To reproduce

1) Run pre-processing in command line:

python -m onnxruntime.quantization.preprocess --input model.onnx --output model-infer.onnx --auto_merge

2) Run calibration script:

python run.py --input_model model-infer.onnx --dataset cal_images

If needed, I can send the model and images.

Urgency

For now, I can only calibrate with a couple of images.

Platform

Windows

OS Version

Windows 10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

yufenglee commented 1 month ago

@noujaimc, calibrator.collect_data(data_reader) supports gathering statistics over one portion of the data at a time. You can refer to an example here: https://github.com/microsoft/onnxruntime-inference-examples/blob/77989cff19f102300e3c4f99b957b55f74daecb4/quantization/object_detection/trt/yolov3/e2e_user_yolov3_example.py#L73
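
In other words, the pattern is roughly the following, assuming the data reader is extended to accept a start/end slice of the dataset (variable names here are illustrative):

# Feed the calibrator one slice of the dataset per call so only that slice
# is resident in memory at a time; statistics accumulate across calls.
stride = 10
for start in range(0, total_images, stride):
    reader = CalDataReader(dataset_path, model_path, batch_size, start, start + stride)
    calibrator.collect_data(data_reader=reader)
# compute_data() and write_calibration_table(...) are then called once at the end.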

noujaimc commented 1 month ago

Here's the updated code using multiple readers. It works, but it takes several hours (more than 8 hours for 4,000 images) to complete on CPU and CUDA. Is this normal? The bottleneck isn't the inference step, but collecting the tensor data and building the histograms.

import os
import argparse
import numpy as np
import onnxruntime
from PIL import Image
from onnxruntime.quantization import CalibrationDataReader, create_calibrator, write_calibration_table, CalibrationMethod

class CalDataReader(CalibrationDataReader):
    def __init__(self, calibration_image_folder: str, model_path: str, batch_size: int = 1, start_index: int = 1, end_index: int = 1):
        super().__init__()
        self.batch_images = []

        selected_images = os.listdir(calibration_image_folder)[start_index:end_index]
        print(f"Loading image from {start_index} to {end_index} ", selected_images)

        images = []
        for image_name in selected_images:
            img_path = os.path.join(calibration_image_folder, image_name)
            try:
                image = np.array(Image.open(img_path).convert('RGB')).astype(np.float32) / 255.0
                images.append(image)
            except Exception as e:
                print(f"Error loading image {img_path}: {e}")

        self.batch_images = [np.stack(images[i:i + batch_size]) for i in range(0, len(images), batch_size)]
        self.enum = iter(self.batch_images)
        self.input_name = onnxruntime.InferenceSession(model_path, None).get_inputs()[0].name

    def get_next(self):
        next_batch = next(self.enum, None)

        if next_batch is not None:
            return {self.input_name: next_batch}

        return None

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_model", default="./model-infer.onnx", help="input model")
    parser.add_argument("--dataset", default="./cal_images", help="calibration data set")
    args = parser.parse_args()
    return args

def main():
    args = get_args()
    input_model_path = args.input_model
    calibration_dataset_path = args.dataset
    augmented_model_path = input_model_path.replace('.onnx', '.augmented.onnx')

    try: 
        calibrator = create_calibrator(input_model_path, [], augmented_model_path=augmented_model_path, calibrate_method=CalibrationMethod.Entropy)
        calibrator.set_execution_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]) 

        total_data_size = len(os.listdir(calibration_dataset_path))
        batch_size = 5
        stride = 10
        # Feed the calibrator one slice of the dataset per call so that only
        # that slice of images is held in memory at a time.
        for start_index in range(0, total_data_size, stride):
            end_index = start_index + stride
            calibrator.collect_data(data_reader=CalDataReader(calibration_dataset_path, input_model_path, batch_size, start_index, end_index))

        new_compute_range = {}
        for k, v in calibrator.compute_data().data.items():
            v1, v2 = v.range_value
            new_compute_range[k] = (float(v1.item()), float(v2.item()))

        write_calibration_table(new_compute_range)

        print("Quantized model saved.")
    except Exception as e:
        print("An error occurred:", e)

if __name__ == "__main__":
    main()
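
Since the slow part appears to be the histogram accumulation done by the Entropy calibrator rather than the inference itself, one experiment that might be worth comparing (purely as a sketch, not necessarily what the TensorRT workflow requires) is the MinMax calibration method, which only tracks per-tensor minimum and maximum values instead of building histograms:

# Same workflow as above, only the calibration method is swapped; MinMax skips
# the per-tensor histogram accumulation at the cost of a cruder range estimate.
calibrator = create_calibrator(
    input_model_path,
    [],
    augmented_model_path=augmented_model_path,
    calibrate_method=CalibrationMethod.MinMax,
)
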
github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.