ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

My attempts at Quantization (any advice appreciated) #2266

Open jackfaubshner opened 1 month ago

jackfaubshner commented 1 month ago

Question

Hello everyone! :)

My goal for the next few days is to use this repository and implement Post Training Static Quantization and Quantization Aware Training to compare the mAP and inference speed with that of the non-quantized model

To get started, I am using YOLOv3-tiny (416 × 416) as it takes way less time to train (about a day for me). Once the results with YOLOv3-tiny are acceptable, I can move on to the full YOLOv3 (416 × 416)
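
For the speed side of the comparison mentioned above, a minimal CPU latency check could look something like the sketch below (float_model and int8_model are placeholder names; eager-mode INT8 models run on the CPU quantized backends):

import time

import torch


def benchmark_ms(model, img_size=416, runs=100):
    # Rough per-image latency in milliseconds for an eval-mode nn.Module on CPU
    x = torch.randn(1, 3, img_size, img_size)
    with torch.no_grad():
        for _ in range(10):  # warm-up
            model(x)
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - t0) / runs * 1000


# print(benchmark_ms(float_model), benchmark_ms(int8_model))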

What have I done so far:

1. Train YOLOv3-tiny (416 × 416) from scratch using this repository (with no modifications to the code). The below command was used:

python3.8 -m torch.distributed.run --nproc_per_node 4 train.py --data coco.yaml --epochs 300 --weight '' --cfg yolov3-tiny.yaml --img 416 --batch-size 128

This model has an mAP of 0.31, while the original Darknet model has an mAP of 0.331. I believe a 0.02 mAP loss is acceptable.

2. Use the modifications from another issue (https://github.com/ultralytics/yolov3/issues/1734) to add quantization and dequantization layers at the beginning and end of the model.

yolov3-tiny.yaml:
# YOLOv3 🚀 by Ultralytics, AGPL-3.0 license

# Parameters
nc: 80 # number of classes
depth_multiple: 1.0 # model depth multiple
width_multiple: 1.0 # layer channel multiple
anchors:
  - [10, 14, 23, 27, 37, 58] # P4/16
  - [81, 82, 135, 169, 344, 319] # P5/32

# YOLOv3-tiny backbone
backbone:
  # [from, number, module, args]
  [
    [-1, 1, torch.quantization.QuantStub,[]],
    [-1, 1, Conv, [16, 3, 1]], # 0
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 1-P1/2
    [-1, 1, Conv, [32, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 3-P2/4
    [-1, 1, Conv, [64, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 5-P3/8
    [-1, 1, Conv, [128, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 7-P4/16
    [-1, 1, Conv, [256, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 9-P5/32
    [-1, 1, Conv, [512, 3, 1]],
    [-1, 1, nn.ZeroPad2d, [[0, 1, 0, 1]]], # 11
    [-1, 1, nn.MaxPool2d, [2, 1, 0]], # 12
  ]

# YOLOv3-tiny head
head: [
    [-1, 1, Conv, [1024, 3, 1]],
    [-1, 1, Conv, [256, 1, 1]],
    [-1, 1, Conv, [512, 3, 1]], # 15 (P5/32-large)

    [-2, 1, Conv, [128, 1, 1]],
    [-1, 1, nn.Upsample, [None, 2, "nearest"]],
    [[-1, 8], 1, Concat, [1]], # cat backbone P4
    [-1, 1, Conv, [256, 3, 1]], # 19 (P4/16-medium)

    [[19, 15], 1, Detect, [nc, anchors]], # Detect(P4, P5)
  ]

yolo.py:

class Detect(nn.Module):
    # YOLOv3 Detect head for detection models
    stride = None  # strides computed during build
    dynamic = False  # force grid reconstruction
    export = False  # export mode

    def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
        """Initializes YOLOv3 detection layer with class count, anchors, channels, and operation modes."""
        super().__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.empty(0) for _ in range(self.nl)]  # init grid
        self.anchor_grid = [torch.empty(0) for _ in range(self.nl)]  # init anchor grid
        self.register_buffer("anchors", torch.tensor(anchors).float().view(self.nl, -1, 2))  # shape(nl,na,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
        self.inplace = inplace  # use inplace ops (e.g. slice assignment)
        #self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        """
        Processes input through convolutional layers, reshaping output for detection.

        Expects x as list of tensors with shape(bs, C, H, W).
        """
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            x[i] = self.dequant(x[i])

            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                if isinstance(self, Segment):  # (boxes + masks)
                    xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                    xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
                else:  # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)

    def _make_grid(self, nx=20, ny=20, i=0, torch_1_10=check_version(torch.__version__, "1.10.0")):
        """Generates a grid and corresponding anchor grid with shape `(1, num_anchors, ny, nx, 2)` for indexing
        anchors.
        """
        d = self.anchors[i].device
        t = self.anchors[i].dtype
        shape = 1, self.na, ny, nx, 2  # grid shape
        y, x = torch.arange(ny, device=d, dtype=t), torch.arange(nx, device=d, dtype=t)
        yv, xv = torch.meshgrid(y, x, indexing="ij") if torch_1_10 else torch.meshgrid(y, x)  # torch>=0.7 compatibility
        grid = torch.stack((xv, yv), 2).expand(shape) - 0.5  # add grid offset, i.e. y = 2.0 * x - 0.5
        anchor_grid = (self.anchors[i] * self.stride[i]).view((1, self.na, 1, 1, 2)).expand(shape)
        return grid, anchor_grid

Why use this method? The person who posted it said it worked for him, so I might as well start with something that works.

Anyway, doing a print(model) gives me the following output:

DetectionModel(
  (model): Sequential(
    (0): QuantStub()
    (1): Conv(
      (conv): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv(
      (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(32, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv(
      (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(64, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv(
      (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (9): Conv(
      (conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): Conv(
      (conv): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (12): ZeroPad2d((0, 1, 0, 1))
    (13): MaxPool2d(kernel_size=2, stride=1, padding=0, dilation=1, ceil_mode=False)
    (14): Conv(
      (conv): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(1024, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (15): Conv(
      (conv): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (16): Conv(
      (conv): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (17): Conv(
      (conv): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(128, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (18): Upsample(scale_factor=2.0, mode='nearest')
    (19): Concat()
    (20): Conv(
      (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
      (act): SiLU(inplace=True)
    )
    (21): Detect(
      (m): ModuleList(
        (0-1): 2 x Conv2d(256, 255, kernel_size=(1, 1), stride=(1, 1))
      )
      (dequant): DeQuantStub()
    )
  )
)

Once again, I trained the model with the below command:

python3.8 -m torch.distributed.run --nproc_per_node 4 train.py --data coco.yaml --epochs 300 --weight '' --cfg yolov3-tiny.yaml --img 416 --batch-size 128

This model got an mAP of 0.25.

It did not, however, change my model size; maybe the author of that issue made some other modifications (a quick size-check sketch follows at the end of this list).

3. Instead of adding the dequantization layer the way it was done in the above-mentioned issue, I added the dequantization layer inside the YOLOv3-tiny YAML file. It looks as follows:

# Ultralytics YOLOv3 🚀, AGPL-3.0 license

# Parameters
nc: 80 # number of classes
depth_multiple: 1.0 # model depth multiple
width_multiple: 1.0 # layer channel multiple
anchors:
  - [10, 14, 23, 27, 37, 58] # P4/16
  - [81, 82, 135, 169, 344, 319] # P5/32

# YOLOv3-tiny backbone
backbone:
  # [from, number, module, args]
  [
    [-1, 1, torch.quantization.QuantStub,[]],
    [-1, 1, Conv, [16, 3, 1]], # 0
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 1-P1/2
    [-1, 1, Conv, [32, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 3-P2/4
    [-1, 1, Conv, [64, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 5-P3/8
    [-1, 1, Conv, [128, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 7-P4/16
    [-1, 1, Conv, [256, 3, 1]],
    [-1, 1, nn.MaxPool2d, [2, 2, 0]], # 9-P5/32
    [-1, 1, Conv, [512, 3, 1]],
    [-1, 1, nn.ZeroPad2d, [[0, 1, 0, 1]]], # 11
    [-1, 1, nn.MaxPool2d, [2, 1, 0]], # 12
  ]

# YOLOv3-tiny head
head: [
    [-1, 1, Conv, [1024, 3, 1]],
    [-1, 1, Conv, [256, 1, 1]],
    [-1, 1, Conv, [512, 3, 1]], # 15 (P5/32-large)

    [-2, 1, Conv, [128, 1, 1]],
    [-1, 1, nn.Upsample, [None, 2, "nearest"]],
    [[-1, 8], 1, Concat, [1]], # cat backbone P4
    [-1, 1, Conv, [256, 3, 1]], # 19 (P4/16-medium)

    [-1, 1, torch.quantization.DeQuantStub,[]],

    [[19, 15], 1, Detect, [nc, anchors]], # Detect(P4, P5)
  ]

The dequantization layer had to be added before the Detect layer to avoid running into errors.

The mAP was 0.25 (the same as with the method from the previous issue).
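
On the model-size point from item 2 above: QuantStub and DeQuantStub carry no parameters, so the checkpoint only shrinks once torch.quantization.convert() has replaced the float weights with INT8 ones. A minimal size-check sketch (float_model and int8_model are placeholder names):

import os

import torch


def on_disk_mb(module, path="tmp_weights.pt"):
    # Size of the serialized state_dict in MB; only convert() shrinks the weights
    torch.save(module.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb


# print(on_disk_mb(float_model), on_disk_mb(int8_model))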

Current questions I have: I am unsure whether this drop in mAP from adding the quantization and dequantization layers was supposed to happen. I did add those layers, but I made no other modifications to the model, and inference was still performed in float32 mode. Yet the mAP dropped.
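
One sanity check worth noting (a minimal sketch): before prepare()/convert() are called, QuantStub and DeQuantStub simply pass their input through unchanged, so by themselves they should not alter the float32 computation:

import torch

stub = torch.quantization.QuantStub()
destub = torch.quantization.DeQuantStub()
x = torch.randn(1, 3, 416, 416)
print(torch.equal(stub(x), x), torch.equal(destub(x), x))  # True True

# (Worth double-checking separately: the hard-coded layer indices in the head,
# e.g. [[-1, 8], 1, Concat, [1]] and [[19, 15], 1, Detect, ...], were written for
# the yaml without the extra stub entry, so they may now point at shifted layers.)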

I am going to keep up with my attempts to use quantization and get a better mAP. Any inputs are greatly appreciated; I am new to all of this.

glenn-jocher commented 1 month ago

@jackfaubshner hello! 😊

It's great to see your enthusiasm and detailed approach towards implementing quantization with YOLOv3-tiny. Quantization can indeed be a bit tricky, but you're on the right track. Here are some insights and suggestions that might help you improve your results:

Observations and Suggestions

  1. mAP Drop:

    • It's not uncommon to see a drop in mAP when introducing quantization, especially if the model is not fine-tuned post-quantization. Quantization introduces approximation errors which can affect the model's performance.
  2. Quantization Aware Training (QAT):

    • Since you've already tried Post Training Static Quantization (PTQ), you might want to explore Quantization Aware Training (QAT). QAT simulates quantization during the training process, allowing the model to adapt to the quantization noise. This often results in better performance compared to PTQ.
    • You can follow the PyTorch QAT tutorial to get started.
  3. Model Modifications:

    • Ensure that all layers that can benefit from quantization are included. Sometimes, certain layers might not be quantized properly, leading to suboptimal performance.
    • Verify that the quantization and dequantization layers are correctly placed. Typically, you want to quantize inputs as early as possible and dequantize outputs as late as possible.
  4. Calibration Dataset:

    • For PTQ, the choice of calibration dataset is crucial. Ensure that the calibration dataset is representative of the data the model will see during inference. This helps the observers choose better scales and reduces quantization error (a minimal calibration sketch follows this list).
  5. Inference Mode:

    • Ensure that the model is set to evaluation mode (model.eval()) during inference. This ensures that layers like BatchNorm are not updating their statistics during inference.
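
For reference, a minimal eager-mode PTQ calibration sketch could look like this (placeholder names: float_model is a fused float nn.Module and calib_loader yields representative images):

import torch
import torch.quantization

float_model.eval()
float_model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(float_model)  # insert observers

with torch.no_grad():
    for images, _ in calib_loader:  # a few hundred representative images are typical
        prepared(images)  # observers record activation ranges

int8_model = torch.quantization.convert(prepared)  # swap in quantized modules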

Example Code for QAT

Here's a simplified example to get you started with QAT:

import torch
import torch.quantization
from ultralytics import YOLO

# Load your model and grab the underlying nn.Module
yolo = YOLO("yolov3-tiny.yaml")
model = yolo.model

# Fuse Conv and BN layers (the built-in fuse() folds BatchNorm into the preceding Conv)
model.fuse()

# Prepare the model for QAT (prepare_qat expects a model in train mode)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model.train(), inplace=True)

# Train the model with QAT
# Be sure to use a representative dataset for training
# Example:
# train_model(model, train_loader, epochs=10)

# Convert the trained model to a quantized version
torch.quantization.convert(model.eval(), inplace=True)

# Save the quantized model
torch.save(model.state_dict(), 'yolov3-tiny-qat.pth')

# Load and run inference
model.load_state_dict(torch.load('yolov3-tiny-qat.pth'))
model.eval()
results = yolo("https://ultralytics.com/images/bus.jpg")
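
Note that the converted INT8 model runs on CPU: the 'fbgemm' qconfig targets x86, while 'qnnpack' is the usual choice for ARM devices.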

Keep experimenting and iterating on your approach. Quantization is a powerful tool, and with the right setup, you can achieve significant improvements in model efficiency with minimal loss in accuracy.

Best of luck with your quantization efforts! If you have any more questions, feel free to ask. 😊

jackfaubshner commented 1 month ago

Hello everyone again!

First off, big thanks to @glenn-jocher for your awesome work at Ultralytics!

I really appreciate that you personally reply to every single issue that shows up on Ultralytics repositories. I feel like you are a very down to earth person :)

I got sidetracked the last few days because I wanted to try the LeakyReLU activation function in YOLOv3-tiny (416 × 416) instead of SiLU. My mAP dropped from 0.31 to 0.305, so I guess it would be better to stick with SiLU.

Anyway, back to quantization: I have not yet tried Post Training Quantization (PTQ). I am first trying Quantization Aware Training (QAT), and so far I have only modified the model to add the quantization and dequantization layers, with no other change to the code (mAP dropped from 0.31 to 0.25). I believe that's just adding layers and not actually a proper implementation of QAT.

Yesterday, I directly modified "train.py" from this repository by adding the following lines to try Quantization Aware Training:

amp = False
model.eval()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
#model = torch.ao.quantization.fuse_modules(model,[['conv', 'bn', 'relu']]) #Threw an error because this repo merges Convolution, BatchNormalization and SiLU (ReLU) into a single block (see "class Conv" in "common.py")
model = torch.ao.quantization.prepare_qat(model.train())

Unfortunately, I ran into a bunch of errors. I fixed as much as I could to get training working, but my mAP was 0.0000002 after 10 epochs. Clearly, it was not going to work out. Some of the things implemented in "train.py" are not compatible with a model prepared for quantization. Looks like I am going to have to start from scratch and make my own "train.py".

I have never trained a model from scratch before and I am completely new to this. For reference, I will be using the "train.py" from this repository as well as the simplified example @glenn-jocher has provided above. And I believe I will have to make my own YAML file and separate the Convolution, BatchNormalization and SiLU (ReLU) layers to take advantage of torch.ao.quantization.fuse_modules(model, [['conv', 'bn', 'relu']]).
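
A rough sketch of what that separation could look like (hypothetical QConv class, not the repo's Conv; note that eager-mode fusion only supports patterns like Conv+BN and Conv+BN+ReLU, so SiLU would have to be swapped for ReLU for fusion to apply):

import torch
import torch.nn as nn


class QConv(nn.Module):
    # Quantization-friendly Conv block with separate, named submodules
    def __init__(self, c1, c2, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.ReLU(inplace=True)  # ReLU instead of SiLU so it can be fused

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


# Fusion over every QConv block (fuse_modules expects an eval-mode model for PTQ;
# fuse_modules_qat is the train-mode variant to use before prepare_qat):
# for m in model.modules():
#     if isinstance(m, QConv):
#         torch.ao.quantization.fuse_modules_qat(m, [["conv", "bn", "act"]], inplace=True)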

That will be my task for the next few days.

Thank you again @glenn-jocher and everyone at Ultralytics. I will post an update in two or three days.

Any inputs anyone has are greatly appreciated.

glenn-jocher commented 1 month ago

Hello @jackfaubshner,

Thank you for your kind words and for sharing your detailed progress! It's fantastic to see your dedication and thorough approach to experimenting with quantization and activation functions. 😊

Addressing Your Current Approach

You're correct that simply adding quantization and dequantization layers is not a full implementation of Quantization Aware Training (QAT). QAT requires the model to be trained with quantization noise simulated during the training process, which helps the model adapt better to the quantized environment.
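
As a quick illustration of that simulated quantization noise, here is a minimal sketch using PyTorch's fake-quantize op (the same operation QAT's FakeQuantize modules apply during training):

import torch

x = torch.randn(5)
# Round x onto an int8 grid (scale=0.1, zero_point=0, range [-128, 127]) but keep it
# as float32, so gradients can still flow through during training
x_fq = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, -128, 127)
print(x)
print(x_fq)  # same values, snapped to multiples of 0.1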

Modifying train.py for QAT

Given the issues you've encountered, here are some steps and tips to help you set up QAT more effectively:

  1. Model Preparation:

    • Ensure that your model is correctly set up for QAT by fusing the appropriate layers. Since YOLOv3-tiny uses a custom Conv class that integrates convolution, batch normalization, and activation, you might need to modify the model definition to separate these components for fusion.
  2. Custom Training Script:

    • Creating a custom train.py script is a good idea. You can start by simplifying the existing script and gradually adding the necessary components for QAT.
  3. Layer Fusion:

    • Modify the model definition to separate convolution, batch normalization, and activation layers. This will allow you to use torch.ao.quantization.fuse_modules.

Example Code for QAT

Here's a more detailed example to help you get started with QAT:

import torch
import torch.quantization
from ultralytics import YOLO

# Load your model and grab the underlying nn.Module
yolo = YOLO("yolov3-tiny.yaml")
model = yolo.model

# Modify the model to separate Conv, BN, and ReLU layers if needed
# Example modification (pseudo-code):
# model.backbone = nn.Sequential(
#     nn.Conv2d(...),
#     nn.BatchNorm2d(...),
#     nn.ReLU(...)
# )

# Fuse Conv and BN layers (the built-in fuse() folds BatchNorm into the preceding
# Conv; use torch.ao.quantization.fuse_modules per block if you keep BN separate)
model.fuse()

# Prepare the model for QAT (prepare_qat expects a model in train mode)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model.train(), inplace=True)

# Training loop (placeholder only: CrossEntropyLoss is a classification loss;
# a detection model needs a detection loss such as ComputeLoss in utils/loss.py)
def train_model(model, train_loader, epochs):
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss.item()}")

# Example training loop
# train_model(model, train_loader, epochs=10)

# Convert the trained model to a quantized version
torch.quantization.convert(model.eval(), inplace=True)

# Save the quantized model
torch.save(model.state_dict(), 'yolov3-tiny-qat.pth')

# Load and run inference
model.load_state_dict(torch.load('yolov3-tiny-qat.pth'))
model.eval()
results = yolo("https://ultralytics.com/images/bus.jpg")

Keep up the great work, and don't hesitate to reach out if you have more questions or need further assistance. The YOLO community and the Ultralytics team are here to support you. Looking forward to your updates! 😊