pytorch / ao

PyTorch native quantization and sparsity for training and inference

Very large discrepancy in the quantized model's output compared to the original model when quantizing on CPU #1335

Open JohnnyRacer opened 2 days ago

JohnnyRacer commented 2 days ago

Quantization on GPU works as expected with very small errors, but on CPU there seems to be a problem with the quantized model's output. Here is the code to replicate the problem.

import torch
import torch.nn as nn
from torch.nn import functional as F
from torchao.quantization.quant_api import (
    quantize_,
    int4_weight_only,
)

class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 20)
        self.linear2 = nn.Linear(20, 30)
        self.relu = nn.ReLU()
        self.seq = nn.Sequential(nn.Linear(30,40), nn.ReLU())

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        x = self.seq(x)
        return x

model = TestModel()
cpu_quant_model = TestModel()
cpu_quant_model.load_state_dict(model.state_dict())  # copy weights so both models share the same baseline

device = "cuda:0"

model.to(device)
cpu_quant_model.cpu()

test_input = torch.randn((10, 10), device=device)
original_output = model(test_input)

quantize_(model, int4_weight_only()) # Quantize the model on GPU

quanted_output = model(test_input)
print(F.mse_loss(original_output, quanted_output)) # Only a very small difference of 6.8689e-08 

quantize_(cpu_quant_model, int4_weight_only()) # Quantize the model on CPU

cpu_quanted_output = cpu_quant_model(test_input.cpu())
print(F.mse_loss(original_output, cpu_quanted_output.to(device))) # Large difference of 0.0281, several orders of magnitude bigger than the GPU error
jerryzh168 commented 2 days ago

quant_model is not defined, so I'm not sure what you mean there,

but this might be a known issue: https://github.com/pytorch/ao/issues/1117, which we are fixing in https://github.com/pytorch/ao/pull/1278 and should land soon

JohnnyRacer commented 2 days ago

@jerryzh168 Sorry, I changed a few things while testing and left out the line where the model was quantized on the CPU, but the point is that the model's output when quantized on the CPU differs significantly from when it is quantized on the GPU. I don't think it's related to #1117, since the difference is the same when executing the CPU-quantized model on the CPU itself, rather than quantizing on the CPU and executing the model on the GPU.

(screenshot attached)
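Roughly, the two comparisons look like this (a sketch continuing the snippet above; it assumes the CPU-quantized module can be moved to CUDA with .to(), which may not hold for every layout):

# Quantized on CPU, run on CPU
cpu_out = cpu_quant_model(test_input.cpu())
print(F.mse_loss(original_output.cpu(), cpu_out))

# Quantized on CPU, then moved to and run on the GPU
gpu_copy = cpu_quant_model.to(device)
gpu_out = gpu_copy(test_input)
print(F.mse_loss(original_output, gpu_out))

Both prints show the same large difference.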

jerryzh168 commented 1 day ago

How do you get cpu_quant_model? int4_weight_only only works on CUDA, IIRC.
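
Something like the guard below would make that CUDA-only assumption explicit (just a sketch, maybe_quantize_int4 is not a torchao API):

# Sketch of a device guard (illustrative only): apply int4_weight_only
# only when the module's weights actually live on CUDA.
def maybe_quantize_int4(m):
    if next(m.parameters()).is_cuda:
        quantize_(m, int4_weight_only())
    else:
        print("int4_weight_only targets CUDA; leaving this CPU module unquantized")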

JohnnyRacer commented 1 day ago

quantize_(cpu_quant_model, int4_weight_only()) runs fine without any errors or warnings on the CPU. You can run the code snippet I provided above and it should show the difference. cpu_quant_model was always on the CPU and was never moved to the GPU.
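
For reference, here is a quick check that quantize_ really does rewrite the weights on CPU without complaining (a sketch; the exact tensor subclass name may vary across torchao versions):

# Before quantization the weight is a plain nn.Parameter; afterwards it is
# a torchao quantized tensor subclass, even on CPU, with no warning raised.
m = TestModel()                      # stays on CPU
print(type(m.linear1.weight))
quantize_(m, int4_weight_only())
print(type(m.linear1.weight))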