openvinotoolkit / nncf

Neural Network Compression Framework for enhanced OpenVINO™ inference
Apache License 2.0

Different model output after compression in NNCFNetwork and .onnx/.xml #1284

Closed korotaS closed 1 year ago

korotaS commented 2 years ago

Hi! I am compressing a PyTorch DB model (like this) with the quantization algorithm. I do it exactly as specified in the documentation and examples. After the last step:

compression_ctrl, compressed_model = create_compressed_model(torch_model, nncf_config)

I get a compressed_model, which is an NNCFNetwork wrapper around my torch model. If I test its quality right away, I get roughly the same metrics as the non-quantized model, which tells me I am doing everything correctly (metrics are computed the same way as in the PTQ/QAT PyTorch examples). But obviously I need to export it to ONNX and then to OpenVINO IR. First, ONNX:

compression_ctrl.export_model('model.onnx', save_format='onnx_12')  # I need opset higher than 10

And then to IR with MO:

mo --input_model model.onnx --input_shape "[1,3,1..2048,1..2048]" --output_dir model_ir/

The problem comes when I check inference of my .onnx or .xml model with the OpenVINO runtime: it produces essentially random output... I tried to debug the .onnx model and noticed that if I ignore ReLU activations in the NNCFConfig and compress again with the new config, the model starts producing good output, but it becomes much slower (I suppose because half of the network stays at full precision). Do you have any idea why this could happen? Any ideas would be appreciated.
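A minimal sketch of what I mean by ignoring the ReLU activations (the regex pattern here is illustrative, not the exact scope name from my model):

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 512, 512]},
    "compression": {
        "algorithm": "quantization",
        "ignored_scopes": ["{re}.*ReLU.*"]  # illustrative pattern; real scope names come from the compression log
    },
}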

NNCF version: 2.2.0, OpenVino version: 2022.2.0, torch version: 1.9.1

AlexKoff88 commented 2 years ago

Hi @korotaS,

Can you please share the NNCF config?

korotaS commented 2 years ago

@AlexKoff88 of course:

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 512, 512]},
    "log_dir": 'notebooks/nncf_logs/', 
    "compression": {
        "algorithm": "quantization",  # specify the algorithm here,
        "preset": "performance",
        "ignored_scopes": ["{re}.*StepFunctionCatInter*"]
    },
}

About StepFunctionCatInter - I have some mathematical operations at the end (add, exp, interpolate) which don't need to be quantized. (I tested without "ignored_scopes": ["{re}.*StepFunctionCatInter*"] and the output from the NNCFNetwork was very bad; if I ignore these last operations, the output becomes good, however the ONNX output still seems random.) I can also provide the onnx model: db_nncf_quant.onnx.zip

korotaS commented 2 years ago

So I think I have located the problem, although I don't know what is causing it. There is quite a lot of code here, but it is just to reproduce the issue. I create a simple class which simulates one block from my network:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, bias=False)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(3, 32, 3, bias=False)
        self.bn2 = nn.BatchNorm2d(32)
        self.act = nn.ReLU()
        self.conv3 = nn.Conv2d(32, 64, 3, bias=False)

    def forward(self, x):
        x1 = self.conv1(x)
        x1 = self.bn1(x1)
        x2 = self.conv2(x)
        x2 = self.bn2(x2)
        x1_x2 = x1 + x2
        x1_x2 = self.act(x1_x2)
        x3 = self.conv3(x1_x2)
        return x3

And try to compress it:

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 512, 512]},
    "log_dir": 'notebooks/nncf_logs/',  # log directory for NNCF-specific logging outputs
    "compression": {
        "algorithm": "quantization",  # specify the algorithm here,
    },
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, DefaultInitializingDataLoader(data_loader))
compression_ctrl, model_quant = create_compressed_model(model, nncf_config)
compression_ctrl.export_model('model.onnx', save_format='onnx_10')
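DefaultInitializingDataLoader is not shown here; roughly, it is a thin wrapper along the lines of NNCF's PTInitializingDataLoader (a sketch; the batch structure below is an assumption about my dataset):

from nncf.torch.initialization import PTInitializingDataLoader

class DefaultInitializingDataLoader(PTInitializingDataLoader):
    def get_inputs(self, dataloader_output):
        # the loader yields (images, targets); only the images are fed to the model
        images, targets = dataloader_output
        return (images,), {}

    def get_target(self, dataloader_output):
        return dataloader_output[1]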

I also create a class for inference in OpenVINO:

import openvino.runtime as ov
import numpy as np
import torch

class OpenVinoEngine:
    def __init__(self, model_fpath, num_workers):
        self.core = ov.Core()
        self.raw_model = self.core.read_model(model_fpath, "AUTO")
        # allow dynamic spatial dimensions in the 1..2048 range
        self.raw_model.reshape([1, 3, (1, 2048), (1, 2048)])
        self.model = self.core.compile_model(self.raw_model, "CPU", config={"INFERENCE_NUM_THREADS": str(num_workers)})
        self.infer_request = self.model.create_infer_request()

    def process(self, batch):
        self.infer_request.infer([batch])
        output = [out.data[:] for out in self.infer_request.output_tensors]
        if len(output) > 1:
            return output
        return output[0]

Then I run inference with the NNCFNetwork (which is model_quant) and with OpenVINO (from the .onnx file) and compare the outputs with a simple MSE:

inp = torch.rand(1, 3, 2048, 2048) * 10
engine = OpenVinoEngine('model.onnx', 4)
out_ov = engine.process(inp.numpy())
out_nncf = model_quant(inp)
diff = torch.nn.functional.mse_loss(out_nncf, torch.from_numpy(out_ov)).item()  # something like 0.01

So I see that the outputs are not equal. When I add this key/value to the config: "ignored_scopes": ["{re}.*__add___*"], the diff variable becomes 7.355652198448581e-11, so the outputs are almost exactly equal, which is expected. As I see from Netron (debugging the .onnx model), the difference is in the FakeQuantize before Add (the model from the first image yields bad output and the one from the second image yields good output):

[Netron screenshots: the first graph (Add quantized) and the second graph (Add in ignored_scopes)]
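For reference, the only change in the second run relative to the config above is the extra ignored_scopes entry (a sketch):

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 512, 512]},
    "compression": {
        "algorithm": "quantization",
        "ignored_scopes": ["{re}.*__add___*"]  # skip activation quantization of the add
    },
}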

If I add "ignored_scopes": ["{re}.*__add___*"] to my original DB model, the output becomes OK but the inference time increases. I suppose this is because the ReLU layers following the Add layers also become ignored and thus stay unquantized at full precision. Compression logs:

...
INFO:nncf:Ignored adding Activation input quantizer for: 5 Model/__add___0
6 Model/ReLU[act]/relu_0
...

Maybe this is a known problem? Or is there a way to change the compression config or the model code to overcome this issue?

AlexKoff88 commented 2 years ago

Thanks, @korotaS! This looks like the absence of a specific fusion pattern in the NNCF PyTorch backend. Is it possible for you to share the quantized and non-quantized models so that we can check the performance on our side? Random weights are OK.

korotaS commented 2 years ago

@AlexKoff88 can you please specify which models? From the Model class or from my DB class? Also, in which format - non-quantized as .pt (torch.save or torch.jit.save) and quantized as .onnx?

AlexKoff88 commented 2 years ago

I looked at the model you attached in the previous post and it does not look good to me. As far as I understand, the original problem is low accuracy after quantization, right? Have you tried to fine-tune the quantized model? Can you please share the quantized model without any ignored_scopes (the one you got during the first try with NNCF)?

korotaS commented 2 years ago

The accuracy right after quantization (without ignored_scopes) is almost zero, which seems odd to me. I tried to fine-tune it, but it takes a couple of days to converge (there are some data and resource restrictions), so I only watched the first 10-20 epochs and the metric was still zero. I've quantized models before and they always showed pretty good metrics right after quantization, with further fine-tuning increasing the metric by just 1-2% (the same behavior can be seen in the NNCF examples). Original quantized model without ignored_scopes (produces random output): db_nncf_quant_no_ignore.onnx.zip

AlexKoff88 commented 2 years ago

Thanks, I noticed a couple of problems. One is related to the NNCF quantization (see the attached screenshot): FakeQuantize should not be propagated through the ReLU. Because of this, we use the wrong scales to approximate the output of the preceding Add operation, while that output will be clamped by the ReLU, so we lose representation precision. I would ask @vshampor to comment on this first issue. Maybe we can fix it quickly, so that you can use the latest version from the develop branch and try again.
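Roughly speaking, the effect can be illustrated with a toy example (illustrative numbers only, not taken from the actual model):

import torch

def fake_quant(t, lo, hi, levels=256):
    # uniform fake-quantization to `levels` values in [lo, hi]
    scale = (hi - lo) / (levels - 1)
    return ((t.clamp(lo, hi) - lo) / scale).round() * scale + lo

x = torch.randn(10000)             # pre-ReLU activations, roughly symmetric around 0
ref = torch.relu(x)                # what the network actually uses downstream

# scales picked for the full pre-ReLU range waste half of the levels on negatives
q_before = torch.relu(fake_quant(x, x.min().item(), x.max().item()))
# scales picked for the clamped range spend all levels on useful values
q_after = fake_quant(ref, 0.0, x.max().item())

print((q_before - ref).abs().mean().item())  # noticeably larger than the next line
print((q_after - ref).abs().mean().item())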

Another problem can be related to the fact that we quantize the bottom part of the model, which is usually accuracy-sensitive (see the second screenshot). It is better to add all these operations to the ignored scope.

korotaS commented 2 years ago

@AlexKoff88 Thank you for the explanation! The bottom part of the network (the sub, mul, exp, add, div and concat operations) is now wrapped in a separate module called StepFunctionCatInter, which is added to ignored_scopes, so it doesn't affect the performance. For now I have added `add` to ignored_scopes too and it works fine (accuracy-wise). But if there is a more elegant and correct solution, it would be very helpful!
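For completeness, the config that currently works for me looks roughly like this (a sketch combining the two patterns discussed above):

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 512, 512]},
    "compression": {
        "algorithm": "quantization",
        "preset": "performance",
        "ignored_scopes": [
            "{re}.*StepFunctionCatInter*",  # the math ops at the end (sub, mul, exp, add, div, concat)
            "{re}.*__add___*"               # keeps the scales before the following ReLU correct
        ]
    },
}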

MaximProshin commented 1 year ago

@korotaS , as I see from your last comment, you were able to solve the accuracy problem. Do you have any objections if we close this issue?

korotaS commented 1 year ago

@MaximProshin yes, we can close it.