openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: Accuracy issue on various CPU platforms with MobileNetV3-Large and ShuffleNetV2-x0.5 #25342

Closed rememberBr closed 2 weeks ago

rememberBr commented 1 month ago

OpenVINO Version

2024.0, 2024.1, 2024.2

Operating System

Windows System

Device used for inference

CPU

Framework

None

Model used

MobileNetV3, ShuffleNetV2

Issue description

After the model was quantized to INT8, the inference results were inconsistent across CPUs. I tried the same model on four CPUs: i5-7400, i7-8700, i7-12700, and i5-12500, and the inference results differed. On the same CPU model but different computers, the results appeared to be consistent. The FP32 XML/BIN models produced consistent results across different computers and different CPUs, but the INT8 model produced inconsistent results across CPU models.

Step-by-step reproduction

Directly pull the official MobileNetV3-Large and ShuffleNetV2-x0.5 models, quantize them to int8, conduct multiple inferences on different CPUs, and check whether the results are completely consistent.
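
For reference, a minimal sketch of this kind of cross-machine consistency check (the model path and input shape are assumptions; a fixed random seed makes the input identical on every machine):

import hashlib
import numpy as np
import openvino as ov

core = ov.Core()
# Hypothetical path to the quantized model; replace with your own
compiled = core.compile_model("mobilenetv3_int8.xml", "CPU")

# Deterministic input so every machine sees identical data
inp = np.random.RandomState(0).rand(1, 3, 224, 224).astype(np.float32)
out = compiled(inp)[compiled.output(0)]

# Compare this digest across CPUs: it matches only if results are bit-identical
print(hashlib.md5(out.tobytes()).hexdigest())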

Relevant log output

No response


wenjiew commented 1 month ago

@rememberBr Can you share more details on the way you quantize the models? Thanks!

rememberBr commented 1 month ago

@wenjiew Hello, I'm using the nncf.quantize_with_accuracy_control API with OpenVINO 2024.0.0 and NNCF 2.10.0, quantizing on an Intel(R) Xeon(R) Gold 6151 CPU. Quantization is done on Ubuntu 20.04, and inference runs on Windows 10. On the same CPU, the results of OpenVINO 2024.1 and 2024.2 differ. On different CPUs, even with the same OpenVINO version, the results also differ.

yuxu42 commented 1 month ago

@rememberBr Could you share the quantized model you're using? It will be helpful for us to quickly reproduce. Thanks!

rememberBr commented 1 month ago

@yuxu42 I can share some of the models; the problem may mainly come from here: shuffleNetV2_x05.zip. Please let me know if there is any progress in the testing, including testing of other models. Thank you!

yuxu42 commented 1 month ago

@rememberBr btw, two more questions: a. Which platform did you use to quantize the model? b. Are the results on i7-12700 and i5-12500 different or not? And are the results on i5-7400 and i7-8700 different or not?

liubo-intel commented 1 month ago

> @yuxu42 I can share some of the models; the problem may mainly come from here: shuffleNetV2_x05.zip. Please let me know if there is any progress in the testing, including testing of other models. Thank you!

Hi @rememberBr: could you please also provide the FP32 model of this shuffleNetV2_x05? By the way, regarding the 'inconsistent results' you mentioned: did you observe this with the OpenVINO classification samples (e.g., this one, with the same input image)? And what is the approximate magnitude of the discrepancy (does the label change, or does only the probability change for the same label)? This information may help us locate the issue quickly, thanks!

rememberBr commented 1 month ago

@yuxu42 a: I'm not sure what platform you're referring to, but let me describe it as fully as possible. The system is Ubuntu 20.04 x64, the language is Python, and the OpenVINO and NNCF versions are 2024.0.0 and 2.10.0, respectively. The CPU is from Intel's Xeon series; the specific models are listed above. b: Sorry, I need to make a correction. The results for i7-12700 and i5-12500 are the same, and the results for i5-7400 and i7-8700 are also the same. However, the results for i7-12700 and i5-7400 are different. The results of different OpenVINO versions (tested on 2024.1.0 and 2024.2.0) on the same CPU also differ.

rememberBr commented 1 month ago

@liubo-intel The corresponding fp32 model is here: fp32.zip

Regarding the specific manifestation of "inconsistent results": I did not carefully compare the output values for the same image. Instead, after testing many images, I counted the number of images assigned to a certain category. On the 12th-gen CPU, the count for that category was 693; on the 7th- and 8th-gen CPUs, it was 702.

liubo-intel commented 1 month ago

Hi @rememberBr: thanks for the information. It seems even the label result changes; we will have a look at this issue soon. If possible, could you upload one of the images that is in the category on the 7th- and 8th-gen CPUs but not on the 12th-gen CPU (perhaps found by diffing the 702-image name list against the 693-image name list)? It could serve as input data while reproducing and debugging this issue.

rememberBr commented 1 month ago

@liubo-intel I don't think it's necessary to search specifically for images that get different labels in order to see the differences in results. Take test.png, for example: [image: test.png]

On the i5-7400, the output of one of the models is: [screenshot: i5-7400 output]

On the i7-12700, the output is: [screenshot: i7-12700 output]

You can see a difference in the values. Although it may not be significant, from a probability perspective there must be certain images where the post-softmax probabilities of two or more categories are extremely close. In that case, a slight numerical shift will change the final category, I think.
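
A tiny illustration of this argument (the numbers are made up): when two post-softmax probabilities are nearly tied, a shift on the order of the kernel difference is enough to flip the predicted label.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.000, 1.999, -1.0])         # two classes nearly tied
shifted = logits + np.array([0.0, 0.002, 0.0])  # tiny shift, e.g. from a different int8 kernel

# The probabilities barely move, but the argmax flips from class 0 to class 1
print(softmax(logits).argmax(), softmax(shifted).argmax())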

liubo-intel commented 3 weeks ago

Hi @rememberBr: the issue can be reproduced on our side. The difference between the two kinds of platforms (12th-gen CPU vs. 7th/8th-gen CPU) mainly comes from one kind of 1x1 convolution implementation. This 1x1 convolution uses different hardware instructions for INT8 models on these platforms (and the 12th-gen generally uses more efficient ones). In common cases, different hardware instructions may lead to slight output differences, but in your case the difference seems quite large, so we will sync with the hardware instructions team about this. Before that, could you please help double-check whether this difference is an improvement or a regression in your actual use case? You mentioned "On the 12th-gen CPU, the number of images in that category was 693; on the 7th- and 8th-gen CPUs, the measured result was 702": are these extra 702 - 693 = 9 images classified correctly or incorrectly?

And by the way, we found a similar report, pot_saturation_issue. As described there, "older Intel CPU generations (e.g., 7th/8th-gen CPUs) have accuracy issues with some INT8 models, while CPUs with Intel Deep Learning Boost (VNNI) technology (e.g., 12th-gen CPUs) do not have this kind of accuracy issue." If the accuracy difference you are facing is the same as the reported pot_saturation_issue, the accuracy on the 12th-gen CPU should be the good one, and there are some workarounds (described in that report) to improve accuracy on 7th/8th-gen CPUs during model quantization, for your reference.

rememberBr commented 3 weeks ago

@liubo-intel Thank you very much. If the problem lies in the differences in the underlying hardware instructions invoked, is it something you cannot solve, since it involves the hardware, and older-generation hardware may not support some of the efficient instructions on the newer hardware?

It's hard to say whether this is a good or bad influence. In our multi-class problem, the effect may be that one class gets a little better and another a little worse. For me, the biggest risk is that quantizing, screening, and deploying the model on different hardware may affect model selection. For example, quantization is done on a compute server with an Intel Xeon CPU; multiple quantized models are handed to an algorithm test engineer, who tests them on an independent test set to pick the best one; and the chosen model is finally deployed on Intel 12th- or 13th-gen CPUs. Conclusions drawn on different hardware at different stages may not be optimal on the deployment platform.

Regarding the pot_saturation_issue you mentioned: since I use NNCF for quantization, I'm not sure it's the same problem, because your link says: "If you observe the saturation issue, try the 'all' option during model quantization. If the accuracy problem still occurs, try using Quantization-aware training from NNCF and fine-tuning the model." Does that mean NNCF can avoid this problem? (Or is it that quantization-aware training improves accuracy and thereby indirectly avoids the accuracy drop?)

Finally, I still have one doubt. Will there also be differences between results on Xeon-series CPUs and on Intel 12th-gen CPUs? I used nncf.quantize_with_accuracy_control, which adjusts the quantization strategy according to the accuracy loss. Is it possible that the quantization strategy NNCF considers best on a Xeon CPU is not optimal on a Core-series CPU? If so, do I need to ensure that quantization, testing, screening, and deployment are all carried out on devices with the same configuration?

liubo-intel commented 3 weeks ago

Hi, @rememberBr : some answers to your questions

> @liubo-intel Thank you very much. If the problem lies in the differences in the underlying hardware instructions invoked, is it something you cannot solve, since it involves the hardware, and older-generation hardware may not support some of the efficient instructions on the newer hardware?

Yes, it seems so. These VNNI instructions can't be ported to platforms that do not support them.

> It's hard to say whether this is a good or bad influence. In our multi-class problem, the effect may be that one class gets a little better and another a little worse. For me, the biggest risk is that quantizing, screening, and deploying the model on different hardware may affect model selection. For example, quantization is done on a compute server with an Intel Xeon CPU; multiple quantized models are handed to an algorithm test engineer, who tests them on an independent test set to pick the best one; and the chosen model is finally deployed on Intel 12th- or 13th-gen CPUs. Conclusions drawn on different hardware at different stages may not be optimal on the deployment platform.

Considering your case, it seems better to use the same kind of CPU platform series for quantization, model testing, and deployment, e.g., platforms that all support VNNI instructions. You can check this via the CPU flags: with the lscpu command on Linux, if a 'vnni' label appears in the Flags field, the CPU supports VNNI instructions.
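
A quick way to script this check (a Linux-only sketch reading /proc/cpuinfo, whose flags mirror lscpu's; the 'vnni' substring matches both avx512_vnni and avx_vnni):

# Linux only; on Windows use a tool such as CPU-Z instead
with open("/proc/cpuinfo") as f:
    flags_line = next(line for line in f if line.startswith("flags"))
print("VNNI supported:", "vnni" in flags_line)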

> Regarding the pot_saturation_issue you mentioned: since I use NNCF for quantization, I'm not sure it's the same problem, because your link says: "If you observe the saturation issue, try the 'all' option during model quantization. If the accuracy problem still occurs, try using Quantization-aware training from NNCF and fine-tuning the model." Does that mean NNCF can avoid this problem? (Or is it that quantization-aware training improves accuracy and thereby indirectly avoids the accuracy drop?)

Sorry, I'm not an expert on the NNCF component, but from my understanding it means that quantization-aware training improves accuracy and thereby indirectly avoids the accuracy drop on CPU platforms that do not support VNNI instructions (e.g., 7th/8th-gen CPUs). Hi @KodiaqQ, do you have any suggestions about this?

> Finally, I still have one doubt. Will there also be differences between results on Xeon-series CPUs and on Intel 12th-gen CPUs? I used nncf.quantize_with_accuracy_control, which adjusts the quantization strategy according to the accuracy loss. Is it possible that the quantization strategy NNCF considers best on a Xeon CPU is not optimal on a Core-series CPU? If so, do I need to ensure that quantization, testing, screening, and deployment are all carried out on devices with the same configuration?

From my understanding, this kind of saturation issue can also exist on Xeon-series CPUs if they do not support VNNI instructions. So, to avoid it, it seems better either to retrain with quantization-aware training for older Intel CPU generations, or to use CPU generations that support VNNI (whether Xeon or Core) for quantization, testing, screening, and deployment. @KodiaqQ, do you have any suggestions about this?

KodiaqQ commented 3 weeks ago

Hi @liubo-intel, @rememberBr. From NNCF's perspective, it looks like the saturation issue. It may be easily fixed in NNCF using advanced quantization parameters. But first, as I see from the details, the nncf.quantize_with_accuracy_control API was used. May I ask @rememberBr why you chose it? This API utilizes an advanced quantization algorithm and tries to push the model to predefined accuracy limits by returning some layers to FP32 precision, so it may consume much time and memory during quantization.

Let me show an example of the saturation issue fix (nncf.OverflowFix as part of the nncf.AdvancedQuantizationParameters option) using the most common and fastest quantization flow:

import nncf  # assumed available; NNCF 2.10.0 was used in this thread

quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    ...,
    # Apply the overflow (saturation) fix to the first quantized layer only
    advanced_parameters=nncf.AdvancedQuantizationParameters(overflow_fix=nncf.OverflowFix.FIRST_LAYER),
)

Here are the details for the overflow fix option: https://github.com/openvinotoolkit/nncf/blob/develop/nncf/quantization/advanced_parameters.py#L32-L58. If you still observe differences even with this option, try nncf.OverflowFix.ENABLE.

I recommend starting with nncf.quantize as the basic approach to model quantization. It lets you get a fully quantized, best-performing model quickly and simply, and it has a number of parameters that may help with accuracy. nncf.quantize_with_accuracy_control is the next step for the scenario where the previous flow does not give you good accuracy.

Here is an example of the saturation issue fix with nncf.quantize_with_accuracy_control as well:

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset,
    validation_dataset,
    validation_fn,
    ...,
    advanced_parameters=nncf.AdvancedQuantizationParameters(overflow_fix=nncf.OverflowFix.FIRST_LAYER),
)

If it doesn't help, please let me know and we'll try to help you.
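
For completeness, the validation_fn above is expected to score a model on the validation dataset and return a single metric value. A hypothetical top-1 accuracy implementation might look like this (assuming the function receives a compiled model and an iterable of (image, label) pairs; check the NNCF documentation for the exact contract):

import numpy as np
import openvino as ov

def validate(compiled_model: ov.CompiledModel, validation_data) -> float:
    # Hypothetical layout: each item is an (image, label) pair
    correct = 0
    total = 0
    for image, label in validation_data:
        result = compiled_model(np.expand_dims(image, 0))[compiled_model.output(0)]
        correct += int(np.argmax(result) == label)
        total += 1
    return correct / total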

rememberBr commented 2 weeks ago

@liubo-intel @KodiaqQ First of all, I would like to thank both of you for your help. And I'm sorry for being too busy lately and only replying now.

Regarding @liubo-intel's reply: I first attempted to verify whether the CPUs I was using support the VNNI instructions. I used CPU-Z, the Intel® Processor Identification Utility, and oneDNN's benchdnn to check whether the Xeon processors, the i5-7400, and the i7-12700 support VNNI, but the results showed that none did. Further inquiry revealed that VNNI is usually an extension of AVX-512, named AVX512_VNNI, and AVX512_VNNI is part of DL Boost (the Intel® Processor Identification Utility shows that the i7-12700 does not support AI Boost; I'm not sure whether that is DL Boost). I began to doubt whether the problem was caused by VNNI at all, but since @KodiaqQ's suggestion did solve the problem, the root cause was indeed the saturation issue. So, is it because the AVX-512 instruction set was not explicitly enabled in the BIOS that I couldn't query it? And does OpenVINO not require explicit enabling of the related instruction sets, invoking them directly at a low level? (This question is actually not that important to me because the problem has indeed been solved.)

Thanks again for @KodiaqQ's constructive advice. First, to answer your question: nncf.quantize_with_accuracy_control was used because it is indeed better at preserving accuracy, at least on our test set, and its memory and time consumption during quantization is acceptable. We are more sensitive to the accuracy and speed of the quantized model.

Based on your suggestion, the results on different hardware are now aligned, but I have some new questions:

  1. Theoretically, does using nncf.OverflowFix have an impact on the accuracy and speed of the quantized INT8 model?
  2. Compared with nncf.quantize, is there a difference in the inference speed of the INT8 model exported by nncf.quantize_with_accuracy_control?

For the above two questions, if there is a clear conclusion within NNCF, please tell me. If not, I will conduct experiments to verify it myself.

liubo-intel commented 2 weeks ago

> @liubo-intel @KodiaqQ First of all, I would like to thank both of you for your help. And I'm sorry for being too busy lately and only replying now.
>
> Regarding @liubo-intel's reply: I first attempted to verify whether the CPUs I was using support the VNNI instructions. I used CPU-Z, the Intel® Processor Identification Utility, and oneDNN's benchdnn to check whether the Xeon processors, the i5-7400, and the i7-12700 support VNNI, but the results showed that none did. Further inquiry revealed that VNNI is usually an extension of AVX-512, named AVX512_VNNI, and AVX512_VNNI is part of DL Boost (the Intel® Processor Identification Utility shows that the i7-12700 does not support AI Boost; I'm not sure whether that is DL Boost). I began to doubt whether the problem was caused by VNNI at all, but since @KodiaqQ's suggestion did solve the problem, the root cause was indeed the saturation issue. So, is it because the AVX-512 instruction set was not explicitly enabled in the BIOS that I couldn't query it? And does OpenVINO not require explicit enabling of the related instruction sets, invoking them directly at a low level? (This question is actually not that important to me because the problem has indeed been solved.)
>
> Thanks again for @KodiaqQ's constructive advice. First, to answer your question: nncf.quantize_with_accuracy_control was used because it is indeed better at preserving accuracy, at least on our test set, and its memory and time consumption during quantization is acceptable. We are more sensitive to the accuracy and speed of the quantized model.
>
> Based on your suggestion, the results on different hardware are now aligned, but I have some new questions:
>
> 1. Theoretically, does using nncf.OverflowFix have an impact on the accuracy and speed of the quantized INT8 model?
> 2. Compared with nncf.quantize, is there a difference in the inference speed of the INT8 model exported by nncf.quantize_with_accuracy_control?
>
> For the above two questions, if there is a clear conclusion within NNCF, please tell me. If not, I will conduct experiments to verify it myself.

Hi @rememberBr: glad to see that your problem has been solved by the NNCF method. About your instruction-set question: as far as I know, neither the i5-7400 nor the i7-12700 is an AVX-512 platform; both are AVX2 platforms. And VNNI is not limited to AVX-512: some AVX2 platforms (e.g., this i7-12700) also support VNNI instructions. In general, OpenVINO does not require explicitly enabling the related instruction sets; it will use them automatically, provided the related drivers for your hardware are correctly installed.
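
If it helps, you can also ask the runtime what the CPU plugin detected; a small sketch querying its advertised optimization capabilities (the exact strings vary by platform):

import openvino as ov

core = ov.Core()
# e.g. ['FP32', 'INT8', 'BIN', 'EXPORT_IMPORT'] depending on the CPU
print(core.get_property("CPU", "OPTIMIZATION_CAPABILITIES"))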

rememberBr commented 2 weeks ago

@liubo-intel OK, got it. Since the initial issue has been resolved, I will close this issue. Finally, I wish you all the best.