openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

No speedup after quantization of MobilenetV2 architecture on OpenVINO 2020.4 #1858

Closed nmVis closed 4 years ago

nmVis commented 4 years ago

Hi, I'm having a problem with quantizing the MobileNet V2 architecture. I expect quantization to improve its performance, but I don't get the expected result.

The ONNX model I'm using is available at the following link, which is from the official ONNX repo.

After converting it with the mo.py script, the model runs at around 10 ms.

I then quantized it using the following JSON and YAML files: mnet.json

{
    /* Model */
    "model": {
        "model_name": "mnet_v2", // Model name
        "model": "mnetv2.vino.xml", // Path to model (.xml format)
        "weights": "mnetv2.vino.bin" // Path to weights (.bin format)
    },
    /* Parameters of the engine used for model inference. */
    /* Post-Training Optimization Tool supports engine based on accuracy checker and custom engine.
       For custom engine you should specify your own set of parameters.
       The engine based on accuracy checker uses accuracy checker parameters. You can specify the parameters
       via accuracy checker config file or directly in engine section.
       More information about accuracy checker parameters can be found here:
       https://github.com/opencv/open_model_zoo/tree/master/tools/accuracy_checker */
    "engine": {
        //"type": "simplified", // OR default value "type": "accuracy_checker" for non simplified mode
        "type": "accuracy_checker",
        // you can specify path to directory with images
        // also you can specify template for file names to filter images to load
        // templates are unix style (This option valid only in simplified mode)
        //"data_source": "D:/temp/BS/Data/BS/test/originals",
        "config": "mnet_v2.yml",
    },
    /* Optimization hyperparameters */
    "compression": {
        "target_device": "CPU", // target device, the specificity of which will be taken into account during optimization
        "algorithms": [
            {
                "name": "DefaultQuantization", // optimization algorithm name
                "params": {
                    /* A preset is a collection of optimization algorithm parameters that tells the algorithm
                    which metric it should concentrate on improving. Each optimization algorithm supports
                    [performance, accuracy] presets */
                    "preset": "performance",
                    "stat_subset_size": 200, // Size of subset to calculate activations statistics that can be used
                                             // for quantization parameters calculation.
                    /* Manual specification of quantization parameters */
                    /* Quantization parameters for weights */

                    "weights": {
                        "bits": 8,           // Number of quantization bits
                        "mode": "symmetric", // Quantization mode
                        "granularity": "perchannel", // Granularity: a scale for each output channel.
                        "level_low": -127, // Low quantization level
                        "level_high": 127, // High quantization level
                        /* Parameters specify how to calculate the minimum and maximum of quantization range */
                        "range_estimator": {
                            "max": {
                                "type": "quantile",
                                "outlier_prob": 0.0001
                            }
                        }
                    },
                    /* Quantization parameters for activations */
                    "activations": {
                        "bits": 8,                  // Number of quantization bits
                        "mode": "asymmetric",       // Quantization mode
                        "granularity": "pertensor", // Granularity: one scale for output tensor.
                        /* Parameters specify how to calculate the minimum and maximum of quantization range */
                        "range_estimator": {
                            "preset": "quantile",
                            /* OR */
                            /* minimum of quantization range */
                            "min": {
                                "clipping_value": 0, // Threshold for min statistic value clipping (lower bound)
                                "aggregator": "mean",  // Batch aggregation type [mean, max, min, median, mean_no_outliers, median_no_outliers, hl_estimator]
                                "type": "quantile",    // Estimator type [min, max, abs_max, quantile, abs_quantile]
                                "outlier_prob": 0.0001 // Outlier probability: estimator consider samples which
                            },
                            /* maximum of quantization range */
                            "max": {
                                "clipping_value": 6, // Threshold for max statistic value clipping (upper bound)
                                "aggregator": "mean", // Batch aggregation type [mean, max, min, median, mean_no_outliers, median_no_outliers, hl_estimator]
                                "type": "quantile",
                                "outlier_prob": 0.0001
                            }
                        }
                    }
                }
            }
        ]
    }
}

mnet.yml

models:
  - name: mnetv2

    # list of launchers for your topology.
    launchers: 
        # launcher framework (e.g. caffe, dlsdk)
      - framework: dlsdk
        # device for infer (e.g. for dlsdk cpu, gpu, hetero:cpu, gpu ...)
        device: CPU
        # topology IR (*.prototxt for caffe, *.xml for InferenceEngine, etc)
        # path to topology is prefixed with directory, specified in "-m/--models" option
        model: mnetv2.vino.xml
        weights: mnetv2.vino.bin
        # launcher returns raw result, so it should be converted
        # to an appropriate representation with adapter
        adapter: 
          type: regression

    # metrics, preprocessing and postprocessing are typically dataset specific, so dataset field
    # specifies data and all other steps required to validate topology
    # there is typically definitions file, which contains options for common datasets and which is merged
    # during evaluation, but since "sample_dataset" is not used anywhere else, this config contains full definition
    datasets:
      # uniquely distinguishable name for dataset
      # note that all other steps are specific for this dataset only
      # if you need to test topology on multiple datasets, you need to specify
      # every step explicitly for each dataset
      - name: test_dataset
        reader: numpy_reader
        data_source: dummy_data
        annotation_conversion:
          # Converter name which will be called for conversion.
          converter: visage_regression_converter
          annotation_file: dummy_data\outputs.json
          # Converter-specific parameters; these can differ depending on the converter implementation.
          annotation: mnet_v2\regression_converter.pickle

        metrics: 
          - type: mse

With the command pot -c mnet.json I get a quantized model that runs at 10 ms, just like the FP32 model.
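
For reference, the full pipeline on my side looks roughly like this (the INT8 IR ends up in the POT output directory, ./results by default):

# convert the ONNX model to an FP32 IR
python mo.py --input_model mobilenetv2-7.onnx
# quantize the FP32 IR with POT; the INT8 IR is written to the POT output directory
pot -c mnet.json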

However, the model OpenVINO provides with a MobileNet V2 backbone runs at 10 ms for FP32 and at 7 ms for the quantized model. Specifically, the model is available at the following link.

What could I be doing wrong?

Thanks in advance.

Regards, Nikola

dmitryte commented 4 years ago

Hi, @nmVis! What is your version of OpenVINO?

Could you run your model with the example config from OpenVINO? \deployment_tools\tools\post_training_optimization_toolkit\configs\examples\quantization\classification\mobilenetV2_tf_int8_simple_mode.json You only need to put in the paths to your .xml and .bin files.
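
Roughly, that simple-mode config looks like the following sketch (paths are placeholders; it reuses the fields already shown in your config above, with the engine switched to simplified mode and data_source pointing at a folder of sample images):

{
    "model": {
        "model_name": "mobilenet_v2",
        "model": "<path_to>/mnetv2.vino.xml",
        "weights": "<path_to>/mnetv2.vino.bin"
    },
    "engine": {
        "type": "simplified",
        "data_source": "<path_to_folder_with_calibration_images>"
    },
    "compression": {
        "target_device": "CPU",
        "algorithms": [
            {
                "name": "DefaultQuantization",
                "params": {
                    "preset": "performance",
                    "stat_subset_size": 300
                }
            }
        ]
    }
}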

nmVis commented 4 years ago

Hi @dmitryte!

I'm using OpenVINO 2020.4.

I could do that. However, I've already tried using the same XML and .json files (just changing the dataset) to calibrate the OpenVINO FP32 reidentification model I mentioned earlier and the MobileNet model that is also mentioned earlier.

I got a performance improvement for OpenVINO's FP32 reidentification model, while I get no improvement for the MobileNet model, so the XML and JSON files shouldn't be the problem.

AlexKoff88 commented 4 years ago

Hi @dmitryte,

I suggest simplifying the POT config to the following one:

{
    /* Model */
    "model": {
        "model_name": "mnet_v2", // Model name
        "model": "mnetv2.vino.xml", // Path to model (.xml format)
        "weights": "mnetv2.vino.bin" // Path to weights (.bin format)
    },
    /* Parameters of the engine used for model inference. */
    /* Post-Training Optimization Tool supports engine based on accuracy checker and custom engine.
       For custom engine you should specify your own set of parameters.
       The engine based on accuracy checker uses accuracy checker parameters. You can specify the parameters
       via accuracy checker config file or directly in engine section.
       More information about accuracy checker parameters can be found here:
       https://github.com/opencv/open_model_zoo/tree/master/tools/accuracy_checker */
    "engine": {
        //"type": "simplified", // OR default value "type": "accuracy_checker" for non simplified mode
        "type": "accuracy_checker",
        // you can specify path to directory with images
        // also you can specify template for file names to filter images to load
        // templates are unix style (This option valid only in simplified mode)
        //"data_source": "D:/temp/BS/Data/BS/test/originals",
        "config": "mnet_v2.yml",
    },
    /* Optimization hyperparameters */
    "compression": {
        "target_device": "CPU", // target device, the specificity of which will be taken into account during optimization
        "algorithms": [
            {
                "name": "DefaultQuantization", // optimization algorithm name
                "params": {
                    /* A preset is a collection of optimization algorithm parameters that tells the algorithm
                    which metric it should concentrate on improving. Each optimization algorithm supports
                    [performance, accuracy] presets */
                    "preset": "mixed",
                    "stat_subset_size": 200 // Size of subset to calculate activations statistics that can be used
                                             // for quantization parameters calculation.
                }
            }
        ]
    }
}

BTW, what CPU model do you use?

nmVis commented 4 years ago

Hi @AlexKoff88 ,

The CPU I'm using is Intel i7-8750H CPU @ 2.20GHz.

nmVis commented 4 years ago

@AlexKoff88 do you have any news regarding this one?

AlexKoff88 commented 4 years ago

@nmVis, have you tried the simplified configuration file from above?

I would even suggest using the "performance" preset instead of "mixed", like here:

{
    "model": {
        "model_name": "mobilenetv2",
        "model": "mnetv2.vino.xml",
        "weights": "mnetv2.vino.bin"
    },
    "engine": {
        "config": "mnet_v2.yml"
    },
    "compression": {
        "target_device": "CPU",
        "algorithms": [
            {
                "name": "DefaultQuantization",
                "params": {
                    "preset": "performance",
                    "stat_subset_size": 300
                }
            }
        ]
    }
}

nmVis commented 4 years ago

Thanks for the answer, @AlexKoff88. I'll try it first thing tomorrow morning and let you know the results.

nmVis commented 4 years ago

@AlexKoff88 I'm sorry to be the bearer of bad news, but this one didn't show any speed improvements.

Any other ideas?

AlexKoff88 commented 4 years ago

Then we should look at it on our side. @nmVis, can you please provide the MO command that you used to convert the ONNX model and get the OpenVINO IR?

nmVis commented 4 years ago

The command I've used is: python mo.py --input_model mobilenetv2-7.onnx

arfangeta commented 4 years ago

@nmVis, how do you measure performance? Using the OpenVINO benchmark tool?

nmVis commented 4 years ago

@arfangeta hi!

No, I measure it using Google Benchmark, timing only the inference of the network.

arfangeta commented 4 years ago

@nmVis OpenVINO has a tool for measuring inference speed: the benchmark tool (https://docs.openvinotoolkit.org/2020.4/_inference_engine_tools_benchmark_tool_README.html).
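
For example, an invocation along these lines (paths are placeholders; the default run uses the asynchronous API, and -api sync switches to synchronous single-request inference):

# default run (asynchronous API)
python benchmark_app.py -m mnetv2.vino.xml -d CPU
# latency-oriented synchronous mode
python benchmark_app.py -m mnetv2.vino.xml -d CPU -api sync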

I took the nasnet-mobile model from the official ONNX repository for classification (your first link is attached) and applied the default quantization. I got the following results: before quantization 1174.17 FPS, after quantization 3214.69 FPS (about a 2.7x speedup). Please share the code using Google Benchmark that reproduces your results.

nmVis commented 4 years ago

nasnet-mobile is not the same as the model that I've provided the link for. I don't get what you wanted to accomplish with it. Could you explain?

dmitryte commented 4 years ago

Hi @nmVis!

I checked the provided ONNX model with the sample POT configuration (simple mode) from the messages above and see the following results:

FP32 Count: 125450 iterations Duration: 60009.38 ms Latency: 4.74 ms Throughput: 2090.51 FPS

INT8 Count: 186890 iterations Duration: 60004.92 ms Latency: 3.11 ms Throughput: 3114.58 FPS

nmVis commented 4 years ago

Hi @dmitryte !

Interesting. I'm getting the following results with the sample POT configuration.

Async version:

FP32 Count: 15972 iterations Duration: 60014.64 ms Latency: 13.71 ms Throughput: 266.14 FPS

INT8 Count: 24380 iterations Duration: 60019.22 ms Latency: 9.43 ms Throughput: 406.20 FPS

Sync version:

FP32 Count: 7240 iterations Duration: 60006.54 ms Latency: 8.07 ms Throughput: 123.91 FPS

INT8 Count: 7470 iterations Duration: 60000.28 ms Latency: 7.65 ms Throughput: 130.67 FPS

As we can observe, the async version shows a significant speedup from quantization, but it isn't of much interest to us since the synchronous mode is faster in terms of latency. Also, the difference in performance between synchronous FP32 and INT8 is negligible.

What mode did you run your tests in (the -api flag of benchmark_app.py)?

nmVis commented 4 years ago

@dmitryte Hi! Any updates or comments?

dmitryte commented 4 years ago

Hi, @nmVis

I got the results above using the default execution of benchmark_app with the async API.

Increased latency is expected with the async API, but you also get more FPS. Sync mode is more suitable for real-time apps, since latency is critical in that case.

You can also tweak the number of threads and infer requests with benchmark_app. The defaults are usually optimal for most cases, but we still recommend playing with these numbers and adapting them to your specific case. You can check the following guide on performance optimization: https://docs.openvinotoolkit.org/latest/openvino_docs_optimization_guide_dldt_optimization_guide.html
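
For example, something like the following (these are standard benchmark_app options; the values are only placeholders to illustrate, not recommendations):

# limit CPU threads and the number of parallel infer requests
python benchmark_app.py -m mnetv2.vino.xml -d CPU -api async -nthreads 6 -nireq 4
# sweep -nstreams as well to trade latency against throughput
python benchmark_app.py -m mnetv2.vino.xml -d CPU -api async -nstreams 2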

nmVis commented 4 years ago

Hi @dmitryte,

since we have a real-time app on our end, async mode isn't interesting to us. Do you have any idea what could cause the lack of a performance increase for the INT8 model in sync mode?

I'm expecting the INT8 model to be faster than the FP32 variant for the same number of threads. What could cause it not to behave like that?

dmitryte commented 4 years ago

Hi @nmVis

I ran the benchmark one more time with your model and got the following results:

For INT8 I get a 3x speedup and lower latency.

Could you check your measurements one more time?

FP32 - SYNC Count: 13329 iterations Duration: 60003.86 ms Latency: 4.63 ms Throughput: 215.79 FPS

INT8 - SYNC Count: 41696 iterations Duration: 60000.94 ms Latency: 1.42 ms Throughput: 703.21 FPS

nmVis commented 4 years ago

@dmitryte I've checked my measurements one more time and they're similar. Could it be a CPU-related issue? What CPU do you test on? Maybe some of my colleagues have the same one, so I could check whether that's the case on their PC. Mine is an i7-8750H, if that helps in some way.

dmitryte commented 4 years ago

Hmm, you should still get some improvement on 8th gen, because I've got an 8th-gen i5 in my laptop and can see the speedup. Though it's lower than in the benchmarks I posted last week, due to unsupported AVX-512 instructions.

Starting with 10th gen you get support for DL Boost, which makes INT8 models run even faster. The same is supported on server hardware.

We can move to PM and discuss the specifics of your real-time app if you don't mind.

nmVis commented 4 years ago

Thanks man. Let's go to the PM.