neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

Performance Degradation in YOLOv8s Model Exported to ONNX via SparseML's Exporter #2276

Open rsazizov opened 1 month ago

rsazizov commented 1 month ago

Describe the bug

When exporting the YOLOv8s model (pruned50-quant, model.pt from SparseZoo) to ONNX via the exporter (sparseml.ultralytics.export_onnx), its performance is noticeably worse than that of the ONNX model available in SparseZoo.

Expected behavior

Performance of the two ONNX files should be the same, since they are the same model.

Environment

Include all relevant environment information:

  1. OS: Ubuntu 22.04
  2. Python version: 3.9.19
  3. SparseML version or commit hash: sparseml==1.7.0
  4. ML framework version(s): torch==2.1.2
  5. Other Python package versions: deepsparse==1.7.1, sparsezoo==1.7.0, ultralytics==8.0.124
  6. Other relevant environment information: CPU: i9-12900KS

To Reproduce

Exact steps to reproduce the behavior:

Download model.onnx for yolov8s-pruned50-quant from SparseZoo (https://sparsezoo.neuralmagic.com/models/yolov8-s-coco-pruned50_quantized). Benchmark it using deepsparse.benchmark:

> deepsparse.benchmark yolov8s-coco-pruned50_quantized.onnx
2024-05-10 13:56:31 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2_vnni, binary=avx2)
2024-05-10 13:56:31 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
    onnx_file_path: yolov8s-coco-pruned50_quantized.onnx
    batch_size: 1
    num_cores: 8
    num_streams: 1
    scheduler: Scheduler.default
    fraction_of_supported_ops: 1.0
    cpu_avx_type: avx2
    cpu_vnni: True
2024-05-10 13:56:31 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2024-05-10 13:56:31 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: yolov8s-coco-pruned50_quantized.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 87.1154
Latency Mean (ms/batch): 11.4735
Latency Median (ms/batch): 11.4148
Latency Std (ms/batch): 0.2300
Iterations: 872

Notice fraction_of_supported_ops: 1.0 and Throughput (items/sec): 87.1154.
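
If it is more convenient, roughly the same measurement can be reproduced from Python. This is a minimal sketch, assuming the deepsparse.compile_model helper and its Engine call signature, and the uint8 'images' input reported by the benchmark above:

import time
import numpy as np
from deepsparse import compile_model

# Compile the SparseZoo ONNX file with the DeepSparse engine (batch size 1, as above).
engine = compile_model("yolov8s-coco-pruned50_quantized.onnx", batch_size=1)

# Random uint8 input matching the 'images' input reported by deepsparse.benchmark.
images = np.random.randint(0, 255, size=(1, 3, 640, 640), dtype=np.uint8)

# Warm up, then time a fixed number of iterations.
# engine(...) takes a list of numpy arrays (assumed call signature).
for _ in range(10):
    engine([images])

iters = 200
start = time.perf_counter()
for _ in range(iters):
    engine([images])
elapsed = time.perf_counter() - start
print(f"items/sec: {iters / elapsed:.2f}")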

Now download model.pt from the same page and export it to ONNX using the provided tool:

> sparseml.ultralytics.export_onnx --model yolov8s-coco-pruned50_quantized.pt

                   from  n    params  module                                       arguments                     
  0                  -1  1       928  ultralytics.nn.modules.conv.Conv             [3, 32, 3, 2]                 
  1                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  2                  -1  1     29056  ultralytics.nn.modules.block.C2f             [64, 64, 1, True]             
  3                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  4                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]           
  5                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  6                  -1  2    788480  ultralytics.nn.modules.block.C2f             [256, 256, 2, True]           
  7                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  8                  -1  1   1838080  ultralytics.nn.modules.block.C2f             [512, 512, 1, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  1    591360  ultralytics.nn.modules.block.C2f             [768, 256, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]                 
 16                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]                 
 19                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1   1969152  ultralytics.nn.modules.block.C2f             [768, 512, 1]                 
 22        [15, 18, 21]  1   2147008  ultralytics.nn.modules.head.Detect           [80, [128, 256, 512]]         
Model summary: 225 layers, 11166560 parameters, 11166544 gradients

Applying structure from sparseml checkpoint at epoch -1
2024-05-10 13:58:11 sparseml.pytorch.utils.logger INFO     Logging all SparseML modifier-level logs to sparse_logs/10-05-2024_13.58.11.log
Loaded previous weights from checkpoint
Source: 'sparseml' detected; Exporting model from SparseML checkpoint...
/home/user/anaconda3/envs/sparse_issue_env/lib/python3.9/site-packages/torch/onnx/utils.py:823: UserWarning: It is recommended that constant folding be turned off ('do_constant_folding=False') when exporting the model in training-amenable mode, i.e. with 'training=TrainingMode.TRAIN' or 'training=TrainingMode.PRESERVE' (when model is in training mode). Otherwise, some learnable model parameters may not translate correctly in the exported ONNX model because constant folding mutates model parameters. Please consider turning off constant folding or setting the training=TrainingMode.EVAL.
  warnings.warn(
/home/user/anaconda3/envs/sparse_issue_env/lib/python3.9/site-packages/ultralytics/nn/modules/head.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  elif self.dynamic or self.shape != shape:
2024-05-10 13:58:15 sparseml.exporters.transforms.onnx_transform INFO     [ConstantsToInitializers] Transformed 92 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [FoldIdentityInitializers] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [InitializersToUint8] Transformed 54 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [FlattenQParams] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [FoldConvDivBn] Transformed 57 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [DeleteRepeatedQdq] Transformed 2 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [QuantizeQATEmbedding] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [PropagateEmbeddingQuantization] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [PropagateDequantThroughSplit] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [MatMulAddToMatMulIntegerAddCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [MatMulToMatMulIntegerCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [FoldReLUQuants] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [ConvToConvIntegerAddCastMul] Transformed 55 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [GemmToQLinearMatMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [GemmToMatMulIntegerAddCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [QuantizeResiduals] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO     [RemoveDuplicateQConvWeights] Transformed 0 matches
2024-05-10 13:58:17 sparseml.exporters.transforms.onnx_transform INFO     [RemoveDuplicateQuantizeOps] Transformed 0 matches
2024-05-10 13:58:17 sparseml.pytorch.sparsification.quantization.quantize_qat_export INFO     Model initial QuantizeLinear node(s) deleted and inputs set to uint8
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO     Created deployment folder at /home/user/Desktop/projects/sparse/issue/exported/deployment
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO     Saved model.onnx in the deployment folder at /home/user/Desktop/projects/sparse/issue/exported/deployment/model.onnx
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO     Created config.json file at /home/user/Desktop/projects/sparse/issue/exported/deployment
Recipe checkpoint detected, saving the recipe to the deployment directory /home/user/Desktop/projects/sparse/issue/exported/deployment

The conversion succeeds. Now benchmark the exported ONNX model:

> deepsparse.benchmark exported/model.onnx 
2024-05-10 13:59:27 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2_vnni, binary=avx2)
2024-05-10 13:59:27 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
    onnx_file_path: exported/model.onnx
    batch_size: 1
    num_cores: 8
    num_streams: 1
    scheduler: Scheduler.default
    fraction_of_supported_ops: 0.0
    cpu_avx_type: avx2
    cpu_vnni: True
2024-05-10 13:59:27 deepsparse.utils.onnx INFO     Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2024-05-10 13:59:27 deepsparse.benchmark.benchmark_model INFO     Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: exported/model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 20.2886
Latency Mean (ms/batch): 49.2855
Latency Median (ms/batch): 49.0293
Latency Std (ms/batch): 2.1290
Iterations: 203

Notice fraction_of_supported_ops: 0.0 and Throughput (items/sec): 20.2886.

Throughput dropped from ~87 items/sec down to ~20 items/sec for the same model.
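
The difference between the two graphs can also be inspected directly with the onnx package. A minimal sketch that compares operator counts (file names are the ones used in the steps above):

from collections import Counter
import onnx

def op_histogram(path):
    # Count how many nodes of each op type the graph contains.
    model = onnx.load(path)
    return Counter(node.op_type for node in model.graph.node)

zoo = op_histogram("yolov8s-coco-pruned50_quantized.onnx")
exported = op_histogram("exported/model.onnx")

# Print only the ops whose counts differ between the two graphs.
for op in sorted(set(zoo) | set(exported)):
    if zoo[op] != exported[op]:
        print(f"{op}: zoo={zoo[op]} exported={exported[op]}")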

mgoin commented 1 month ago

Model exported: https://drive.google.com/file/d/1ZDlRd6c1X05lrnxRThUo8FxuapS5Kgm7/view?usp=sharing

You can see that this style of Conv is not being folded to a ConvInteger correctly - @bfineran

(Screenshot of the exported graph showing the Conv pattern that was not folded to ConvInteger)
bfineran commented 1 month ago

@mgoin we'll need to take a look at the recipe and its application. ConvInteger requires two quantized inputs to the Conv (weight and activation); here we see a quantized weight input, while only the output is quantized (although that may be the input quantization of the following layer).
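
A rough way to locate the Convs in question, just a sketch with the onnx package rather than the exporter's own matching logic, is to print where each remaining Conv node gets its activation and weight inputs from:

import onnx

model = onnx.load("exported/model.onnx")
graph = model.graph

# Map every tensor name to the node that produces it.
producers = {out: node for node in graph.node for out in node.output}
initializers = {init.name for init in graph.initializer}

for node in graph.node:
    if node.op_type != "Conv":
        continue
    # For each Conv that survived the export transforms, report the producers of
    # its activation and weight inputs (e.g. DequantizeLinear vs. plain float).
    sources = []
    for name in node.input[:2]:  # activation input, weight input
        if name in producers:
            sources.append(producers[name].op_type)
        elif name in initializers:
            sources.append("initializer")
        else:
            sources.append("graph input")
    print(node.name or node.output[0], sources)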

imAhmadAsghar commented 3 weeks ago

@bfineran Thank you for the great work :)

Wanted to let you know that I am having exactly the same performance degradation as @rsazizov on yolov8n: from Throughput (items/sec): 110.0278 (SparseZoo yolov8n ONNX) down to Throughput (items/sec): 15.5770 after converting the SparseZoo yolov8n .pt model with the SparseML ONNX exporter. Is there any known bug or update on the issue?

bfineran commented 2 weeks ago

Hi @imAhmadAsghar we're aware of the issue and are looking into it internally. It doesn't seem to be a version compatibility issue, but you could potentially try rolling back your sparseml/pytorch versions. The issue seems to be that the model now exports differently at the beginning of the graph (a simple Split node versus a few Slice nodes).
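
If anyone wants to check which variant they have, here is a small sketch with the onnx package that prints the op types at the beginning of each graph (file names assume the repro above):

import onnx

for path in ("yolov8s-coco-pruned50_quantized.onnx", "exported/model.onnx"):
    graph = onnx.load(path).graph
    # Op types of the first nodes in graph order, plus Split/Slice totals.
    head = [node.op_type for node in graph.node[:15]]
    splits = sum(node.op_type == "Split" for node in graph.node)
    slices = sum(node.op_type == "Slice" for node in graph.node)
    print(path)
    print("  first nodes:", head)
    print("  Split:", splits, "Slice:", slices)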

imAhmadAsghar commented 2 weeks ago

@bfineran Thank you for your response.

I actually did not understand the last part of your response, about the Split node versus the few Slice nodes. Can you please explain in detail what you mean by that, if possible? I am not a performance/optimization engineer; I just want to use sparseml/deepsparse to speed up inference on CPU. However, the whole library is inconvenient and super foggy.

I have tested the following, and here are the results.

Performance test between the pruned and the default model:
(plot)
As you can see in the plot above, pruning does nothing.

Performance test between the pruned model and the pruned + quantized model:
(plot)
I just don't get this plot. Nothing makes sense at all. Quantization does not help, and the model gets much slower by a large margin.

Right now I am super confused, and it does not make sense for me to use your library at all. I think I am lacking a lot of information about the whole process. Can you please point me to a proper reference on where to start? The one provided on the homepage is not leading me anywhere, as you can see from the results.

I would really love to get it running and achieve the results you promised.

yoloyash commented 6 days ago

@imAhmadAsghar Hi, were you able to find a fix for this? What is going wrong with the exports?

imAhmadAsghar commented 3 days ago

@yoloyash Hi, no, unfortunately I could not.