neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

yolo-v8 in onnx-runtime outperforms deepsparse on iMX8 #1532

Closed Fritskee closed 5 months ago

Fritskee commented 8 months ago

Describe the bug I downloaded and tested the yolov8-s-coco-pruned70_quantized model from the SparseZoo. When I simply run inference on the ONNX model with onnx-runtime, I get an average of 1.92 seconds (over 100 runs). When I do the same experiment with DeepSparse, I get an average of 2.08 seconds (over 100 runs).

Expected behavior The ONNX model provided in the SparseZoo should outperform the onnx-runtime inference times when used with DeepSparse; however, I am experiencing the opposite.

Environment Include all relevant environment information:

  1. OS: Yocto (Linux)
  2. Python version: 3.10
  3. DeepSparse version: 1.6.1
  4. ML framework version(s) [e.g. torch 1.7.1]: N/A
  5. Other Python package versions [e.g. SparseML, Sparsify, numpy, ONNX]:
    • numpy: 1.26.3
    • onnx: 1.14.1

To Reproduce Script that I use to run the ONNX model with onnx-runtime:

import numpy as np
import onnxruntime as ort
import time

# Load the ONNX model
model_path = '/tmp/model.onnx'
session = ort.InferenceSession(model_path)

# Define input and output names for the model
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

times = []
for i in range(100):
    # Generate a random uint8 RGB image (np.random.rand(...).astype(np.uint8) would round everything to zero)
    random_image = np.random.randint(0, 256, size=(1, 3, 640, 640), dtype=np.uint8)

    # Perform inference
    start_time = time.time()
    result = session.run([output_name], {input_name: random_image})
    end_time = time.time()

    inference_time = end_time - start_time
    times.append(inference_time)
    print(f"Inference {i+1}: {inference_time:.6f} seconds")

average_time = sum(times) / len(times)
print(f"Average inference time: {average_time:.6f} seconds")

Script that I use to run with DeepSparse:

import numpy as np
from deepsparse import compile_model
import time

# Path to your ONNX model
model_path = '/tmp/model.onnx' 

# Compile the model for inference with DeepSparse
compiled_model = compile_model(model_path, batch_size=1)

num_inferences = 100
inference_times = []

for i in range(num_inferences):
    # Generate a new random uint8 image for each inference
    random_image = np.random.randint(0, 256, size=(1, 3, 640, 640), dtype=np.uint8)

    # Start time
    start_time = time.time()

    # Run inference - wrap the random_image in a list
    outputs = compiled_model.run([random_image])

    # End time
    end_time = time.time()

    # Duration of the inference
    duration = end_time - start_time
    inference_times.append(duration)

    # Print the duration of the current inference
    print(f"Inference {i + 1}: {duration:.4f} seconds")

# Calculate and print the average inference time
average_time = sum(inference_times) / len(inference_times)
print(f"\nAverage inference time over {num_inferences} inferences: {average_time:.4f} seconds")

Additional context It is unclear to me why I am not seeing any speed-up from DeepSparse when running inference with a model provided in the SparseZoo. I did not do any custom training; I took the model from the zoo as-is and ran inference with the two scripts above.

Can anybody point me in the right direction on how to fix this, or clarify whether this is normal behaviour?

Tech spec of the CPU of the edge device that I'm testing on:

  • Processor: i.MX 8M Plus Quad
  • Architecture: ARM Cortex-A53 / Cortex-M7
  • Frequency: 4x 1.8 GHz (A53), 800 MHz (M7)
  • SPI NOR Flash: 64 MB
  • eMMC: 8 GB eMMC 5.1
  • LPDDR4 RAM: 2 GB
  • EEPROM: 4 kB

mgoin commented 8 months ago

Hey @Fritskee, I believe the ARM Cortex-A53 CPU you're using is at the bottom end of, or even outside, the range of ARM CPUs we support. Officially we've targeted CPUs with ARM v8.2+ based ISAs. It might be the case that it is simply falling back to our naive backend, although there should be a warning for that. Could you also try running with a batch size of 16 or 64 to see if that makes a difference?
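
For reference, a minimal sketch of the batched variant of the DeepSparse script above (batch size 64 shown; 16 works the same way, and the model path is assumed to be the same as in the original scripts):

import numpy as np
from deepsparse import compile_model

model_path = '/tmp/model.onnx'
batch_size = 64  # or 16

# Compile the engine for the larger batch size
engine = compile_model(model_path, batch_size=batch_size)

# One batched random uint8 input; per-image latency = total time / batch_size
batch = np.random.randint(0, 256, size=(batch_size, 3, 640, 640), dtype=np.uint8)
outputs = engine.run([batch])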

Fritskee commented 8 months ago

Hi Michael, the batch inference doesn't seem to make a big difference either. Another thing I noticed is that when I run inference on the same model as mentioned above (YOLOv8-s pruned at 70% and quantized to uint8), inference in onnx-runtime is slower than when I just run the baseline fp32 model in onnx-runtime. Just out of curiosity, could you share some insight into why this is? (I also tested it on my Mac M2 and the results hold there as well.)
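
One way to narrow this down is onnxruntime's built-in profiler, which writes a JSON trace of per-operator timings; comparing the traces of the quantized and fp32 models should show whether the extra time goes to the quantize/dequantize ops or to the int8 convolutions themselves. A minimal sketch, reusing the same model path as above:

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True  # write a JSON trace of per-operator timings

session = ort.InferenceSession('/tmp/model.onnx', sess_options=so)
input_name = session.get_inputs()[0].name

# uint8 input for the quantized model; use float32 for the fp32 baseline
image = np.random.randint(0, 256, size=(1, 3, 640, 640), dtype=np.uint8)
for _ in range(10):
    session.run(None, {input_name: image})

# Prints the path of the JSON profile (viewable in chrome://tracing)
print(session.end_profiling())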

mgoin commented 5 months ago

Hey @Fritskee, I'm really not sure about this hardware, and the CPU barely has the supported operations we need to run with DeepSparse. Unfortunately, I would consider this out of scope for optimization. Thanks for reporting the issue.

yoloyash commented 3 months ago

Hi @mgoin, it seems like this is an issue with SparseML, because I have tried G4/G5/G6/P3 instances on EC2 (CPU and GPU) and both backends (deepsparse, onnxruntime). I'm not sure why, but every time the uint8 model has been much slower than the fp32 models. I'm not sure if I'm missing any post-processing on the ONNX model, because I'm following the exact same recipes. Please help.

Fritskee commented 3 months ago

It probably has to do with the CPU you were using. As he explains in the post above, you need ARM v8.2+ for the DeepSparse runtime to work well. It's highly likely that you're using a CPU that uses x86 instructions, which would also be why your ONNX model is faster.

Double check that you're using an ARM-based processor with the v8.2 or higher instruction set.
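
If it helps, one way to confirm what DeepSparse actually detects on the machine is its CPU-introspection helper (assuming deepsparse.cpu.cpu_architecture is available in the installed version), alongside the OS-level architecture and feature flags. A minimal sketch:

import platform

# What DeepSparse itself detects on this machine (helper location may vary between versions)
from deepsparse.cpu import cpu_architecture
print(cpu_architecture())

# Generic fallback: machine type and CPU feature flags from the OS
print(platform.machine())  # e.g. 'aarch64' vs 'x86_64'
with open('/proc/cpuinfo') as f:
    flags = [line.strip() for line in f if line.startswith(('Features', 'flags'))]
print(flags[:1])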

yoloyash commented 3 months ago

@Fritskee Hey, thanks for the quick reply! That's true, but that shouldn't be the case for onnxruntime, right? Or am I wrong?