Closed Fritskee closed 5 months ago
Hey @Fritskee, I believe the ARM Cortex-A53 CPU you're using is at the bottom end of, or even below, the range of ARM CPUs we support. Officially we target ARMv8.2+ ISAs. It may be that it is simply falling back to our naive backend, although there should be a warning for that. Could you also try running with a batch size of 16 or 64 to see if that makes a difference?
Hi Michael, batch inference doesn't seem to make a big difference either. Another thing I noticed: when I run the same model as mentioned above (YOLOv8-s pruned at 70% and quantized to uint8) in onnx-runtime, inference is slower than when I just run the baseline fp32 model in onnx-runtime. Out of curiosity, could you share some insight into why this is? (I also tested on my Mac M2 and the results hold there too.)
Hey @Fritskee, I'm really not sure about this hardware, and the CPU barely has the supported operations we need to run with deepsparse. I would consider this out of scope for optimization potential, unfortunately. Thanks for reporting the issue.
Hi @mgoin, it seems like this is an issue with SparseML, because I have tried G4/G5/G6/P3 instances on EC2 (CPU and GPU) and both backends (deepsparse, onnxruntime). I'm not sure why, but every time the uint8 model has been much slower than the fp32 model. Not sure if I'm missing some post-processing of the onnx model, because I'm following the exact same recipes. Please help.
It probably has to do with the CPU you were using. As he explains in a post above, you need ARMv8.2+ for the deepsparse runtime to work well. It's highly likely that you're using a CPU with x86 instructions. That would also explain why your onnx model is faster.
Double check that you're using an ARM-based processor with a v8.2 or higher instruction set.
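A quick way to check this from Python (a sketch, not from the thread): `platform.machine()` tells you whether you're on ARM or x86 at all, and on ARM Linux the `asimddp` feature flag (the ARMv8.2 dot-product extension) is a reasonable proxy for a v8.2+ core. The Cortex-A53 is an ARMv8.0 design, so it will not list that flag.

```python
import platform

machine = platform.machine()
print("machine:", machine)  # 'aarch64'/'arm64' on 64-bit ARM, 'x86_64' on Intel/AMD

# On ARM Linux, the 'asimddp' flag in /proc/cpuinfo indicates the ARMv8.2
# SDOT/UDOT dot-product instructions; its absence suggests a v8.0 core
# such as the Cortex-A53.
if machine in ("aarch64", "arm64"):
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        print("ARMv8.2+ dot product:", "asimddp" in cpuinfo)
    except OSError:
        pass  # /proc/cpuinfo is Linux-only (e.g. not on macOS)
```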
@Fritskee Hey, thanks for the quick reply! That's true, but that shouldn't be the case for onnxruntime right? Or am I wrong?
Describe the bug
I downloaded and tested the yolov8-s-coco-pruned70_quantized model from the SparseZoo. When I simply infer the onnx model with onnx-runtime, I get an average of 1.92 seconds (over 100 runs). When I do the same experiment with DeepSparse, I get an average of 2.08 seconds (over 100 runs).
Expected behavior
The onnx model provided in the SparseZoo should outperform onnx-runtime's inference times when used with DeepSparse; however, I am experiencing the opposite.
Environment
Include all relevant environment information:
To Reproduce
Script that I use to run the ONNX model:
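The original script was not captured in this thread; below is a minimal sketch of what such a benchmark might look like, assuming a YOLOv8-style 1x3x640x640 float input and a local model path (`model.onnx` and the helper names are illustrative, not from the report).

```python
import time
import numpy as np

def benchmark(run_fn, dummy_input, runs=100):
    """Average wall-clock seconds of run_fn(dummy_input) over `runs` calls."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        run_fn(dummy_input)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

def run_onnxruntime(model_path, runs=100):
    import onnxruntime as ort  # deferred so the timing helper stays importable
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 640, 640).astype(np.float32)
    return benchmark(lambda inp: sess.run(None, {input_name: inp}), x, runs)

# Example: print(f"onnxruntime avg: {run_onnxruntime('model.onnx'):.3f} s")
```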
Script that I use to run with DeepSparse:
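This script was also not captured; a comparable sketch using DeepSparse's `compile_model` API would look roughly like the following (same assumed 640x640 input shape and illustrative model path as above).

```python
import time
import numpy as np

def run_deepsparse(model_path, runs=100, batch_size=1):
    """Average seconds per inference with the DeepSparse engine (sketch)."""
    from deepsparse import compile_model  # deferred import; requires deepsparse

    engine = compile_model(model_path, batch_size=batch_size)
    # Assumed YOLOv8 input shape; adjust to the model's actual input.
    x = np.random.rand(batch_size, 3, 640, 640).astype(np.float32)
    start = time.perf_counter()
    for _ in range(runs):
        engine.run([x])
    return (time.perf_counter() - start) / runs

# Example: print(f"deepsparse avg: {run_deepsparse('model.onnx'):.3f} s")
```

DeepSparse also ships a `deepsparse.benchmark` CLI, which may be a more reliable way to compare the two runtimes than a hand-rolled loop.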
Additional context
It is very unclear to me why I am not seeing any speed-up from DeepSparse when running inference with a model provided in the SparseZoo. I did not do any custom training; I took the model from the zoo as-is and ran inference with the two scripts above.
Can anybody point me in the right direction on how I can fix this, or can you clarify as to whether this is normal behaviour?
Tech spec of the CPU of the edge device that I'm testing on:
Processor: i.MX 8M Plus Quad
Architecture: ARM Cortex-A53 / Cortex-M7
Frequency: 4x 1.8 GHz (A53), 800 MHz (M7)
SPI NOR Flash: 64 MB
eMMC: 8 GB eMMC 5.1
LPDDR4 RAM: 2 GB
EEPROM: 4 kB