google-coral / pycoral

Python API for ML inferencing and transfer-learning on Coral devices
https://coral.ai
Apache License 2.0

slow inference on ARM64 device #46

Closed noamholz closed 3 years ago

noamholz commented 3 years ago

Description

Hi there, I am using a Rockchip RK3399 (64-bit CPUs: dual Cortex-A72 + quad Cortex-A53; USB 3.0) with Ubuntu 18.04.1 LTS and Python 3.6.8. When running examples/detect_image.py with the model efficientdet_lite2_448_ptq_edgetpu.tflite, inference takes ~330 ms, while I expected ~100 ms (from the published benchmark). Is my expectation realistic, or are there issues in my setup that I'm not aware of?
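
For context, a minimal sketch of the timing loop, along the lines of what examples/detect_image.py measures (the model and image paths are placeholders):

import time

from PIL import Image
from pycoral.adapters import common, detect
from pycoral.utils.edgetpu import make_interpreter

# Placeholder paths; substitute your local model and image.
interpreter = make_interpreter('efficientdet_lite2_448_ptq_edgetpu.tflite')
interpreter.allocate_tensors()

# Resize the image to the model's input size and copy it into the input tensor.
image = Image.open('grace_hopper.bmp').convert('RGB')
image = image.resize(common.input_size(interpreter), Image.LANCZOS)
common.set_input(interpreter, image)

for _ in range(5):
    start = time.perf_counter()
    interpreter.invoke()
    objs = detect.get_objects(interpreter, score_threshold=0.4)
    print('%.2f ms' % ((time.perf_counter() - start) * 1000))

for obj in objs:
    print(obj.id, obj.score, obj.bbox)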

Here's what I've done so far:

Finally, I tried benchmarks/inference_benchmarks.py and got:

******************** Check results *********************
 * Unexpected high latency! [inception_v1_224_quant_edgetpu.tflite]
   Inference time: 6.283602199999905 ms  Reference time: 4.0 ms
 * Unexpected high latency! [mobilenet_v1_1.0_224_quant_edgetpu.tflite]
   Inference time: 4.973522705000164 ms  Reference time: 2.22 ms
 * Unexpected high latency! [mobilenet_v2_1.0_224_quant_edgetpu.tflite]
   Inference time: 5.482743665000385 ms  Reference time: 2.56 ms
 * Unexpected high latency! [ssd_mobilenet_v2_face_quant_postprocess_edgetpu.tflite]
   Inference time: 10.197461029999886 ms  Reference time: 7.78 ms
******************** Check finished! *******************

I would appreciate any help, Thanks!

Issue Type: Performance
Operating System: Ubuntu
Coral Device: USB Accelerator
Other Devices: No response
Programming Language: Python 3.6
Relevant Log Output: No response

hjonnala commented 3 years ago

Hello @noamholz, could you please share the link to the published benchmarks?

With the USB Accelerator, inference speeds can differ based on your host system and whether you're using USB 2.0 or 3.0.
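
One way to confirm the negotiated bus speed from the host (a sketch that assumes Linux's sysfs layout and the accelerator's usual IDs: 1a6e:089a before the runtime initializes it, 18d1:9302 after):

import glob, os

# Scan USB devices for the Coral USB Accelerator and print its negotiated
# speed: 5000 Mb/s means USB 3.0 (SuperSpeed), 480 Mb/s means USB 2.0.
CORAL_IDS = {('1a6e', '089a'), ('18d1', '9302')}
for dev in glob.glob('/sys/bus/usb/devices/*'):
    try:
        vid = open(os.path.join(dev, 'idVendor')).read().strip()
        pid = open(os.path.join(dev, 'idProduct')).read().strip()
    except OSError:
        continue
    if (vid, pid) in CORAL_IDS:
        speed = open(os.path.join(dev, 'speed')).read().strip()
        print('%s -> %s Mb/s' % (dev, speed))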

Here are the results on my Linux machine:

With libedgetpu1-max

 python3 examples/detect_image.py   --model /home/Desktop/issues/pycoral_46/efficientdet_lite2_448_ptq_edgetpu.tflite   --labels test_data/coco_labels.txt   --input test_data/grace_hopper.bmp   --output ${HOME}/grace_hopper_processed.bmp
----INFERENCE TIME----
Note: The first inference is slow because it includes loading the model into Edge TPU memory.
128.92 ms
101.32 ms
95.12 ms
95.44 ms
107.40 ms
-------RESULTS--------
person
  id:     0
  score:  0.9609375
  bbox:   BBox(xmin=3, ymin=22, xmax=510, ymax=600)

With libedgetpu1-std

 python3 examples/detect_image.py   --model /home/Desktop/issues/pycoral_46/efficientdet_lite2_448_ptq_edgetpu.tflite   --labels test_data/coco_labels.txt   --input test_data/grace_hopper.bmp   --output ${HOME}/grace_hopper_processed.bmp
----INFERENCE TIME----
Note: The first inference is slow because it includes loading the model into Edge TPU memory.
147.90 ms
128.71 ms
120.05 ms
120.40 ms
127.54 ms
-------RESULTS--------
person
  id:     0
  score:  0.9609375
  bbox:   BBox(xmin=3, ymin=22, xmax=510, ymax=600)
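
As the note above says, the first pass includes loading the model into Edge TPU memory, so it's worth discarding it when averaging. A minimal sketch (model path is a placeholder):

import statistics
import time

from pycoral.utils.edgetpu import make_interpreter

interpreter = make_interpreter('efficientdet_lite2_448_ptq_edgetpu.tflite')
interpreter.allocate_tensors()
interpreter.invoke()  # warm-up pass: loads the model into Edge TPU memory

times_ms = []
for _ in range(10):
    start = time.perf_counter()
    interpreter.invoke()
    times_ms.append((time.perf_counter() - start) * 1000)
print('mean %.2f ms, stdev %.2f ms' %
      (statistics.mean(times_ms), statistics.stdev(times_ms)))
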
noamholz commented 3 years ago

Thanks @hjonnala for your reply,

USB accelerator inference speeds might differ based on your host system and whether you're using USB 2.0 or 3.0

My system is indeed using USB 3.0, and its CPUs are a dual Cortex-A72 + quad Cortex-A53.

Here's how I relate my case to the published benchmark: I expect my system to be comparable to the Coral Dev Board, since they seem to have similar computing power. But while the Dev Board is on par with "Desktop CPU + USB Accelerator" (e.g., 2.6 ms inference time on mobilenet_v2), my system is much slower (5.5 ms on mobilenet_v2).

An important difference between my system and the Dev Board is that the latter connects over PCIe; would that explain most of the performance difference? Would you recommend any tests to identify the bottleneck?

Thanks again!

hjonnala commented 3 years ago

Operating frequency also affects inference time. Can you try installing the Edge TPU runtime with the maximum operating frequency? (sudo apt-get install libedgetpu1-max)
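
Note that libedgetpu1-max replaces libedgetpu1-std (the two packages conflict), and the USB Accelerator needs to be unplugged and replugged after switching. A quick sanity check that the device is still enumerated afterwards, using pycoral's list_edge_tpus:

from pycoral.utils.edgetpu import list_edge_tpus

# Prints one entry per detected Edge TPU (e.g. a dict with 'type': 'usb'
# for the USB Accelerator); an empty list means the runtime doesn't see
# the device.
print(list_edge_tpus())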

On my Linux machine (x86_64), with the standard operating frequency:

******************** Check results *********************
 * Unexpected high latency! [inception_v1_224_quant_edgetpu.tflite]
   Inference time: 5.300638125045225 ms  Reference time: 3.06 ms
 * Unexpected high latency! [mobilenet_v1_1.0_224_quant_edgetpu.tflite]
   Inference time: 3.9057857799343765 ms  Reference time: 2.17 ms
 * Unexpected high latency! [mobilenet_v2_1.0_224_quant_edgetpu.tflite]
   Inference time: 4.219176759943366 ms  Reference time: 2.29 ms
 * Unexpected high latency! [ssd_mobilenet_v2_face_quant_postprocess_edgetpu.tflite]
   Inference time: 8.32953658507904 ms  Reference time: 5.36 ms
******************** Check finished! *******************

With the maximum operating frequency:

******************** Check results *********************
 * Unexpected low latency! [ssd_mobilenet_v1_coco_quant_postprocess_edgetpu.tflite]
   Inference time: 6.677852355060168 ms  Reference time: 10.02 ms
******************** Check finished! *******************
noamholz commented 3 years ago

Thanks @hjonnala. Using libedgetpu1-max certainly improves the speeds.

******************** Check results *********************
 * Unexpected high latency! [mobilenet_v1_1.0_224_quant_edgetpu.tflite]
   Inference time: 3.5727628600000116 ms  Reference time: 2.22 ms
 * Unexpected high latency! [mobilenet_v2_1.0_224_quant_edgetpu.tflite]
   Inference time: 3.8395439999999326 ms  Reference time: 2.56 ms
******************** Check finished! *******************

But for the efficientdet_lite2_448_ptq_edgetpu.tflite model, run via examples/detect_image.py, inference only goes down from ~330 ms to ~280 ms, while the benchmark says ~100 ms, so there's still a significant gap of more than 2x. By the way, does the published benchmark assume libedgetpu1-std or -max?

In any case, I want to avoid the maximum frequency in my application because of overheating. So I'm still trying to understand which differences between my Rockchip RK3399 and the Coral Dev Board best explain the performance gap. Currently I assume it's PCIe (Dev Board) vs. USB 3.1 (RK3399), unless there are other important factors I'm missing?

Thanks

hjonnala commented 3 years ago

Hi @noamholz, the published benchmarks assume libedgetpu1-max (since it's a maximum-performance test, we use the maximum frequency).

The benchmarks for the efficientdet_lite2_448_ptq_edgetpu.tflite model are measured with a Coral USB Accelerator on a desktop CPU (a single 64-bit Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz). Since the desktop Xeon and the Rockchip RK3399 are very different CPU architectures, the host-side work can account for the significant difference you're seeing.
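
If you want to separate host-side work from Edge TPU work, one test (a sketch; the model filenames are the test_data names used above) is to time invoke() alone for a model that compiles fully to the Edge TPU, such as mobilenet_v2, and for efficientdet_lite2, which keeps some ops on the CPU. If the former stays near the reference number while the latter is far off, the remaining gap is dominated by ops running on your host CPU:

import time

from pycoral.utils.edgetpu import make_interpreter

def mean_invoke_ms(model_path, runs=20):
    # Zero-filled input tensors are fine for a pure latency measurement.
    interpreter = make_interpreter(model_path)
    interpreter.allocate_tensors()
    interpreter.invoke()  # warm-up: excludes the one-time model load
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) * 1000 / runs

for model in ('mobilenet_v2_1.0_224_quant_edgetpu.tflite',
              'efficientdet_lite2_448_ptq_edgetpu.tflite'):
    print(model, '-> %.2f ms' % mean_invoke_ms(model))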

noamholz commented 3 years ago

Alright, thanks for your answers @hjonnala !

google-coral-bot[bot] commented 3 years ago

Have a few minutes? We'd love your feedback about the Coral developer experience! Take our 5-minute survey.

Are you satisfied with the resolution of your issue? Yes No