quic / qidk


The DSP is slower than the CPU; what could be wrong? #48

Open suhyun01150 opened 5 days ago

suhyun01150 commented 5 days ago

Hello,

I have followed the instructions from the link below for an RB5 SM8550 Gen2 Android board:

https://github.com/quic/qidk/blob/master/Model-Enablement/Model-Accuracy-Mixed-Precision/Accuracy_Analyzer_YoloV8.ipynb

However, I’m reaching out with some questions as the results seem unexpected. My understanding is that the DSP should be faster than the CPU, yet in my tests, the DSP appears to be more than 10 times slower. Here is the code I executed:

    snpe-onnx-to-dlc --input_network "onnx_model/yolov8n_11.onnx" --output_path "dlc/yolov8.dlc"
    snpe-net-run --container \$OUTPUT_DLC_FP32 --input_list list.txt --output_dir \$OUTPUT_FOLDER_FP32 --debug
    snpe-dlc-quantize --input_dlc yolov8.dlc --output_dlc yolov8Q.dlc --input_list quantization_input_list.txt --enable_htp --htp_socs=sm8550
    snpe-net-run --container \$OUTPUT_DLC_QUANTIZED8 --input_list list.txt --output_dir \$OUTPUT_FOLDER --use_dsp --debug

I compared this with the tutorial and do not see any differences in my approach. Is it expected for the DSP to be slower than the CPU?

[image]

quic-vraidu commented 2 days ago

Hello @suhyun01150,

Please share the following details:

  1. Which SNPE version are you using?
  2. How are you calculating the inference time?

suhyun01150 commented 2 days ago

SNPE version: 2.26.2.240911. The code is:

    %%bash
    export DEVICE_SHELL="adb -H $DEVICE_HOST"
    $DEVICE_SHELL shell "
    export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/data/local/tmp/snpeexample/aarch64-android/lib
    export PATH=\$PATH:/data/local/tmp/snpeexample/aarch64-android/bin
    export OUTPUT_FOLDER=OUTPUT_8b_DSP
    export OUTPUT_FOLDER_FP32=OUTPUT_FP32_CPU
    export OUTPUT_FOLDER_FP32_DSP=OUTPUT_FP32_DSP  # new folder for FP32 on DSP
    export OUTPUT_FOLDER_8b_CPU=OUTPUT_8b_CPU  # new folder for INT8 on CPU
    export OUTPUT_DLC_QUANTIZED8=yolov8Q.dlc
    export OUTPUT_DLC_FP32=yolov8.dlc
    export ADSP_LIBRARY_PATH='/data/local/tmp/snpeexample/dsp/lib;/system/lib/rfsa/adsp;/system/vendor/lib/rfsa/adsp;/dsp'
    export ONDEVICE_FOLDER='yolov8_comparision'

    cd /data/local/tmp/\$ONDEVICE_FOLDER

    # Create the new output folders
    mkdir -p \$OUTPUT_FOLDER_FP32_DSP
    mkdir -p \$OUTPUT_FOLDER_8b_CPU

    echo '===== FP32 model, CPU run: start ====='
    START_TIME1=\$(date +%s.%N)
    snpe-net-run --container \$OUTPUT_DLC_FP32 --input_list list.txt --output_dir \$OUTPUT_FOLDER_FP32 --debug
    END_TIME1=\$(date +%s.%N)
    DIFF1=\$(echo \"\$END_TIME1 - \$START_TIME1\" | bc)
    echo \"FP32 CPU run time: \$DIFF1 s\"
    echo '===== FP32 model, CPU run: done ====='
    echo

    echo '===== Quantized8 model, DSP run: start ====='
    START_TIME2=\$(date +%s.%N)
    snpe-net-run --container \$OUTPUT_DLC_QUANTIZED8 --input_list list.txt --output_dir \$OUTPUT_FOLDER --use_dsp --debug
    END_TIME2=\$(date +%s.%N)
    DIFF2=\$(echo \"\$END_TIME2 - \$START_TIME2\" | bc)
    echo \"Quantized8 DSP run time: \$DIFF2 s\"
    echo '===== Quantized8 model, DSP run: done ====='
    echo

    echo '===== FP32 model, DSP run: start ====='
    START_TIME3=\$(date +%s.%N)
    snpe-net-run --container \$OUTPUT_DLC_FP32 --input_list list.txt --output_dir \$OUTPUT_FOLDER_FP32_DSP --use_dsp --debug
    END_TIME3=\$(date +%s.%N)
    DIFF3=\$(echo \"\$END_TIME3 - \$START_TIME3\" | bc)
    echo \"FP32 DSP run time: \$DIFF3 s\"
    echo '===== FP32 model, DSP run: done ====='
    echo

    echo '===== Quantized8 model, CPU run: start ====='
    START_TIME4=\$(date +%s.%N)
    snpe-net-run --container \$OUTPUT_DLC_QUANTIZED8 --input_list list.txt --output_dir \$OUTPUT_FOLDER_8b_CPU --debug
    END_TIME4=\$(date +%s.%N)
    DIFF4=\$(echo \"\$END_TIME4 - \$START_TIME4\" | bc)
    echo \"Quantized8 CPU run time: \$DIFF4 s\"
    echo '===== Quantized8 model, CPU run: done ====='
    "
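The four measurement blocks above repeat one pattern: record a timestamp, run the command, record another timestamp, print the difference. A minimal sketch of that pattern as a reusable function follows. `time_cmd` is a hypothetical helper name, not part of SNPE, and the sketch assumes a `date` that supports `%N` (as the script above already does) and a shell with 64-bit integer arithmetic, which removes the need for `bc`.

```shell
#!/usr/bin/env bash
# time_cmd is a hypothetical helper for illustration; it is not an SNPE tool.
# Usage: time_cmd LABEL COMMAND [ARGS...]
# Prints "LABEL: <elapsed> ms" and returns the command's exit status.
time_cmd() {
    local label=$1
    shift
    local start_ns end_ns status
    start_ns=$(date +%s%N)   # integer nanoseconds since the epoch
    "$@"                     # run the command exactly as given
    status=$?
    end_ns=$(date +%s%N)
    # Integer division: elapsed wall-clock time, truncated to whole milliseconds.
    echo "$label: $(( (end_ns - start_ns) / 1000000 )) ms"
    return "$status"
}

# Example call, reusing the variable names from the script above:
#   time_cmd "FP32 CPU" snpe-net-run --container "$OUTPUT_DLC_FP32" \
#       --input_list list.txt --output_dir "$OUTPUT_FOLDER_FP32"
```

As with the original script, this measures the whole process from launch to exit, not the inference calls alone.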

quic-vraidu commented 2 days ago

Can you confirm that the DSP and CPU outputs are correct, i.e. that objects are detected properly on both? Can you also remove --debug and try again? It is better to run the commands directly on the device instead of from the notebook.

suhyun01150 commented 2 days ago

When I check the DSP and CPU outputs by running the post-processing function on the image, the bounding boxes appear as expected. Even when running it through ADB on the notebook, the DSP still takes longer than the CPU. The result is the same with the debug option turned off.
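The timings in this thread wrap each whole snpe-net-run process, so they include process startup, model loading, and writing the outputs as well as the inferences themselves. One rough way to separate the fixed cost from the per-inference cost, sketched here under the assumption that the same command can be run with two input lists of different lengths, is to divide the difference in total time by the difference in input count. `per_inference_ms` is a hypothetical helper name, and the numbers in the example are made up for illustration, not measurements.

```shell
# per_inference_ms is a hypothetical helper for illustration only.
# Usage: per_inference_ms T_SMALL_MS N_SMALL T_LARGE_MS N_LARGE
# Given the total wall-clock time (in ms) of two runs that differ only in the
# number of inputs, prints the extra time per additional input in whole ms.
per_inference_ms() {
    local t_small=$1 n_small=$2 t_large=$3 n_large=$4
    if [ "$n_large" -le "$n_small" ]; then
        echo "n_large must be greater than n_small" >&2
        return 1
    fi
    # Fixed startup cost cancels out in the subtraction; integer division truncates.
    echo $(( (t_large - t_small) / (n_large - n_small) ))
}

# Illustrative values only: 10 inputs took 1200 ms, 110 inputs took 3200 ms.
per_inference_ms 1200 10 3200 110   # prints 20
```

If the per-input figure for the DSP run is lower than for the CPU run while the totals still favor the CPU, the gap would come from the fixed cost rather than from inference itself.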