google-ai-edge / LiteRT

LiteRT is the new name for TensorFlow Lite (TFLite). While the name is new, it's still the same trusted, high-performance runtime for on-device AI, now with an expanded vision.
https://ai.google.dev/edge/litert
Apache License 2.0
170 stars · 14 forks

TFLite: Android benchmark test gets different results with the GPU delegate #75

Open gaikwadrahul8 opened 6 days ago

gaikwadrahul8 commented 6 days ago

Hi. When I use the benchmark script and the benchmark apk to test my model's performance, I get the same performance on the CPU with the XNNPACK delegate, but different performance on the GPU with the OpenCL delegate.

And the results:

| Model (time, ms) | script CPU | script GPU | apk CPU | apk GPU |
| --- | --- | --- | --- | --- |
| mobilenetv2 | 5.7 | 7.4 | 5.6 | 4.3 |
| mobilenetv3_small | 1.8 | 4.9 | 1.7 | 2.1 |
| mobilenetv3_large | 5.3 | 6.0 | 5.0 | 3.8 |

The benchmark apk is almost twice as fast as the benchmark script on GPU.
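As a quick sanity check on that claim (not part of the original report; the values are copied from the table above), the per-model GPU speedups can be computed with a throwaway awk one-liner:

```shell
# GPU latencies (ms) copied from the table above: model, script GPU, apk GPU
printf '%s\n' \
  'mobilenetv2 7.4 4.3' \
  'mobilenetv3_small 4.9 2.1' \
  'mobilenetv3_large 6.0 3.8' \
  | awk '{ printf "%s: apk is %.2fx faster on GPU\n", $1, $2 / $3 }'
# mobilenetv2: apk is 1.72x faster on GPU
# mobilenetv3_small: apk is 2.33x faster on GPU
# mobilenetv3_large: apk is 1.58x faster on GPU
```

So the gap ranges from roughly 1.6x to 2.3x depending on the model.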

Q: What is the difference between the benchmark script and the apk?

Running the benchmark script

Log of benchmark script:

╰─$ adb install -r -d -g android_aarch64_benchmark_model.apk
╰─$ adb shell /data/local/tmp/android_aarch64_benchmark_model   --graph=/data/local/tmp/mobilenetv3_large.tflite \
  --num_threads=4  --num_runs=50 --use_gpu=true
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Min num runs: [50]
INFO: Num threads: [4]
INFO: Graph: [/data/local/tmp/mobilenetv3_large.tflite]
INFO: #threads used for CPU inference: [4]
INFO: Use gpu: [1]
INFO: Loaded model /data/local/tmp/mobilenetv3_large.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
INFO: GPU delegate created.
VERBOSE: Replacing 126 out of 126 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions for the whole graph.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
INFO: Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 21.9417
INFO: Initialized session in 1342.81ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=72 first=11653 curr=6233 min=5959 max=11653 avg=6835.62 std=903

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=148 first=6394 curr=6988 min=6096 max=8656 avg=6650.52 std=446

INFO: Inference timings in us: Init: 1342809, First inference: 11653, Warmup (avg): 6835.62, Inference (avg): 6650.52
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=108.359 overall=108.359
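When comparing several such runs, the average inference time can be pulled out of the tool's output with a small grep/sed pipeline. This is not part of the original report; the field layout is assumed from the `Inference timings in us` line above:

```shell
# Extract the final average inference time (us) from benchmark output
# piped in on stdin. Assumes the "Inference timings in us:" line format
# shown in the log above.
extract_avg() {
  grep 'Inference timings in us' \
    | sed -E 's/.*Inference \(avg\): ([0-9.]+).*/\1/'
}

# Example, using the timing line from the script run above:
printf '%s\n' 'INFO: Inference timings in us: Init: 1342809, First inference: 11653, Warmup (avg): 6835.62, Inference (avg): 6650.52' \
  | extract_avg
# 6650.52
```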

Running the benchmark apk

╰─$ adb shell am start -S \
  -n org.tensorflow.lite.benchmark/.BenchmarkModelActivity \
  --es args '"--graph=/data/local/tmp/mobilenetv3_large.tflite \
              --num_threads=4 --num_runs=50 --use_gpu=true"'

Log from logcat:

03-04 16:23:46.821  6210  6210 I tflite  : Initialized OpenCL-based API.
03-04 16:23:46.859  6210  6210 I tflite  : Created 1 GPU delegate kernels.
03-04 16:23:46.859  6210  6210 I tflite  : Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.
03-04 16:23:46.859  6210  6210 I tflite  : The input model file size (MB): 21.9417
03-04 16:23:46.859  6210  6210 I tflite  : Initialized session in 856.303ms.
03-04 16:23:46.860  6210  6210 I tflite  : Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
03-04 16:23:47.164  8268  9284 I DisplayFrameSetting: homeToAppEnd pkg=org.tensorflow.lite.benchmark
03-04 16:23:47.363  6210  6210 I tflite  : count=124 first=4601 curr=3901 min=3882 max=4601 avg=4018.02 std=126
03-04 16:23:47.363  6210  6210 I tflite  : Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
03-04 16:23:48.153  1895  3405 I MiuiNetworkPolicy: bandwidth: 0 KB/s, Max bandwidth: 200 KB/s
03-04 16:23:48.365  6210  6210 I tflite  : count=234 first=3909 curr=4552 min=3856 max=5840 avg=4207.86 std=334
03-04 16:23:48.366  6210  6210 I tflite  : Inference timings in us: Init: 856303, First inference: 4601, Warmup (avg): 4018.02, Inference (avg): 4207.86
03-04 16:23:48.366  6210  6210 I tflite  : Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
03-04 16:23:48.366  6210  6210 I tflite  : Memory footprint delta from the start of the tool (MB): init=103.145 overall=103.145
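Not from the original report: the benchmark's lines are tagged `tflite` in logcat (as in the capture above), so they can be watched live with `adb logcat -s tflite`, or isolated from saved logcat text with a plain grep. A sketch, using two lines copied from the capture:

```shell
# Isolate the benchmark's own lines ("tflite" tag, as in the capture
# above) from logcat text piped in; the system lines (e.g.
# MiuiNetworkPolicy) are dropped.
printf '%s\n' \
  '03-04 16:23:48.153  1895  3405 I MiuiNetworkPolicy: bandwidth: 0 KB/s, Max bandwidth: 200 KB/s' \
  '03-04 16:23:48.366  6210  6210 I tflite  : Inference timings in us: Init: 856303, First inference: 4601, Warmup (avg): 4018.02, Inference (avg): 4207.86' \
  | grep 'I tflite'
```

This prints only the second line, the benchmark's timing summary.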

System information

gaikwadrahul8 commented 5 days ago

This issue, originally reported by @zihaomu, has been moved to this dedicated LiteRT repository to improve issue tracking and prioritization. To ensure continuity, we have created this new issue on your behalf.

We appreciate your understanding and look forward to your continued involvement.