iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

GPT2_117M_TF with ukernel-enabled runs slower on Pixel 8 than 6 #16084

Open pzread opened 10 months ago

pzread commented 10 months ago

As observed in https://github.com/openxla/iree/pull/15796, GPT2_117M_TF_1X1XI32 and GPT2_117M_TF_1X4XI32 run much slower on Pixel 8 than on Pixel 6 when data-tiling and ukernels are enabled.

I downloaded the GPT2_117M_TF_1X1XI32 and tested it manually on the Pixel 6 and 8. Here are the results:

Pixel 6:

Running ./iree-benchmark-module
Run on (8 X 1803 MHz CPU s)
Load Average: 0.85, 0.91, 0.96
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time              25.1 ms         25.2 ms           29 items_per_second=39.8461/s
BM_forward/process_time/real_time              25.4 ms         25.4 ms           29 items_per_second=39.4384/s
BM_forward/process_time/real_time              25.3 ms         25.4 ms           29 items_per_second=39.5515/s
BM_forward/process_time/real_time              25.4 ms         25.3 ms           29 items_per_second=39.3442/s
BM_forward/process_time/real_time              25.1 ms         25.2 ms           29 items_per_second=39.7877/s
BM_forward/process_time/real_time              25.0 ms         25.1 ms           29 items_per_second=40.0569/s
BM_forward/process_time/real_time              25.1 ms         25.3 ms           29 items_per_second=39.7725/s
BM_forward/process_time/real_time              25.5 ms         25.4 ms           29 items_per_second=39.2776/s
BM_forward/process_time/real_time              25.3 ms         25.4 ms           29 items_per_second=39.4798/s
BM_forward/process_time/real_time              25.4 ms         25.5 ms           29 items_per_second=39.3492/s
BM_forward/process_time/real_time_mean         25.3 ms         25.3 ms           10 items_per_second=39.5904/s
BM_forward/process_time/real_time_median       25.3 ms         25.3 ms           10 items_per_second=39.5156/s
BM_forward/process_time/real_time_stddev      0.165 ms        0.134 ms           10 items_per_second=0.260023/s
BM_forward/process_time/real_time_cv           0.65 %          0.53 %            10 items_per_second=0.66%

Pixel 8:

Running ./iree-benchmark-module
Run on (9 X 1704 MHz CPU s)
Load Average: 1.03, 1.00, 1.00
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time              78.6 ms         76.4 ms           15 items_per_second=12.7166/s
BM_forward/process_time/real_time               117 ms          117 ms           15 items_per_second=8.51359/s
BM_forward/process_time/real_time               117 ms          117 ms           15 items_per_second=8.56645/s
BM_forward/process_time/real_time               107 ms          107 ms           15 items_per_second=9.34617/s
BM_forward/process_time/real_time               117 ms          116 ms           15 items_per_second=8.58061/s
BM_forward/process_time/real_time               116 ms          116 ms           15 items_per_second=8.59239/s
BM_forward/process_time/real_time              60.9 ms         60.8 ms           15 items_per_second=16.4288/s
BM_forward/process_time/real_time              48.5 ms         48.5 ms           15 items_per_second=20.6194/s
BM_forward/process_time/real_time               110 ms          110 ms           15 items_per_second=9.05357/s
BM_forward/process_time/real_time               109 ms          109 ms           15 items_per_second=9.14409/s
BM_forward/process_time/real_time_mean         98.2 ms         97.8 ms           10 items_per_second=11.1562/s
BM_forward/process_time/real_time_median        110 ms          110 ms           10 items_per_second=9.09883/s
BM_forward/process_time/real_time_stddev       25.8 ms         25.9 ms           10 items_per_second=4.19563/s
BM_forward/process_time/real_time_cv          26.25 %         26.48 %            10 items_per_second=37.61%
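
For scale, the summary rows above work out to roughly a 3.9x slowdown by mean and 4.3x by median, with run-to-run variation of about 26% on the Pixel 8 against under 1% on the Pixel 6. A minimal sketch of that arithmetic, using only the numbers reported in the two tables (the helper name `ratio` is made up for illustration):

```shell
# ratio NUM DEN: print NUM/DEN rounded to two decimals.
ratio() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Mean latency in ms, Pixel 8 vs Pixel 6, taken from the tables above.
ratio 98.2 25.3
# Median latency in ms, Pixel 8 vs Pixel 6.
ratio 110 25.3
# Coefficient of variation on Pixel 8: stddev / mean, as a percentage.
awk 'BEGIN { printf "%.1f%%\n", 100 * 25.8 / 98.2 }'
```

The inputs are the rounded values printed by the benchmark tool, so the last figure differs slightly from the 26.25 % the tool reports from unrounded data.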

To reproduce

# Download VMFB
gcloud storage cp \
  'gs://iree-github-actions-presubmit-artifacts/7451637100/1/e2e-test-artifacts/iree_module_GPT2_117M_TF_1X1XI32_stablehlo___armv9-a-generic-linux_android34-llvm_cpu__default-flags_dt-uk_/module.vmfb' \
  /tmp

# Download benchmark tool
gcloud storage cp -r gs://iree-github-actions-presubmit-artifacts/7451637100/1/benchmark-tools /tmp
tar -xvf /tmp/benchmark-tools/android-armv8.2-a-benchmark-tools.tar
# Use android-armv8.2-a-benchmark-tools-dir/build/tools/iree-benchmark-module

Push the files to the Pixel 8 and run:

./iree-benchmark-module \
  --module=module.vmfb \
  --function=forward \
  --device_allocator=caching \
  --device=local-task \
  --input=1x1xi32=0 \
  --input=12x2x1x12x4x64xf32=0 \
  --benchmark_repetitions=10 \
  --task_topology_cpu_ids=0
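
The push step is not spelled out in the thread. One way it might look, as a sketch under assumptions: `adb` is on the host PATH, a single device is attached, and `/data/local/tmp` is used as the on-device working directory. The function name `push_and_run` and the `ADB` and `DEVICE_DIR` variables are hypothetical; the benchmark flags are copied from the command above.

```shell
# Assumptions (not from the thread): adb on PATH, one attached device,
# and /data/local/tmp as the working directory on the device.
ADB="${ADB:-adb}"
DEVICE_DIR="${DEVICE_DIR:-/data/local/tmp}"

# push_and_run TOOL MODULE: copy both files to the device, then benchmark.
push_and_run() {
  tool="$1"    # local path to iree-benchmark-module
  module="$2"  # local path to module.vmfb
  "$ADB" push "$tool" "$DEVICE_DIR/iree-benchmark-module"
  "$ADB" push "$module" "$DEVICE_DIR/module.vmfb"
  # The whole command line is passed as one string to the device shell.
  "$ADB" shell "cd $DEVICE_DIR && chmod +x iree-benchmark-module && \
    ./iree-benchmark-module \
      --module=module.vmfb \
      --function=forward \
      --device_allocator=caching \
      --device=local-task \
      --input=1x1xi32=0 \
      --input=12x2x1x12x4x64xf32=0 \
      --benchmark_repetitions=10 \
      --task_topology_cpu_ids=0"
}
```

It would be called from the host as `push_and_run android-armv8.2-a-benchmark-tools-dir/build/tools/iree-benchmark-module /tmp/module.vmfb`.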

To compile the VMFB:

gcloud storage cp -r "gs://iree-github-actions-presubmit-artifacts/7451637100/1/e2e-test-artifacts/model_GPT2_117M_TF_1X1XI32.mlir" /tmp/

iree-compile \
  /tmp/model_GPT2_117M_TF_1X1XI32.mlir \
  -o module.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-input-type=stablehlo \
  --iree-llvmcpu-target-triple=aarch64-none-linux-android34 \
  --iree-opt-data-tiling=true \
  --iree-llvmcpu-enable-ukernels=all \
  --iree-llvmcpu-target-cpu-features=+v9a,+fullfp16,+fp-armv8,+neon,+aes,+sha2,+crc,+lse,+rdm,+complxnum,+rcpc,+sha3,+sm4,+dotprod,+fp16fml,+dit,+flagm,+ssbs,+sb,+sve2-aes,+sve2-bitperm,+sve2-sha3,+sve2-sm4,+altnzcv,+fptoint,+bf16,+i8mm,+bti,+mte,+pauth,+perfmon,+predres,+spe,+ras

mariecwhite commented 10 months ago

There's a lot of variation in the Pixel 8 latencies, even for a single-threaded run. Was the CPU frequency fixed?

mariecwhite commented 10 months ago

Also, for Android T and above, you can add settaskprofile $$ MaxPerformance; <benchmark command> to make sure the benchmark is the top app.
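
A sketch of how that prefix might be combined with the benchmark invocation from a host machine. The helper name `with_max_performance` is hypothetical; the one detail worth getting right is quoting, so that `$$` is expanded by the shell on the device rather than by the host shell.

```shell
# with_max_performance CMD: print CMD prefixed with the task-profile call.
# Single quotes keep $$ literal here, so it is the device shell that
# later expands it to its own PID.
with_max_performance() {
  printf '%s %s\n' 'settaskprofile $$ MaxPerformance;' "$1"
}

# Possible use from the host (assumes adb and files under /data/local/tmp):
#   adb shell "$(with_max_performance \
#     'cd /data/local/tmp && ./iree-benchmark-module --module=module.vmfb --function=forward')"
```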

pzread commented 10 months ago

We are currently using build_tools/benchmarks/set_android_scaling_governor.sh and at least the script ran successfully.
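
The script exiting successfully does not by itself show that the frequency stayed pinned during the run. A minimal sketch for reading the state back on the device, assuming the standard Linux cpufreq sysfs layout is exposed and readable there; it makes no assumption about what set_android_scaling_governor.sh itself changes, and the function name `show_cpufreq` is made up for illustration.

```shell
# show_cpufreq [ROOT]: print the governor and current frequency per CPU.
# ROOT defaults to the standard Linux cpufreq sysfs location.
show_cpufreq() {
  root="${1:-/sys/devices/system/cpu}"
  for dir in "$root"/cpu[0-9]*; do
    # Skip entries without cpufreq data (or an unmatched glob).
    [ -r "$dir/cpufreq/scaling_governor" ] || continue
    printf '%s governor=%s cur_khz=%s\n' \
      "${dir##*/}" \
      "$(cat "$dir/cpufreq/scaling_governor")" \
      "$(cat "$dir/cpufreq/scaling_cur_freq" 2>/dev/null)"
  done
}
```

Running it in an `adb shell` session before and after the benchmark would show whether the governor or frequency of the pinned core changed in between.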

Another thing I noticed is that we are not using a cooling plate to cool the Pixel 8, which might also be a reason. Sent #16090 to revert the migration until we know the numbers are stable.
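
If heat is the suspect, one rough check is to sample the device's thermal zones around a benchmark run. This is only a sketch: it assumes the standard Linux thermal sysfs layout is readable on the device, the function name `show_thermal` is hypothetical, and the `temp` values are typically reported in millidegrees Celsius, though that can vary by driver.

```shell
# show_thermal [ROOT]: print "<zone> <type> <temp>" for each thermal zone.
# ROOT defaults to the standard Linux thermal sysfs location.
show_thermal() {
  root="${1:-/sys/class/thermal}"
  for zone in "$root"/thermal_zone*; do
    # Skip zones without a readable temperature (or an unmatched glob).
    [ -r "$zone/temp" ] || continue
    printf '%s %s %s\n' "${zone##*/}" \
      "$(cat "$zone/type" 2>/dev/null)" "$(cat "$zone/temp")"
  done
}
```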