MaheshRavishankar opened this issue 1 week ago (status: Open)
@ commit 50035d59537fa95e5b3d8b194b6c0883b87f1395 (vs. base 7b58c712a1c6bc1a13fc4525ef07b0030a950d86)
**Regressed latencies**

Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
MobileBertSquad\_fp16(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][experimental-flags,fuse-padding,max-concurrency,demote-f32-to-f16] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 105.631 (vs. 90.653, 16.52%↑) | 104.053 | 3.247 |
MobileNetV1\_fp32(tflite) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt] local\_task(embedded\_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 27.362 (vs. 24.190, 13.11%↑) | 27.261 | 0.436 |
MobileBertSquad\_fp16(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][default-flags,demote-f32-to-f16] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 83.602 (vs. 75.567, 10.63%↑) | 83.676 | 0.301 |
[Top 3 out of 5 results shown]
**Improved latencies**

Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_task(embedded\_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 28.978 (vs. 31.362, 7.60%↓) | 28.946 | 0.784 |
Vit\_int8(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,no-dt] local\_task(embedded\_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1096.299 (vs. 1182.442, 7.29%↓) | 1101.193 | 16.869 |
GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk] local\_sync(embedded\_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 27.097 (vs. 29.182, 7.15%↓) | 27.375 | 0.874 |
[Top 3 out of 13 results shown]
**Regressed total dispatch sizes**

Benchmark Name | Total Dispatch Size (bytes) |
---|---|
BertLargeTF(stablehlo) [cuda-sm\_80-linux\_gnu-cuda][default-flags,compile-stats] | 168240 (vs. 130696, 28.73%↑) |
MiniLML12H384Uncased(stablehlo) [cuda-sm\_80-linux\_gnu-cuda][default-flags,compile-stats] | 184872 (vs. 145080, 27.43%↑) |
Vit\_int8(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,no-dt,compile-stats] | 839376 (vs. 694752, 20.82%↑) |
[Top 3 out of 12 results shown]
**Improved Stream IR dispatch counts**

Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
---|---|
BertLargeTF(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt,compile-stats] | 365 (vs. 413, 11.62%↓) |
BertLargeTF(stablehlo) [cuda-sm\_80-linux\_gnu-cuda][default-flags,compile-stats] | 365 (vs. 413, 11.62%↓) |
MiniLML12H384Uncased(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,no-dt,compile-stats] | 185 (vs. 209, 11.48%↓) |
[Top 3 out of 29 results shown]
For dispatch formation, the current logic (and a lot of the code generation) works much better if the consumer accesses the producer's result through an identity indexing map. There is already a pass in the dispatch region formation flow that does this, but only for convolution ops. Make this apply to more general cases. A minimal sketch of the pattern is shown below.
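As an illustration only (this is not the existing pass, and all names, shapes, and maps are made up for the sketch), here is a hypothetical `linalg` example where the consumer reads the producer's result through a transposed, non-identity indexing map (`#map_t`). Interchanging the consumer's loops would make that operand's map the identity, which is the situation the generalized pass should establish for producer operands beyond just convolutions.

```mlir
// Hypothetical example: an elementwise producer followed by a consumer that
// reads the producer result through a transposed (non-identity) map #map_t.
#map_id = affine_map<(d0, d1) -> (d0, d1)>
#map_t  = affine_map<(d0, d1) -> (d1, d0)>

func.func @example(%a: tensor<4x8xf32>, %b: tensor<8x4xf32>) -> tensor<8x4xf32> {
  %init0 = tensor.empty() : tensor<4x8xf32>
  %producer = linalg.generic
      {indexing_maps = [#map_id, #map_id],
       iterator_types = ["parallel", "parallel"]}
      ins(%a : tensor<4x8xf32>) outs(%init0 : tensor<4x8xf32>) {
  ^bb0(%in: f32, %out: f32):
    %add = arith.addf %in, %in : f32
    linalg.yield %add : f32
  } -> tensor<4x8xf32>

  %init1 = tensor.empty() : tensor<8x4xf32>
  // The producer operand is accessed through #map_t. Interchanging the
  // consumer's loops (d0 <-> d1) would make this operand's map the identity
  // and move the permutation onto the other operand/result maps instead.
  %consumer = linalg.generic
      {indexing_maps = [#map_t, #map_id, #map_id],
       iterator_types = ["parallel", "parallel"]}
      ins(%producer, %b : tensor<4x8xf32>, tensor<8x4xf32>)
      outs(%init1 : tensor<8x4xf32>) {
  ^bb0(%p: f32, %bv: f32, %out: f32):
    %mul = arith.mulf %p, %bv : f32
    linalg.yield %mul : f32
  } -> tensor<8x4xf32>
  return %consumer : tensor<8x4xf32>
}
```

With the producer operand accessed through an identity map, the producer/consumer pair is a much better candidate for fusion into a single dispatch, which is what the current convolution-only logic already exploits.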