antiagainst closed this 5 days ago
Benchmark results @ commit b918d15fd1fa968cb5c401910f67edd7cd702c58 (vs. base f4279657ef8da12d07f068a37cbd93986edb47d8)

**Regressed latencies:**

Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
matmul\_2562x2561x2561\_f32t\_f32t\_f32t\_tile\_config\_default(linalg) [cuda-sm\_80-linux\_gnu-cuda][ukernel,matmul] cuda(none)[full-inference,default-flags] with default @ a2-highgpu-1g[gpu] | 1.534 (vs. 1.368, 12.15%↑) | 1.534 | 0.001 |
matmul\_123x2561x2561\_f32t\_f32t\_f32t\_tile\_config\_default(linalg) [cuda-sm\_80-linux\_gnu-cuda][ukernel,matmul] cuda(none)[full-inference,default-flags] with default @ a2-highgpu-1g[gpu] | 0.222 (vs. 0.200, 11.16%↑) | 0.222 | 0.000 |
MobileBertSquad\_int8(tflite) [arm-valhall-vulkan\_android31-vulkan\_spirv][default-flags] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 94.955 (vs. 86.395, 9.91%↑) | 95.940 | 2.322 |

[Top 3 of 4 results shown]

**Improved latencies:**

Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
---|---|---|---|
matmul\_3456x1024x2048\_f32t\_tile\_config\_default(linalg) [cuda-sm\_80-linux\_gnu-cuda][ukernel,matmul] cuda(none)[full-inference,default-flags] with default @ a2-highgpu-1g[gpu] | 0.130 (vs. 0.166, 21.53%↓) | 0.130 | 0.000 |
MobileBertSquad\_int8(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,dt-only] local\_sync(embedded\_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 1069.568 (vs. 1222.156, 12.49%↓) | 1070.319 | 4.894 |
MobileBertSquad\_int8(tflite) [armv8.2-a-generic-linux\_android29-llvm\_cpu][experimental-flags,dt-only] local\_task(embedded\_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 584.437 (vs. 652.467, 10.43%↓) | 588.813 | 12.434 |

[Top 3 of 21 results shown]

**Improved total dispatch sizes:**

Benchmark Name | Total Dispatch Size (bytes) |
---|---|
GPT2\_117M\_TF\_1X1XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,dt-only,compile-stats] | 11392 (vs. 12864, 11.44%↓) |
GPT2\_117M\_TF\_1X1XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][default-flags,dt-uk,compile-stats] | 11280 (vs. 12336, 8.56%↓) |
GPT2\_117M\_TF\_1X4XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,dt-only,compile-stats] | 18224 (vs. 19328, 5.71%↓) |

[Top 3 of 6 results shown]

**Regressed Stream IR dispatch counts:**

Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
---|---|
GPT2\_117M\_TF\_1X4XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,dt-only,compile-stats] | 330 (vs. 318, 3.77%↑) |
GPT2\_117M\_TF\_1X4XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][default-flags,dt-uk,compile-stats] | 330 (vs. 318, 3.77%↑) |
GPT2\_117M\_TF\_1X4XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk,compile-stats] | 330 (vs. 318, 3.77%↑) |

[Top 3 of 10 results shown]

**Improved Stream IR dispatch counts:**

Benchmark Name | Stream IR Dispatch Count (# of cmd.dispatch ops) |
---|---|
GPT2\_117M\_TF\_1X1XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][experimental-flags,dt-only,compile-stats] | 355 (vs. 367, 3.27%↓) |
GPT2\_117M\_TF\_1X1XI32(stablehlo) [x86\_64-cascadelake-linux\_gnu-llvm\_cpu][default-flags,dt-uk,compile-stats] | 355 (vs. 367, 3.27%↓) |
GPT2\_117M\_TF\_1X1XI32(stablehlo) [armv8.2-a-generic-linux\_android29-llvm\_cpu][default-flags,dt-uk,compile-stats] | 355 (vs. 367, 3.27%↓) |

[Top 3 of 6 results shown]
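
(Side note on reproducing the dispatch-count metric: it is just the number of `stream.cmd.dispatch` ops in the Stream IR, so a rough local check looks like the sketch below. This assumes `iree-compile` is built and uses `model.mlir` as a stand-in for the benchmark's imported module; the flags shown are illustrative, not the bot's exact invocation.)

```sh
# Rough sketch, not the benchmark bot's actual invocation:
# stop compilation after the Stream phase, print the IR to
# stdout, and count lines containing dispatch commands.
iree-compile --compile-to=stream \
    --iree-hal-target-backends=llvm-cpu \
    model.mlir -o - \
  | grep -c "stream.cmd.dispatch"
```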
This seems like a fairly likely candidate for the source of the dispatch count changes, especially since they show up in the data-tiling-enabled benchmarks: https://github.com/iree-org/llvm-project/commit/7ef83f5561b34ca07fdef23ca2b3c01c583dbbf5

cc @Max191
We need to look at the regressions in the number of dispatches. I can help (but not today).
@MaheshRavishankar are you blocking the integrate on this, or would you look at it in a follow-up, since Quinn has explained the possible reason for the difference?
Could you try reverting that locally to see if that is the issue? Then we can decide what to do next.
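
A minimal sketch of that local revert, assuming the usual IREE checkout layout with llvm-project vendored as a submodule (the paths and rebuild step are illustrative):

```sh
# Revert the candidate commit inside the vendored llvm-project
# submodule (path assumed), then rebuild and re-run the
# compile-stats benchmarks to re-check the dispatch counts.
cd third_party/llvm-project
git revert 7ef83f5561b34ca07fdef23ca2b3c01c583dbbf5
cd ../..
cmake --build build/   # build directory assumed
```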
@MaheshRavishankar PTAL at the benchmark comment now; the bot has edited it, and it seems the dispatch count regression is gone with the revert.
Updated to llvm/llvm-project@27ac46e6bea2

* Switched to LLVM's `MathExtras.h` to replace the MLIR one
* Updated `applySignatureConversion` usage

Updated to openxla/stablehlo@dd48ec5

* `chlo.minimum_broadcast_shapes` op was removed: https://github.com/openxla/stablehlo/pull/2287
* `chlo.dynamic_reshape` op was removed: https://github.com/openxla/stablehlo/pull/2286

Updated to llvm/torch-mlir@77d7f64