Closed · aaron-schneider closed 3 months ago
Can someone add more context to this issue? Is this related to https://github.com/openxla/iree/tree/main/tests/e2e/matmul ?
I updated the description
Modify the IREE Dispatch Profiler to output results in a standard format, e.g. CSV. We should choose something that can be converted readily to the format that our benchmark reporting can ingest.
The dispatch profiler reports the performance as follows:
----------------------------------------------------------------
Dispatch : matmul_splitk_512x1024x4096_f32t_f32t_f32t_tile_config_64x64_32x3_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.SplitkMatmul
Operation : matmul_splitk_512x1024x4096_f32t_f32t_f32t
Configuration : tile_config_64x64_32x3_tensorcore_mmasync
Arguments : --batch_count=1, --M=512, --N=1024, --K=4096, --lhs=f32t, --rhs=f32t,
--result=f32t, --split_k_mode=parallel, --split_k_slices=8
Verification : SUCCESS
Runtime(ms) : 0.183
GFLOPs : 23469.77
----------------------------------------------------------------
Dispatch : matmul_splitk_512x1024x4096_f32t_f32t_f32t_tile_config_64x64_16x10_tensorcore_mmasync
Provider : IREE Codegen
OpKind : OperationKind.SplitkMatmul
Operation : matmul_splitk_512x1024x4096_f32t_f32t_f32t
Configuration : tile_config_64x64_16x10_tensorcore_mmasync
Arguments : --batch_count=1, --M=512, --N=1024, --K=4096, --lhs=f32t, --rhs=f32t,
--result=f32t, --split_k_mode=parallel, --split_k_slices=8
Verification : SUCCESS
Runtime(ms) : 0.136
GFLOPs : 31580.64
Writing performance report to data.csv
Additionally, it generates reports in CSV format. An example report, data.csv:
Provider | op_kind | Operation | bytes | flops | batch_count | M | N | K | lhs | rhs | result | split_k_mode | split_k_slices | Tile config | Core class | Instruction class | Verification | Runtime(ms) | GFLOPs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IREE Codegen | matmul_splitk | matmul_splitk_512x1024x4096_f32t_f32t_f32t | 27262976 | 4294967296 | 1 | 512 | 1024 | 4096 | f32t | f32t | f32t | parallel | 2 | 128x256_16x3 | tensorcore | mmasync | SUCCESS | 0.182 | 23598.72 |
IREE Codegen | matmul_splitk | matmul_splitk_512x1024x4096_f32t_f32t_f32t | 27262976 | 4294967296 | 1 | 512 | 1024 | 4096 | f32t | f32t | f32t | parallel | 2 | 256x128_16x3 | tensorcore | mmasync | SUCCESS | 0.182 | 23598.72 |
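For reference, a minimal Python sketch of post-processing this CSV into a flat schema that a benchmark-reporting pipeline could ingest. The column names are taken from the table above (the exact headers in data.csv may differ), and the output schema here is a hypothetical placeholder, not the format our reporting actually consumes:

```python
import csv

# Read the profiler's data.csv (column names as in the rendered table above;
# verify against the actual file before relying on them).
with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Emit a flat (name, runtime, gflops, verified) report as a stand-in for
# whatever schema our benchmark reporting ends up ingesting.
with open("benchmark_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["benchmark_name", "runtime_ms", "gflops", "verified"])
    for row in rows:
        name = f'{row["Operation"]}_{row["Tile config"]}'
        writer.writerow([name, row["Runtime(ms)"], row["GFLOPs"],
                         row["Verification"] == "SUCCESS"])
```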
Define statistics over the output that we want to track for progression and regression purposes.
Can we define this with an example? What are we looking for here?
Imagine that we profile ~1000 individual dispatches in a single sweep. If we want to run regression testing on this sweep, we shouldn't do it on 1000 separate series. Ideally, we'd detect regressions on some statistic or several statistics over the set. The simplest (but I suspect a poor choice) would be just to track the average... but maybe instead we track a histogram with configurable N and run regression detection on that?
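To make the histogram idea concrete, a rough sketch assuming we track the GFLOPs column from the report; the bin edges, distance metric, and tolerance below are all placeholders, not a proposed final design:

```python
import numpy as np

def gflops_histogram(gflops, bin_edges):
    """Summarize a sweep of dispatch GFLOPs as a normalized histogram."""
    counts, _ = np.histogram(gflops, bins=bin_edges)
    # Normalize so sweeps of different sizes remain comparable.
    return counts / counts.sum()

def looks_like_regression(baseline, current, bin_edges, tol=0.05):
    """Flag a regression if too much probability mass shifted between the
    baseline and current histograms (simple L1-distance heuristic)."""
    b = gflops_histogram(baseline, bin_edges)
    c = gflops_histogram(current, bin_edges)
    return np.abs(b - c).sum() > tol

# Hypothetical usage: N is configured via the bin edges.
edges = np.linspace(0, 40000, num=21)  # 20 bins over 0-40 TFLOP/s
# baseline = GFLOPs column from last week's data.csv
# current  = GFLOPs column from today's data.csv
# if looks_like_regression(baseline, current, edges): alert(...)
```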
@qcolombet: Command lines to generate the matmuls and profile them are as follows:
iree-build $ python3 ../iree/experimental/dispatch_profiler/generator.py
iree-build $ python3 ../iree/experimental/dispatch_profiler/profiler.py --output=data.csv
@manishucsd I was expecting to see multiple invocations here, one for each desired sweep, each with its own specific arguments? Is that not what we're shooting for?
The command line above will generate and profile whatever is present in the dispatch profiler by default. One can filter a subset of it by providing command line arguments.
I see. How long does this take to run? I expected that we would want to specify specific sweeps, etc. so that the benchmark is bounded?
If you just run the default commands (i.e., no filtering), it takes about 2 min on an A100 machine with a weak CPU (compile time included). This will go up as we add the unaligned cases. But then we can start thinking of filtering if that becomes a problem.
Thanks for the info. I think I put my finger on why I expected this filtering to happen at benchmark generation time:

a) I'm assuming that we want multiple distinct corpuses of micro-benchmarks over which we create reports, have some level of statistics-level tracking and regression detection, and which can easily be repro'ed and analyzed by engineers. We want these corpuses to remain stable over time, i.e. we don't want adding new benchmark capabilities to IREE Dispatch Profiler to alter the results of existing benchmarks.

b) We could accomplish (a) with filtering after the run... but if we do the filtering at generation time, the purpose of each benchmark is clear, the repro steps are clear, and the results are easy to grok. Filtering after the run would require an extra tool... and we wouldn't want this to be a concern of the benchmarking system itself.
Does that make sense?
Thanks for confirming. We are working on the IREE dispatch profiler to add more features and speed up compilation times. It does filtering when passed --dispatches=<regex>. We plan to add command line arguments to generate and profile different op shapes from the command line, rather than by changing the default shapes in the IREE dispatch profiler Python code.
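For example, to profile only the splitk matmul shape shown in the reports above (the regex here is illustrative; the flag comes from the comment above, but the exact pattern it matches against should be checked against the generated dispatch names):

iree-build $ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches=matmul_splitk_512x1024x4096 --output=data.csv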
@julianwa This Epic looks on track, WIP. Please provide any updates or risks.
I have moved all the issues related to the IREE dispatch profiler to another epic #13494. This epic strictly contains issues related to CI. cc: @mattwalsh, @allieculp
@julianwa Please update this Epic!
@julianwa Please update this epic.
Dispatch profiler code was removed in https://github.com/iree-org/iree/commit/c2114b897a8a53cd8d0edde1024420dc64a9cbdd, so going to close this issue.
We've had a few other versions of matmul benchmark suites too. Most recently https://github.com/nod-ai/rocm-gemm-benchmark/ (which has a nightly CI). Might restart some continuous benchmarking work here in iree-org soon.
We want to use https://github.com/openxla/iree/tree/main/experimental/dispatch_profiler to report performance statistics over corpuses of matmul micro-benchmarks, and ideally detect regressions in those statistics.