iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[Epic] Matmul benchmarking and regression testing in CI #13259

Closed: aaron-schneider closed this issue 3 months ago

aaron-schneider commented 1 year ago

We want to use https://github.com/openxla/iree/tree/main/experimental/dispatch_profiler to report performance statistics over corpuses of matmul micro-benchmarks, and ideally detect regressions in those statistics.

### Tasks
- [x] #12943 
- [x] #12919
- [ ] #13464
- [ ] Define statistics over the output that we want to track for progression and regression purposes

ScottTodd commented 1 year ago

Can someone add more context to this issue? Is this related to https://github.com/openxla/iree/tree/main/tests/e2e/matmul ?

julianwa commented 1 year ago

Can someone add more context to this issue? Is this related to https://github.com/openxla/iree/tree/main/tests/e2e/matmul ?

I updated the description

manishucsd commented 1 year ago

Modify the IREE dispatch profiler to output results in a standard format, e.g. CSV. We should choose something that can be readily converted to the format that our benchmark reporting can ingest.

The dispatch profiler reports the performance as follows:

---------------------------------------------------------------- 
Dispatch      : matmul_splitk_512x1024x4096_f32t_f32t_f32t_tile_config_64x64_32x3_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.SplitkMatmul
Operation     : matmul_splitk_512x1024x4096_f32t_f32t_f32t
Configuration : tile_config_64x64_32x3_tensorcore_mmasync
Arguments     : --batch_count=1, --M=512, --N=1024, --K=4096, --lhs=f32t, --rhs=f32t,
                --result=f32t, --split_k_mode=parallel, --split_k_slices=8
Verification  : SUCCESS
Runtime(ms)   : 0.183
GFLOPs        : 23469.77
---------------------------------------------------------------- 
Dispatch      : matmul_splitk_512x1024x4096_f32t_f32t_f32t_tile_config_64x64_16x10_tensorcore_mmasync
Provider      : IREE Codegen
OpKind        : OperationKind.SplitkMatmul
Operation     : matmul_splitk_512x1024x4096_f32t_f32t_f32t
Configuration : tile_config_64x64_16x10_tensorcore_mmasync
Arguments     : --batch_count=1, --M=512, --N=1024, --K=4096, --lhs=f32t, --rhs=f32t,
                --result=f32t, --split_k_mode=parallel, --split_k_slices=8
Verification  : SUCCESS
Runtime(ms)   : 0.136
GFLOPs        : 31580.64
Writing performance report to data.csv

Additionally, it generates reports in CSV format. Example report, data.csv:

| Provider | op_kind | Operation | bytes | flops | batch_count | M | N | K | lhs | rhs | result | split_k_mode | split_k_slices | Tile config | Core class | Instruction class | Verification | Runtime(ms) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IREE Codegen | matmul_splitk | matmul_splitk_512x1024x4096_f32t_f32t_f32t | 27262976 | 4294967296 | 1 | 512 | 1024 | 4096 | f32t | f32t | f32t | parallel | 2 | 128x256_16x3 | tensorcore | mmasync | SUCCESS | 0.182 | 23598.72 |
| IREE Codegen | matmul_splitk | matmul_splitk_512x1024x4096_f32t_f32t_f32t | 27262976 | 4294967296 | 1 | 512 | 1024 | 4096 | f32t | f32t | f32t | parallel | 2 | 256x128_16x3 | tensorcore | mmasync | SUCCESS | 0.182 | 23598.72 |
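
To make "something that can be converted readily" concrete, here is a minimal sketch of a post-processing step that maps each data.csv row into a flat report. The JSON schema ("name", "runtime_ms", "gflops") is purely hypothetical, and the column names are taken from the example above, so the real CSV headers may differ:

```python
# Rough sketch only: convert the dispatch profiler's data.csv into a flat
# JSON list. The output schema ("name", "runtime_ms", "gflops") is
# hypothetical, not an agreed-upon reporting format, and the column names
# are assumed from the example table above.
import csv
import json

def csv_to_report(csv_path: str, json_path: str) -> None:
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    report = [
        {
            # Identify a dispatch by its operation plus tile configuration.
            "name": f'{row["Operation"]}_{row["Tile config"]}',
            "verification": row["Verification"],
            "runtime_ms": float(row["Runtime(ms)"]),
            "gflops": float(row["GFLOPs"]),
        }
        for row in rows
    ]
    with open(json_path, "w") as f:
        json.dump(report, f, indent=2)

csv_to_report("data.csv", "report.json")
```
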
manishucsd commented 1 year ago

Define statistics over the output that we want to track for progression and regression purposes

Can we define this with an example? What are we looking for here?

julianwa commented 1 year ago

Define statistics over the output that we want to track for progression and regression purposes

Can we define this with an example? What are we looking for here?

Imagine that we profile ~1000 individual dispatches in a single sweep. If we want to run regression testing on this sweep, we shouldn't do it on 1000 separate series. Ideally, we'd detect regressions on some statistic or several statistics over the set. The simplest (but I suspect a poor choice) would be just to track the average... but maybe instead we track a histogram with configurable N and run regressions on that?
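
A minimal sketch of the histogram idea, assuming the data.csv format shown above; the bin count, value range, column name, and tolerance are all placeholder choices, and baseline.csv stands for a hypothetical saved report from an earlier run:

```python
# Illustrative only: bucket each sweep's GFLOPs into N bins and flag a
# regression when the distribution shifts. Bin count, value range, column
# name, and the 5% tolerance are placeholders, not a settled design.
import csv

def gflops_histogram(csv_path, bins=20, lo=0.0, hi=50000.0):
    counts = [0] * bins
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = float(row["GFLOPs"])
            index = min(int((value - lo) / (hi - lo) * bins), bins - 1)
            counts[index] += 1
    return counts

def looks_like_regression(baseline, current, tolerance=0.05):
    # Normalized L1 distance between the two histograms; a real check would
    # also consider the direction of the shift and per-bin significance.
    total = sum(baseline) or 1
    distance = sum(abs(b - c) for b, c in zip(baseline, current)) / total
    return distance > tolerance

baseline = gflops_histogram("baseline.csv")  # previous sweep's report
current = gflops_histogram("data.csv")       # this run's report
print("regression suspected" if looks_like_regression(baseline, current) else "ok")
```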

manishucsd commented 1 year ago

@qcolombet: Command lines to generate the matmuls and profile them are as follows:

iree-build $ python3 ../iree/experimental/dispatch_profiler/generator.py 
iree-build $ python3 ../iree/experimental/dispatch_profiler/profiler.py --output=data.csv
julianwa commented 1 year ago

python3 ../iree/experimental/dispatch_profiler/generator.py

@manishucsd I was expecting to see multiple invocations for each desired sweep, each with its own specific arguments. Is that not what we're shooting for?

manishucsd commented 1 year ago

The command line above will generate and profile whatever is present in the dispatch profiler by default. One can filter a subset of it by providing command line arguments.
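
For example, something along these lines would restrict a run to a subset (the regex is only illustrative; the --dispatches flag is the one mentioned later in this thread):

iree-build $ python3 ../iree/experimental/dispatch_profiler/profiler.py --dispatches="matmul_splitk.*f32t.*" --output=splitk_f32.csv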

julianwa commented 1 year ago

The command line above will generate and profile whatever is present in the dispatch profiler by default. One can filter a subset of it by providing command line arguments.

I see. How long does this take to run? I expected that we would want to specify specific sweeps, etc., so that the benchmark is bounded.

qcolombet commented 1 year ago

The command line above will generate and profile whatever is present in the dispatch profiler by default. One can filter a subset of it by providing command line arguments.

I see. How long does this take to run? I expected that we would want to specify specific sweeps, etc., so that the benchmark is bounded.

If you just run the default commands (i.e., no filtering), it takes about 2 min on an A100 machine with a weak CPU (compile time included). This will go up as we add the unaligned cases. But then we can start thinking of filtering if that becomes a problem.

julianwa commented 1 year ago

Thanks for the info. I think I put my finger on why I expected this filtering to happen at benchmark generation time:

a) I'm assuming that we want multiple distinct corpuses of micro-benchmarks over which we create reports, have some level of statistics-level tracking and regression detection, and which can easily be reproduced and analyzed by engineers. We want these corpuses to remain stable over time, i.e. we don't want adding new benchmark capabilities to IREE Dispatch Profiler to alter the results of existing benchmarks.

b) We could accomplish (a) with filtering after the run... but if we do the filtering at generation time, the purpose of each benchmark is clear, the repro steps are clear, and the results are easy to grok. Filtering after the run would require an extra tool, and we wouldn't want that to be a concern of the benchmarking system itself.

Does that make sense?
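
As a loose sketch of (a)/(b), assuming the --dispatches filtering described later in this thread, a CI driver could pin each named corpus to a fixed filter; the corpus names and regexes below are invented for illustration:

```python
# Hypothetical CI driver: each named corpus is pinned to a fixed filter, so
# adding new dispatches to the profiler does not change an existing corpus.
# Corpus names and regexes are made up for illustration.
import subprocess

CORPORA = {
    "matmul_splitk_f32": "matmul_splitk_.*_f32t_f32t_f32t.*",
    "matmul_f16": "matmul_.*_f16t_f16t_f16t.*",
}

for name, regex in CORPORA.items():
    subprocess.run(
        [
            "python3",
            "../iree/experimental/dispatch_profiler/profiler.py",
            f"--dispatches={regex}",
            f"--output={name}.csv",
        ],
        check=True,
    )
```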

manishucsd commented 1 year ago

Thanks for confirming. We are working on the IREE dispatch profiler to add more features and speed up the compilation times. It does filtering when passed --dispatches=<regex>. We plan to add command line arguments to generate and profile different op shapes by passing them from the command line rather than changing the default shapes in the IREE dispatch profiler Python code.

allieculp commented 1 year ago

@julianwa This Epic looks on track, WIP. Please provide any updates or risks.

manishucsd commented 1 year ago

I have moved all the issues related to the IREE dispatch profiler to another epic, #13494. This epic strictly contains issues related to CI. cc: @mattwalsh, @allieculp

allieculp commented 1 year ago

@julianwa Please update this Epic!

allieculp commented 1 year ago

@julianwa Please update this epic.

ScottTodd commented 3 months ago

Dispatch profiler code was removed in https://github.com/iree-org/iree/commit/c2114b897a8a53cd8d0edde1024420dc64a9cbdd, so I'm going to close this issue.

We've had a few other versions of matmul benchmark suites too. Most recently https://github.com/nod-ai/rocm-gemm-benchmark/ (which has a nightly CI). Might restart some continuous benchmarking work here in iree-org soon.