iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

E2E test suite for Attention #17892

Open erman-gurses opened 1 month ago

erman-gurses commented 1 month ago

Request description

An E2E test suite for Attention that includes a reference implementation.
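
For context, the "reference implementation" here is essentially a scaled-dot-product attention computed outside of IREE that compiled results can be checked against numerically. A minimal NumPy sketch (illustrative only, not the code from the PR):

```python
import numpy as np

def reference_sdpa(q, k, v):
    """Plain NumPy scaled-dot-product attention, usable as a numerics reference.

    q, k, v: arrays of shape [batch, heads, seq, head_dim].
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum("bhqd,bhkd->bhqk", q, k) * scale
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("bhqk,bhkd->bhqd", weights, v)
```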

What component(s) does this issue relate to?

Compiler

Additional context

I raised a PR: https://github.com/iree-org/iree/pull/17751.

ScottTodd commented 1 month ago

From @raikonenfnu :

I am thinking it would be beneficial to have a pkgCI that checks numerics and performance of attention operations. Here are some of my ideas:

  1. Have a JSON + Jinja + template MLIR setup to generate the attention shapes (and, in the future, variants) we care about (a rough sketch follows after this list)
  2. Use PyTorch on CPU/GPU to generate the inputs and compute the SDPA reference results directly, instead of shuttling .npy files around
  3. Just like the current pkgCI, we would specify the compile flags right in the Python code, of course.
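
For illustration only, a rough sketch of ideas 1 and 2; the JSON fields and the toy template below are made up, while torch.nn.functional.scaled_dot_product_attention supplies the reference result:

```python
import json

import jinja2
import torch

# Hypothetical JSON entry describing one attention configuration we care about.
config = json.loads(
    '{"name": "sdxl_attn", "batch": 2, "heads": 10, "seq": 1024, "head_dim": 64}')

# Toy inline template; the real one would live in a .mlir.j2 file and emit a
# complete attention test function for the given shape.
template = jinja2.Template(
    "// {{ name }}\n"
    "// attention shape: {{ batch }}x{{ heads }}x{{ seq }}x{{ head_dim }}\n")
mlir_text = template.render(**config)

# Use PyTorch to generate the inputs and the SDPA reference result directly,
# instead of shuttling .npy files around.
q = torch.randn(config["batch"], config["heads"], config["seq"], config["head_dim"])
k = torch.randn_like(q)
v = torch.randn_like(q)
reference = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```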

I'd like to be organized about this, considering the growing number of test suites while also recognizing the importance of test coverage for 'attention' ops. I'll write up some of my own thoughts too.

ScottTodd commented 1 month ago

Existing test suites

This page documents many of the tests that we have in-tree and out-of-tree: https://iree.dev/developers/general/testing-guide/. Notably missing there are explanations for tests/e2e/matmul and tests/e2e/convolution, which the draft PR https://github.com/iree-org/iree/pull/17751 is based on. Attention is pretty similar to matmul and convolution in that it is a high level ML op with multiple implementation paths that we want coverage for across a sweep of input dimensions/values, compiler backends, and HAL drivers.

Matmul and convolution generated tests

The in-tree convolution and matmul tests work like this:

  1. Test suites are declared in a BUILD.bazel file like tests/e2e/matmul/BUILD.bazel using the iree_generated_e2e_runner_test() function, implemented in Bazel via build_tools/bazel/iree_e2e_generated_runner_test.bzl and CMake via build_tools/cmake/iree_e2e_generated_runner_test.cmake.
  2. Those suites are converted from Bazel to CMake with code in build_tools/bazel_to_cmake/bazel_to_cmake_converter.py. The conversion process runs some of the logic in the Bazel file to generate a CMakeLists.txt like tests/e2e/matmul/CMakeLists.txt.
  3. Each test case in a suite runs a Python generator script like e2e/matmul/generate_e2e_matmul_tests.py to generate a set of e.g. matmul.mlir and calls.mlir files.
  4. At build time (not test time), those .mlir files are compiled using iree-compile with the specified flags by the iree_bytecode_module() build system functions.
  5. At test time, the compiled .vmfb files are run through a test runner program like tools/testing/e2e/iree-e2e-matmul-test.cc, which compares the results of the compiled program against a reference CPU implementation written in C/C++. (A rough manual version of this flow is sketched after this list.)
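
Outside of the build system, roughly the same flow can be exercised by hand, as sketched below. The generator and test-runner flags shown are illustrative stand-ins rather than the real interfaces, so check generate_e2e_matmul_tests.py and iree-e2e-matmul-test for the actual arguments:

```python
import subprocess

# Steps 1-3: run the generator to emit the op and calls MLIR files
# (flag names here are guesses; see the generator script for the real ones).
subprocess.run(
    ["python", "tests/e2e/matmul/generate_e2e_matmul_tests.py",
     "--output_matmul_mlir_file=matmul.mlir",
     "--output_calls_mlir_file=calls.mlir"],
    check=True)

# Step 4: compile both files, as the build would do at build time.
for src in ("matmul.mlir", "calls.mlir"):
    subprocess.run(
        ["iree-compile", src, "--iree-hal-target-backends=llvm-cpu",
         "-o", src.replace(".mlir", ".vmfb")],
        check=True)

# Step 5: run the test runner, which checks against its C/C++ CPU reference
# (the exact invocation differs; this mirrors the shape of the CMake test command).
subprocess.run(
    ["iree-e2e-matmul-test", "--module=matmul.vmfb", "--module=calls.vmfb",
     "--device=local-task"],
    check=True)
```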

Pytest regression tests

https://github.com/iree-org/iree/tree/main/experimental/regression_suite

These tests have been getting more focus lately, since they let us stay in Python and pass whatever flags we want.

Pytest iree_tests (onnx + others) test suite

https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests

These tests are generated offline from existing test suites and use default flags, relying on large sets of XFAILs to track status across in-tree and out-of-tree plugins.

Brainstorming

I've had my eye on refactoring the entire tests/e2e/ folder lately, with some ideas (and my main motivations) tracked on https://github.com/iree-org/iree/issues/17868.

I think Stan is on the right track here for a solution I'd be happier with, but depending on prioritization and implementation complexity, forking the existing matmul/convolution test suites may be more expedient.

raikonenfnu commented 1 month ago

Thanks for the rundown of the state of the test suites, @ScottTodd! I think you covered the pros and cons very well. I think forking the matmul/convolution suites would work, although I'd like some slight changes to fit our current requirements for the upcoming attention tests.

  1. Better naming schemes for shape IDs (i.e., we should have shape IDs for Llama, SDXL, SD3, etc.) [link to current shapeIds on matmul]; see the sketch below
  2. Use compile-command flags to test different compilation paths, as opposed to the compilation info being set through baked-in Python heuristics [link to python heuristics]. This is important because we would like to test/compare performance between spec MLIRs, C++ pipelines, and TK kernels, which is more representative of the ways we actually compile models/IRs.
  3. Performance tests: right now we are only checking numerics, and it would be good to also start tracking perf.

Optionally, IMO some Jinja + template MLIR + DB refactoring would also help clean things up and make adding tests for variants of ops much easier, although we can do that later.
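
As a minimal sketch of point 1, the shape IDs could be a small named table keyed by the model they come from; the dimension values below are placeholders, not measured from the real models:

```python
# Hypothetical registry of attention shapes keyed by the model they come from.
# (batch, heads, seq, head_dim) values are placeholders for illustration only.
ATTENTION_SHAPE_IDS = {
    "llama": dict(batch=1, heads=32, seq=2048, head_dim=128),
    "sdxl":  dict(batch=2, heads=10, seq=4096, head_dim=64),
    "sd3":   dict(batch=2, heads=24, seq=4096, head_dim=64),
}

def shape_for(shape_id: str) -> dict:
    """Look up a named attention shape; raises KeyError for unknown IDs."""
    return ATTENTION_SHAPE_IDS[shape_id]
```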

CC: @erman-gurses

ScottTodd commented 1 month ago

RE: performance tests / benchmarks, I think an offline generation approach helps manage complexity there too:

  • A Python script generates the sweep across dimensions along with reference outputs
  • A harness (e.g. pytest) chooses which configurations to run
  • The harness can run tests, comparing outputs with the reference outputs
  • The harness can run benchmarks, measuring latency, collecting traces, etc.

The thing to look for there is where a line can be drawn between discrete steps. A build system could add sugar that combines steps (like the iree-run-tests utility target), but starting with the pieces separated lets us more easily pick a point to branch off and do something else with the artifacts.
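
A minimal pytest-flavored sketch of that separation, assuming (hypothetically) that the offline generator wrote one directory per configuration containing a config.json, a precompiled attention.vmfb, an inputs.npy, and a reference.npy; the iree-run-module flags are a best guess and should be checked against the tool:

```python
from pathlib import Path
import json
import subprocess

import numpy as np
import pytest

# One generated directory per configuration (layout is hypothetical).
CASES = sorted(Path("generated").glob("*/config.json"))

@pytest.mark.parametrize("config_path", CASES, ids=lambda p: p.parent.name)
def test_attention_numerics(config_path, tmp_path):
    case_dir = config_path.parent
    config = json.loads(config_path.read_text())

    # Run the precompiled module and dump its output to a .npy file
    # (flag spellings are assumptions; adjust to the real tool invocation).
    out_file = tmp_path / "actual.npy"
    subprocess.run(
        ["iree-run-module",
         f"--module={case_dir / 'attention.vmfb'}",
         f"--device={config['device']}",
         "--function=main",
         f"--input=@{case_dir / 'inputs.npy'}",
         f"--output=@{out_file}"],
        check=True)

    expected = np.load(case_dir / "reference.npy")
    actual = np.load(out_file)
    np.testing.assert_allclose(actual, expected, rtol=1e-2, atol=1e-2)
```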

raikonenfnu commented 1 month ago

RE: performance tests / benchmarks, I think an offline generation approach helps manage complexity there too:

  • A Python script generates the sweep across dimensions along with reference outputs
  • A harness (e.g. pytest) chooses which configurations to run
  • The harness can run tests, comparing outputs with the reference outputs
  • The harness can run benchmarks, measuring latency, collecting traces, etc.

The thing to look for there is where a line can be drawn between discrete steps. A build system could add sugar that combines steps (like the iree-run-tests utility target), but starting with the pieces separated lets us more easily pick a point to branch off and do something else with the artifacts.

That makes sense! We are quite constrained on time though, and need these tests and benchmarks up and running ASAP for the coming sprint and the various optimization work we are going to do on attention.

Based on your CI experience and expertise, do you think we can get there by EoW (i.e., landing @erman-gurses's base e2e attention test + wiring in the easy drop-in compile command flags + benchmarking performance)? If not, what's your estimate for how long it will take us to get there, and is it worth setting up a temporary pkgCI job in Python that can do this? :)

ScottTodd commented 1 month ago

For just testing, 1 week seems like a reasonable target.

I wouldn't want to retrofit benchmarking onto the iree_generated_e2e_runner_test code, at least while staying in the context of CMake or Bazel. We could write something separate that goes through and runs the files that are generated, like:

e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx2_calls.mlir
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx2_calls.vmfb
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx2_matmul.mlir
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx2_matmul.vmfb
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx512_calls.mlir
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx512_calls.vmfb
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx512_matmul.mlir
e2e_matmul_cpu_dt_f16_f16_large_llvm-cpu_local-task_avx512_matmul.vmfb

I don't know off the top of my head, though, whether those are expected to work with iree-run-module / iree-benchmark-module directly or whether they need to be run through iree-e2e-matmul-test.
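
A throwaway script along these lines could walk the build output and try benchmarking whatever the suite already produced; the build directory is a guess, and whether the *_matmul.vmfb modules are directly runnable is exactly the open question above:

```python
from pathlib import Path
import subprocess

# Wherever the generated artifacts land in the build tree (path is a guess).
BUILD_DIR = Path("build/tests/e2e/matmul")

for vmfb in sorted(BUILD_DIR.glob("*_matmul.vmfb")):
    print(f"== {vmfb.name} ==")
    # With no --function specified, iree-benchmark-module attempts the exported
    # entry points; keep going even if a module turns out not to run standalone.
    subprocess.run(
        ["iree-benchmark-module", f"--module={vmfb}", "--device=local-task"],
        check=False)
```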

Benchmarking also needs a very careful eye... these single-op tests are microbenchmarks, which IREE isn't (exclusively) designed to perform well on; IREE is a whole-program compiler. We have some dotprod / matmul microbenchmark suites (last updated in https://github.com/iree-org/iree/pull/17748; see the source at https://github.com/iree-org/iree-comparative-benchmark/blob/main/common_benchmark_suite/openxla/benchmark/comparative_suite/jax/scripts/generate_model_artifacts.py).

I'm not really sure what we should be doing for benchmarking. We currently "support" multiple generations of benchmarking infrastructure (longitudinal on perf.iree.dev, comparative with iree-comparative-benchmark, sdxl regression in pkgci, etc.) and this is different in its own ways, so we'd be looking at adding yet another system. What I'd be looking for is something simple enough that we can iterate on it easily and then delete it / promote it somewhere / etc. as we get some mileage on it and decide what we want to do with it. Files + tools in, .csv file or text logs out -> feed output into a spreadsheet, learn something, iterate.

raikonenfnu commented 1 month ago

RE: Benchmarking

Agreed, there are already so many systems. Ideally we can piggyback off one of them, haha. Do you think it's better to have a benchmarking system more closely connected to these tests, or should we just write a separate pkgCI job that does only perf checks? (Sorry for the constant comparisons to pkgCI; that's the CI I'm most familiar with :) )

ScottTodd commented 1 month ago

Pkgci is just a set of actions that

  1. build the python packages and install them
  2. run tests with the installed packages

The older ci.yml workflows rely on a tighter coupling between the build (CMake) and the tests (also CMake). The matmul and convolution test suites are covered by ci.yml as they are deeply integrated with Bazel and CMake.

So for new tests (and benchmarks), I'd much prefer for them to just operate on installed packages. That can mean either using IREE's Python APIs or just working with the iree-compile and iree-run-module tools that get put on PATH. We can put whatever code we want in those tests/benchmarks; they just can't rely on being included in IREE's core build system / workspace.
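
Concretely, a test written against installed packages only needs the console scripts on PATH. A self-contained sketch, with a trivial element-wise function standing in for the real attention test case:

```python
import subprocess
import tempfile
from pathlib import Path

# Uses only tools that the installed compiler/runtime wheels put on PATH;
# nothing here touches the CMake/Bazel build tree.
MLIR = """
func.func @mul(%a: tensor<4xf32>, %b: tensor<4xf32>) -> tensor<4xf32> {
  %0 = arith.mulf %a, %b : tensor<4xf32>
  return %0 : tensor<4xf32>
}
"""

with tempfile.TemporaryDirectory() as td:
    src = Path(td) / "mul.mlir"
    vmfb = Path(td) / "mul.vmfb"
    src.write_text(MLIR)
    subprocess.run(
        ["iree-compile", str(src), "--iree-hal-target-backends=llvm-cpu",
         "-o", str(vmfb)],
        check=True)
    subprocess.run(
        ["iree-run-module", f"--module={vmfb}", "--device=local-task",
         "--function=mul", "--input=4xf32=1 2 3 4", "--input=4xf32=2 2 2 2"],
        check=True)
```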

raikonenfnu commented 1 month ago

Doesn't pkgCI also use the built iree-compile/benchmark-module by somewhat pointing to the build path, as seen in https://github.com/iree-org/iree/blob/00daa029942a28fea716a0730085dfd06a82dd31/experimental/regression_suite/ireers/fixtures.py#L38-L60 ?

Also, do you have thoughts/ideas on the best way to build out the attention perf benchmarks? Would you recommend any one of these existing frameworks (longitudinal on perf.iree.dev, comparative with iree-comparative-benchmark, sdxl regression in pkgci, etc.)?

ScottTodd commented 1 month ago

Doesn't pkgCI also use the built iree-compile/benchmark-module by somewhat pointing to the build path

Those tools are available as console scripts after installing python packages: https://github.com/iree-org/iree/blob/3dffadbc0b8a37ab170a61eddb3131c8cbd8c2b2/compiler/setup.py#L454-L463 https://github.com/iree-org/iree/blob/3dffadbc0b8a37ab170a61eddb3131c8cbd8c2b2/runtime/setup.py#L599-L609

Pkgci is completely decoupled from the build system. Think about using packages (since that's what anyone outside of a direct project developer should be using), not about interfacing with anything in CMake/Bazel.

ScottTodd commented 1 month ago

Also, do you have thoughts/ideas on the best way to build out the attention perf benchmarks? Would you recommend any one of these existing frameworks (longitudinal on perf.iree.dev, comparative with iree-comparative-benchmark, sdxl regression in pkgci, etc.)?

Start simple. Have something write to JSON / XML that can be visualized somehow. Put historical data in a folder in a GitHub repo or a cloud document folder. Find a way to automate parts of that later.
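
For example, each benchmark run could just append one JSON record per configuration to a results file that gets dropped into a repo or shared folder (field names and values here are arbitrary):

```python
import json
import time
from pathlib import Path

def record_result(results_file: Path, shape_id: str, latency_ms: float, flags: list):
    """Append one benchmark record as a JSON line; easy to diff and to paste into a spreadsheet."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "shape_id": shape_id,
        "latency_ms": latency_ms,
        "compile_flags": flags,
    }
    with results_file.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Illustrative usage; the numbers and flags are placeholders.
record_result(Path("attention_bench.jsonl"), "sdxl_attn", 0.42,
              ["--iree-hal-target-backends=rocm"])
```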

raikonenfnu commented 1 month ago

Pkgci is completely decoupled from the build system. Think about using packages (since that's what anyone outside of a direct project developer should be using), not about interfacing with anything in CMake/Bazel.

Seems like we can also modify those underlying implementations (i.e., pkgCI's iree_compile/iree_run_module functions, so they use the Python packages installed during the build) and in parallel try setting up the attention perf benchmark with the existing "API"s. Any thoughts?

CC: @suryajasper, who I heard from Harsh is going to expand his gemm-ai benchmark into an attention-ai-benchmark :)

erman-gurses commented 1 week ago

The related PR is ready to land: https://github.com/iree-org/iree/pull/17751.