llvm / llvm-project


[mlir][arith] Add more canonicalization and integration tests coverage #92272

Closed pingshiyu closed 3 months ago

pingshiyu commented 5 months ago

I've recently been writing some fuzzers/interpreters to validate passes in MLIR for incorrect compilations.

It found a bunch of bugs in the arith folders, optimisations and lowering.

However, during the development process, I ran into a lot more bugs within my own interpreter. This resulted in a fair number of regression tests for the interpreter.

I thought it might be a good idea to add these into the MLIR codebase - these could guard against potential bugs that may be introduced in the future, for example (if they made the same mistakes as I did :p).

The tests are mostly end-to-end ones, so it seems to me they are best adapted as tests in the Integration folder. However, they could also be adapted as canonicalization tests (the caveat being that those can't exercise lowering code).

I've added a few ways these can be adapted in the files in the initial commit, which includes just a few of the unit tests.

Please do let me know if adding these would be helpful, and what the preferred format for them would be.

Once/if that's all agreed upon I can add the rest of the regression suite!

llvmbot commented 5 months ago

@llvm/pr-subscribers-mlir-arith

Author: Jacob Yu (pingshiyu)

Changes

I've recently been writing some fuzzers/interpreters to validate passes in MLIR for incorrect compilations. It found a bunch of bugs in the `arith` folders, optimisations and lowering. However, during the development process, I ran into a lot more bugs within my own interpreter. This resulted in a fair number of regression tests for the interpreter. I thought it might be a good idea to add these into the MLIR codebase - these could guard against potential bugs that may be introduced in the future, for example (if they made the same mistakes as I did :p). The tests are mostly end-to-end ones, so it seems to me they are best adapted as tests in the `Integration` folder. However, they could also be adapted as canonicalization tests (the caveat being that those can't exercise lowering code). I've added a few ways these can be adapted in the files in the initial commit, which includes just a few of the unit tests. Please do let me know if adding these would be helpful, and what the preferred format for them would be. Once/if that's all agreed upon I can add the rest of the regression suite!

---

Full diff: https://github.com/llvm/llvm-project/pull/92272.diff

2 Files Affected:

- (modified) mlir/test/Dialect/Arith/canonicalize.mlir (+30)
- (added) mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir (+52)

``````````diff
diff --git a/mlir/test/Dialect/Arith/canonicalize.mlir b/mlir/test/Dialect/Arith/canonicalize.mlir
index e4f95bb0545a2..aa79ad1bead56 100644
--- a/mlir/test/Dialect/Arith/canonicalize.mlir
+++ b/mlir/test/Dialect/Arith/canonicalize.mlir
@@ -3022,6 +3022,36 @@ func.func @minsi_i0() -> i0 {
   return %minsi : i0
 }
 
+// CHECK-LABEL: @extsiOnI1
+// CHECK: %[[TRUE:.*]] = arith.constant true
+// CHECK: %[[CST:.*]] = arith.constant -1 : i16
+// CHECK: return %[[TRUE]], %[[CST]]
+func.func @extsiOnI1() -> (i1, i16) {
+  %true = arith.constant -1 : i1
+  %0 = arith.extsi %true : i1 to i16
+  return %true, %0 : i1, i16
+}
+
+// CHECK-LABEL: @extuiOn1I1
+// CHECK: %[[TRUE:.*]] = arith.constant true
+// CHECK: %[[CST:.*]] = arith.constant 1 : i64
+// CHECK: return %[[TRUE]], %[[CST]]
+func.func @extuiOn1I1() -> (i1, i64) {
+  %true = arith.constant true
+  %0 = arith.extui %true : i1 to i64
+  return %true, %0 : i1, i64
+}
+
+// CHECK-LABEL: @trunciI16ToI8
+// CHECK: %[[CST:.*]] = arith.constant 20194 : i16
+// CHECK: %[[CST2:.*]] = arith.constant -30 : i8
+// CHECK: return %[[CST]], %[[CST2]]
+func.func @trunciI16ToI8() -> (i16, i8) {
+  %c20194_i16 = arith.constant 20194 : i16
+  %0 = arith.trunci %c20194_i16 : i16 to i8
+  return %c20194_i16, %0 : i16, i8
+}
+
 // CHECK-LABEL: @mulsi_extended_i0
 // CHECK: %[[ZERO:.*]] = arith.constant 0 : i0
 // CHECK: return %[[ZERO]], %[[ZERO]] : i0
diff --git a/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir b/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir
new file mode 100644
index 0000000000000..cd1cad4d56da9
--- /dev/null
+++ b/mlir/test/Integration/Dialect/Arith/CPU/test-semantics.mlir
@@ -0,0 +1,52 @@
+// Regression test suite from an independent development of Arith's
+// semantics and a fuzzer
+
+// RUN: mlir-opt %s --convert-scf-to-cf --convert-cf-to-llvm --convert-vector-to-llvm \
+// RUN: --convert-func-to-llvm --convert-arith-to-llvm | \
+// RUN: mlir-cpu-runner -e entry -entry-point-result=void \
+// RUN: --shared-libs=%mlir_c_runner_utils | \
+// RUN: FileCheck %s --match-full-lines
+
+module {
+  func.func @extsiOnI1() {
+    %true = arith.constant -1 : i1
+    %0 = arith.extsi %true : i1 to i16
+    vector.print %true : i1
+    vector.print %0 : i16
+    return
+  }
+
+  func.func @extuiOn1I1() {
+    %true = arith.constant true
+    %0 = arith.extui %true : i1 to i64
+    vector.print %true : i1
+    vector.print %0 : i64
+    return
+  }
+
+  func.func @trunciI16ToI8() {
+    %c20194_i16 = arith.constant 20194 : i16
+    %0 = arith.trunci %c20194_i16 : i16 to i8
+    vector.print %c20194_i16 : i16
+    vector.print %0 : i8
+    return
+  }
+
+  func.func @entry() {
+
+    // CHECK: 1
+    // CHECK: -1
+    func.call @extsiOnI1() : () -> ()
+
+    // CHECK: 1
+    // CHECK: 1
+    func.call @extuiOn1I1() : () -> ()
+
+    // CHECK: 20194
+    // CHECK: -30
+    func.call @trunciI16ToI8() : () -> ()
+
+    return
+  }
+}
\ No newline at end of file
``````````
llvmbot commented 5 months ago

@llvm/pr-subscribers-mlir

pingshiyu commented 5 months ago

@kuhar @Lewuathe @joker-eph @banach-space @math-fehr

pingshiyu commented 5 months ago

Nice! Quick question, what kind of bugs did you find in the canonicalization patterns and the lowerings? Do you have issues/PR for them already?

Also, is your fuzzer/semantics open-source somewhere? It would be interesting for me to see them, to compare them with the one I currently have for arith (https://github.com/opencompl/xdsl-smt/blob/main/xdsl_smt/semantics/arith_semantics.py)

Yep! The bugs found by this method are here (some are as of now unfixed! :o):

* https://github.com/llvm/llvm-project/issues/90296 (!!)
* https://github.com/llvm/llvm-project/issues/90238 (!!)
* https://github.com/llvm/llvm-project/issues/82788 (!!)
* https://github.com/llvm/llvm-project/issues/84986 (!!)
* https://github.com/llvm/llvm-project/issues/83079
* https://github.com/llvm/llvm-project/issues/88732
* https://github.com/llvm/llvm-project/issues/89382

The fuzzer/semantics are unfortunately not open source at the moment, but I do plan on making it so in the near future! There's a bit more I plan to work on before it's public :)

pingshiyu commented 5 months ago

I want to ask: what is the preferred format for these tests?

Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so I'll add the rest of the tests into the PR too.

joker-eph commented 5 months ago

I want to ask: what is the preferred format for these tests?

Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so I'll add the rest of the tests into the PR too.

Historically speaking, we have not relied too much on integration tests. I'm not against them on principle though, as long as they are tiny (fast to execute) and easy to maintain.

One important aspect though is that our test coverage does not rely on the integration tests: all of our canonicalizations should be unit-tested.

banach-space commented 5 months ago

Thanks for sharing these, @pingshiyu !

I'm always in favour of improving our test coverage, but if we intend to add much more then it would be good to have some plan.

Another question that I'd ask - how far do we go with testing? Do we want to exercise every possible combination? For example, looking at "test-int-trunc-ext.mlir" that you added, there are the following 3 cases:

* `@extsiOnI1`
* `@extuiOn1I1`
* `@trunciI16ToI8`

Why not more, i.e. other data types and operations? Once we consider all the possibilities, the potential set of all tests becomes really large. This can be a maintenance burden. Are we worried?

Also, if we are to add many more tests, it would be good to establish some structure so that it's easy to identify cases that are covered and to find cases that are missing. Otherwise, we tend to duplicate tests, I've seen examples in the Vector dialect, not sure about Arith. This can usually be improved with comments and consistent test function naming.

Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so I'll add the rest of the tests into the PR too.

I find integration tests very helpful, but they tend to be expensive and hence hidden behind a CMake flag. Improving the coverage for Arith would be a welcome addition, but it would be good to have a strategy for selecting good test cases. To me, this is a very good reference:

* https://github.com/llvm/llvm-project/blob/main/mlir/test/Integration/Dialect/Complex/CPU/correctness.mlir

IIUC, there's a case for every operation and every Op is verified with a set of inputs. I know that @Lewuathe has recently contributed there - would you have any recommendation for good cases to test?

pingshiyu commented 5 months ago

Thanks for sharing these, @pingshiyu !

I'm always in favour of improving our test coverage, but if we intend to add much more then it would be good to have some plan.

Another question that I'd ask - how far do we go with testing? Do we want to exercise every possible combination? For example, looking at "test-int-trunc-ext.mlir" that you added, there are the following 3 cases:

* `@extsiOnI1`

* `@extuiOn1I1`

* `@trunciI16ToI8`

Why not more, i.e. other data types and operations? Once we consider all the possibilities, the potential set of all tests becomes really large. This can be a maintenance burden. Are we worried?

Also, if we are to add many more tests, it would be good to establish some structure so that it's easy to identify cases that are covered and to find cases that are missing. Otherwise, we tend to duplicate tests, I've seen examples in the Vector dialect, not sure about Arith. This can usually be improved with comments and consistent test function naming.

Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so I'll add the rest of the tests into the PR too.

I find integration tests very helpful, but they tend to be expensive and hence hidden behind a CMake flag. Improving the coverage for Arith would be a welcome addition, but it would be good to have a strategy for selecting good test cases. To me, this is a very good reference:

* https://github.com/llvm/llvm-project/blob/main/mlir/test/Integration/Dialect/Complex/CPU/correctness.mlir

IIUC, there's a case for every operation and every Op is verified with a set of inputs. I know that @Lewuathe has recently contributed there - would you have any recommendation for good cases to test?

Definitely a good point on having the new tests be organised in a way that it's easy to identify gaps!

The current tests don't cover all cases, although I think it wouldn't be realistic to do so either - that would be a combinatorial explosion and is perhaps better explored using property-based testing :) For this PR, my aim is to just add all the tests that I have accrued from the interpreter exercise, and most of these were once a bug within the interpreter

In the most recent commit I've added the remainder of the tests as integration tests. I avoided the canonicalization.mlir tests, because like you pointed out, it is already rather large, unwieldy to go through, and there might well be duplicates.

I've assigned the tests a rough category for what they're testing, according to the file names added :)

kuhar commented 5 months ago

@banach-space

Another question that I'd ask - how far do we go with testing? Do we want to exercise every possible combination? For example, looking at "test-int-trunc-ext.mlir" that you added, there are the following 3 cases:

  • @extsiOnI1
  • @extuiOn1I1
  • @trunciI16ToI8

Why not more, i.e. other data types and operations? Once we consider all the possibilities, the potential set of all tests becomes really large. This can be a maintenance burden. Are we worried?

Also, if we are to add many more tests, it would be good to establish some structure so that it's easy to identify cases that are covered and to find cases that are missing. Otherwise, we tend to duplicate tests, I've seen examples in the Vector dialect, not sure about Arith. This can usually be improved with comments and consistent test function naming.

Are we happy with the regression test format right now (i.e. either as canonicalization tests, or as integration tests)? If so I'll add the rest of the tests into the PR too.

Some good points here! IMO this specific PR falls under the broader umbrella of regression tests -- I think there's a lot of value in handling the corner cases we did not handle properly in the past. We should definitely have LIT test coverage there. In llvm it used to be much more common to check in tests taken verbatim from bugzilla in individual regression test files. I probably wouldn't go as far, and having the regression tests organized in some logical buckets makes sense to me. Similarly, exhaustively testing every permutation of types / operations is unlikely to add much value on its own, but I wouldn't mind adding more tests if we do discover that our coverage is insufficient and start seeing bugs being reported.

As for the integration tests, I think MLIR is really undertested. There's no in-tree frontend compiler (except for flang?) to truly exercise the conversions on real hardware, and I think there's also value in having integration tests for conversions that may be tricky to get right (e.g., wide integer emulation) or easy to miscompile by the backend compiler. I think these extension tests fit in nicely here.

banach-space commented 5 months ago

Thank you all for replying to my comments!

1. Wider discussion on testing

In llvm it used to be much more common to check in tests taken verbatim from bugzilla in individual regression test files.

I would prefer to avoid that. Instead, for every bug I'd rather identify the actual edge case that's:

In general, I don't quite follow the split into "regression" and regular tests (I also feel that folks interpret the split differently). From https://llvm.org/docs/TestingGuide.html#regression-tests

The regression tests are small pieces of code that test a specific feature of LLVM or trigger a specific bug in LLVM.

To me that sounds like a test for a specific (edge) case that "could go wrong". For completeness, that same document defines "unit test" as:

Unit tests are written using Google Test and Google Mock and are located in the llvm/unittests directory.

But let's not diverge too much - we all agree that we need more testing and patches like this one 😅

Btw ...

There's no in-tree frontend compiler (except for flang?) to truly exercise the conversions on real hardware

We do have e2e tests for Linalg (and Vector and SparseTensor) that target SVE and SME. These tests are run by LLVM buildbots (there's more tests for other targets too, but I don't track that as closely). Glad to hear that people find these valuable :)

2. This patch

Test files

Now, looking at the files added in this PR:

I see two conflicting splits:

  1. one test file per operation (e.g. "test-shifts.mlir" and "test-int-trunc-ext.mlir"), vs
  2. one test file per data type (e.g. "test-i1-operations.mlir" and "test-indices.mlir").

For consistency, could you select one and stick with that? My personal preference would be one file per operation (or group of similar operations) and then exercise "most interesting" data types in every file (index, i1, i8 and f32)?

Also, IMHO you can safely skip "test" in file names that are located in ... a "test" sub-directory (to reduce noise). That's a nit ;-)

Review process - multiple PRs

Given that you are testing a lot of edge cases, I'd like to go over every one of them and make sure they agree with my expectations. I am raising this especially in the context of the following PR in which we discovered that our folding logic for arith::CeilDivSIOp "overflows" in a case where the Op itself shouldn't:

As in, I think it's good to review every case for such tests. I'm happy to do that, but I'd require this PR to be split into smaller PRs (one per Op?) - this one has grown quite quickly (btw, absolutely no harm in having multiple small PRs). I would definitely extract the canonicalization tests into a separate PR.

Importantly, if other reviewers are happy with the content then I'm not going to block it. You already have +1 from @kuhar who is an expert in this area.

Cases to test

I see three major cases being exercised in this PR (and in general in arith):

  1. "sane" inputs (e.g. 10 + 10 = 20)
  2. "overflow" inputs (e.g. -MIN_INT)
  3. "poison/non-poison" inputs
  4. ... you could also add nan

Why not split them along these axes (i.e. check 1., 2. and 3.) and either:

* write a function for every op and pass different arguments depending on the case being checked (e.g. "sane" vs "overflow" vs "poison"), or
* write a separate function for 1., 2. and 3.

Btw,

The current tests don't cover all cases, although I think it wouldn't be realistic to do so either - that would be a combinatorial explosion and is perhaps better explored using property-based testing :)

Agreed. I'm only suggesting to identify the key options/cases and to make sure every test makes it clear what case is being exercised. In my view the coverage proposed here is a good start. But we should prepare these tests for folks who'd like to add more tests at some later time (and to make it easy for them).

Running integration tests on different architectures

In your integration tests you don't specify the architecture on which to run the tests. In an ideal world, these should work regardless of the CPU target. Sadly, this is not an ideal world :) A while back, @Lewuathe found a rather unpleasant corner case highlighting a very fine difference in how nan values are treated:

I doubt that that ^^^ would impact the tests in this PR, but just to clarify, given that no triple is specified - these tests are intended to be target agnostic, right? That's worth clarifying somewhere.

Final note

As I mentioned above, I won't block this PR. I wanted to share my ideas on the possible approach (I've approached this PR as an RFC). If you agree with my suggestions and decide to follow them, I'm happy to help with reviews. Most importantly, thank you for raising this very important topic, @pingshiyu !

Btw, I am OOO next week, so apologies if there's more delay from my side.

pingshiyu commented 4 months ago

@banach-space @kuhar Sorry for dropping the ball on this one these last few weeks! I have been dealing with some personal issues, followed by a very busy period at work, but now I've found time to come back to this and get it merged! :)

Since I dropped the ball for so long I cannot ask for any urgency :) Please review when convenient for you!

Review process - multiple PRs

I think multiple PRs for this is a good idea since there are quite a few tests. I'll be splitting this one up into smaller ones which will soon follow. To avoid spamming you all with emails, I'll do 3 PRs initially.

Why not split them along these axis (i.e. check 1., 2. and 3.) and either: write a function for every op and pass different arguments depending on the case being checked (e.g. "sane" vs "overflow" vs "poison"), or write a separate function for 1., 2. and 3.

I don't understand what this means fully - would you be able to give an example to clarify a bit further? I understand this suggestion as a refactoring of test code, although I don't see a common pattern that would work for each class of tests.

kuhar commented 4 months ago

Thanks for splitting this up into smaller PRs -- I think it makes it easier to review.

banach-space commented 4 months ago

Sorry for dropping the ball on this one these last few weeks!

Not at all - life gets busy 😅 Thanks for getting back to this!

I don't understand what this means fully - would you be able to give an example to clarify a bit further?

Sure. I was trying to suggest a strategy for selecting edge cases to test. Once a strategy is in place, it is easy to be consistent. And consistency makes things easy to review and maintain.

As a specific example, for arith.addi, I would test 2 edge cases:

  1. "Normal values" (e.g. 10 + 20) - this should "just work" ™️ (technically speaking, this would be more like a sanity check rather than an edge case)
  2. Values that would lead to overflows (e.g. INT_MAX + 1, INT_MIN - 1)

This could be tested as follows (pseudo MLIR):

func.func @add_i32(%arg_0: i32, %arg_1: i32) -> i32 {
  %res = arith.addi %arg_0, %arg_1 : i32
  return %res : i32
}

%res_i32_30 = func.call @add_i32(10, 20)
// CHECK: 30
vector.print %res_i32_30

%res_i32_min = func.call @add_i32(INT_MAX, 1)
// CHECK: -2147483648
vector.print %res_i32_min
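
A concrete, self-contained version of that sketch might look like the following (illustrative only: the function name and constants are arbitrary, and it assumes the same mlir-opt / mlir-cpu-runner / FileCheck pipeline used by the integration tests in this PR):

```mlir
// Sketch: one helper per op, exercised with both a "sane" and an "overflow" input.
func.func @add_i32(%arg0: i32, %arg1: i32) -> i32 {
  %res = arith.addi %arg0, %arg1 : i32
  return %res : i32
}

func.func @entry() {
  // "Sane" inputs: 10 + 20.
  %c10 = arith.constant 10 : i32
  %c20 = arith.constant 20 : i32
  %sum = func.call @add_i32(%c10, %c20) : (i32, i32) -> i32
  // CHECK: 30
  vector.print %sum : i32

  // Overflow: INT_MAX + 1 wraps around to INT_MIN under two's complement.
  %cmax = arith.constant 2147483647 : i32
  %c1 = arith.constant 1 : i32
  %wrap = func.call @add_i32(%cmax, %c1) : (i32, i32) -> i32
  // CHECK: -2147483648
  vector.print %wrap : i32
  return
}
```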

If you were to test FP operations, I'd add another edge case:

  1. Operations on nan and inf values.

Now, I am not asking you to cover all cases. However, please try to make it easy to:

kuhar commented 4 months ago

+1, this organization makes complete sense to me @banach-space.

pingshiyu commented 3 months ago

I've just pushed 4 new PRs for the next batch of the regressions - these should be the last! They're now in the same format we established over the last few weeks.

@kuhar @banach-space please take a look at those when you have a chance :) (as well as https://github.com/llvm/llvm-project/pull/96974)

pingshiyu commented 3 months ago

@banach-space @kuhar bump - wondering if you could have another look at the new PRs when you have a chance, please?

* https://github.com/llvm/llvm-project/pull/98181
* https://github.com/llvm/llvm-project/pull/98182
* https://github.com/llvm/llvm-project/pull/98183
* https://github.com/llvm/llvm-project/pull/98184
* https://github.com/llvm/llvm-project/pull/96974

banach-space commented 3 months ago

@banach-space @kuhar bump - wondering if you could have another look at the new PRs when you have a chance, please? #98181 #98182 #98183 #98184 #96974

Sorry about the delay. I'm busy today, but otherwise should be able to prioritise these in the coming days. In the meantime, could you resolve the Git conflicts? Thanks!

pingshiyu commented 3 months ago

Thank you very much @banach-space ! The conflicts all appear to be on this particular PR - which I will not merge, and will close after the other 5 have been merged. This current PR contains all (~33) of the regressions - whereas the other 5 are smaller, broken-down PRs based on the format we established in the previous iteration.

banach-space commented 3 months ago

Thank you very much @banach-space ! The conflicts all appear to be on this particular PR - which I will not merge, and will close after the other 5 have been merged. This current PR contains all (~33) of the regressions - whereas the other 5 are smaller, broken-down PRs based on the format we established in the previous iteration.

Could I suggest closing this PR and creating an issue instead to track this work? This would help me filter. Thanks!

joker-eph commented 2 months ago

I commented on some individual PRs, but since this is identified as the "umbrella" for these I'll just repeat it here: I'd like to exercise caution with the number of e2e tests. I'm wary and skeptical of developing too much of this, out of concern for the maintenance and development health of the project: I'm afraid this is working around a lack of unit-testing right now, or just testing trivial things that are already tested otherwise. The combined runtime of adding more and more of these over time does not seem negligible either.

banach-space commented 2 months ago

Hi @joker-eph , thanks for the feedback. These are very good points and I agree - we should prioritise unit testing over e2e tests where possible.

I don't really have enough architectural overview of Arith to say whether these particular tests fill a gap in our testing vs "working around lack of unit-testing right now". And I admit, when reviewing, I focus on the implementation details rather than the bigger picture. Input from folks with more experience with Arith would be greatly appreciated.

My overall steer for these e2e tests has been to keep them minimal and clearly structured so that we know exactly what is being tested. Once complete, we should probably audit them and make sure that similar functionality isn't already tested elsewhere (i.e. remove duplication). That shouldn't be too difficult given how cleanly @pingshiyu has organised them.

pingshiyu commented 2 months ago

@joker-eph @banach-space

Thank you for the comments!

On motivation: some of the existing tests do indeed test existing/historical bugs in MLIR, and others are motivated by bugs found in my independent implementation of MLIR semantics (which would be useful safeguards since those are bugs that may be tempting to introduce). From these, some extra tests were added for a more complete test suite (where they were very natural to add, e.g. the true case if an existing one tests the false case), and there's not an overwhelming number of them. I'll make sure to add more details in the comments about these where appropriate.

On integration vs. unit testing: there may indeed be some duplication here - so, like @banach-space suggested, we should also audit these against existing unit tests once they are merged in! However, several of the bugs the tests cover are from lowering steps down to LLVM. To test the produced LLVM code, to the best of my knowledge, integration tests seem to be the way (comparing LLVM code directly, rather than executing them, would be a bit fragile imo), but I would be happy to be corrected by someone who has deeper knowledge about the testing infrastructure.

As for runtime costs of these tests, I did some tests earlier today (personal laptop, default options, all integration tests), and altogether, the new tests add ~1 second to the total running time (263.67s with all new tests, 262.87s without, average taken over 3 runs).

joker-eph commented 2 months ago

However, several of the bugs the tests cover are from lowering steps down to LLVM. To test the produced LLVM code, to the best of my knowledge, integration tests seem to be the way (comparing LLVM code directly, rather than executing them, would be a bit fragile imo)

I think this is where we differ: the main way of testing LLVM (and MLIR) is with unit-tests checking the lowering through IR matching, not integration tests. These are more exceptional.

pingshiyu commented 2 months ago

@joker-eph for example, one that I was thinking of was this one: https://github.com/llvm/llvm-project/pull/83248

The bug came from an incorrect lowering from Arith down to LLVM. To prepare a test for this bug, we'd ultimately like a test against the functional "spec" of MLIR, capturing the desired property: that the lowering produces correct IR, implementing the expected behaviour of the operation.

Of course, by spec, I don't mean a full formal spec, which doesn't exist, but rather a small aspect of it: here, it'd be the division behaving correctly on inputs that are at the boundary of integer representation.

Whilst we could validate that the fold does indeed produce the desired IR, it doesn't seem to me that comparing the produced vs. expected IR is a super clear oracle (i.e. the expected IR being the algorithm implementing the fold using a lower-level IR). In this case, I think it is clearer to test the implemented algorithm, so we can check against the spec directly, e.g. by executing it, whether via a constant folder or via the llvm interpreter - if we are at the LLVM level. Moreover, there may be many valid lowerings for the same op, and such an IR-comparison oracle would also lead to more overhead during development.

Overall I think lowerings are where integration tests could provide quite a bit of value!

joker-eph commented 2 months ago

Since you're reasoning in terms of a "spec": we indeed think in terms of what the spec for floordivsi should be, and then we reason in terms of what the spec of the target of the lowering is. For example, for expand ops it is the spec of the other arith ops present here. Any kind of reasoning should be able to be done there: if we lower to LLVM, we reason in terms of the spec of the abstract LLVM machine and check that we produce the right sequence of instructions as expected.

In this case, I think it is clearer to test the implemented algorithm, so we can check against the spec directly, e.g. by executing it,

You're not testing the implementation or the spec, just a particular behavior (the test may, for example, be UB but get lucky), on some specific host and version of LLVM, against a single input. From this point of view, this provides less coverage than an IR check.

More importantly, this is also a question of coupling of the system, of overall maintenance/cost, and ultimately of testing philosophy (on which LLVM has historically been fairly clear in its preference for IR lit-testing: you won't find any execution unit testing like this in LLVM outside of MLIR).

pingshiyu commented 2 months ago

@joker-eph Perhaps spec is the wrong word I used since we don't have full specs to begin with :) Maybe a better word is "expected behaviour"

Comparing IRs directly does test for more "things", in the sense that the implementation is "equal" to some expected implementation. However, in the case linked above, I don't think it is checking against the "thing" we're concerned with.

An analogy that comes to mind, especially for the lowering bug mentioned, is akin to verifying a travelling salesman implementation by checking that the C code matches some expected implementation. Even then, we wouldn't be certain that the expected implementation is correct. A complementary approach is to run the C code on some small examples (ideally using a platform-independent tool that correctly implements C semantics, like CompCert). This approach also results in more comprehensible tests.

Similarly, comparing IRs doesn’t automatically instil confidence that the top-level IR exhibits the same expected behaviour as the lower-level IR. For instance, it's challenging (at least for me) to look at LLVM bytecode and determine with certainty that it will perform as intended for all inputs.

It's interesting to compare against LLVM too, since I think this discussion would only come about in MLIR due to the extensive lowerings.

joker-eph commented 2 months ago

Even then, we wouldn't be certain that the expected implementation is correct.

Right, but that's "separation of concerns". It's all what unit-testing is about: we check for a component boundaries based on the expectation at these boundaries.

A complementary approach is to run the C code on some small examples

Right, I understand, but as I mentioned before: we just don't do that historically in how LLVM compiler tests are set up.

It's interesting to compare against LLVM too, since I think this discussion would only come about in MLIR due to the extensive lowerings.

I personally don't see a difference between the expand-math in MLIR and the same thing on LLVM IR though.

banach-space commented 2 months ago

I think this is where we differ: the main way of testing LLVM (and MLIR) is with unit-tests checking the lowering through IR matching, not integration tests. These are more exceptional.

I agree with Mehdi.

We (Arm folks working on SVE and SME support) tend to use e2e tests for rather complex examples (e.g. linalg.matmul) - but even those are mostly used as smoke tests. e2e tests are super helpful when evaluating complex compilation pipelines, but that's "integration" testing and should not be treated as a replacement for "unit" testing. That's how I look at the tests being added here.

On motivation: some of the existing tests do indeed test existing/historical bugs in MLIR

From my experience, such tests/bugs are symptoms of insufficient unit test coverage. But rather than using reproducers as is (these tend to be relatively complex), we should try to identify the corresponding edge case that has been missed and test that explicitly (it's much easier to reason about a test once the underlying edge case is clear).

As for runtime costs of these tests, I did some tests earlier today (personal laptop, default options, all integration tests), and altogether, the new tests add ~1 second to the total running time (263.67s with all new tests, 262.87s without, average taken over 3 runs).

This is re-assuring, thanks for checking. Still, we should make sure that we don't encourage folks to add too many new cases. My initial steer was to test all data types, but now I realise that we should avoid that.

For instance, it's challenging (at least for me) to look at LLVM bytecode and determine with certainty that it will perform as intended for all inputs.

Please note that we will never run tests for all possible inputs ;-)

Just to summarise, I do find these new e2e tests helpful, but let's limit the scope and focus on a small set of interesting cases. @pingshiyu, what other Ops do you intend to test? Once you finish, we could consolidate all tests into one file. We can use correctness.mlir as an indication (roughly) of what the scope should be.

pingshiyu commented 2 months ago

@joker-eph

Right, but that's "separation of concerns". It's all what unit-testing is about: we check for a component boundaries based on the expectation at these boundaries.

I do agree that the scope of tests should be limited, but I think where we differ is what the "expectation" should be. Take the example of the lowering for instance: we want the test to give us confidence that the translated code is "correct". https://github.com/llvm/llvm-project/pull/83248/files#diff-71577b98cdb92d5eb7583c3c3d5b9488f017f8f56dc6d05b7b940848f6e0bfc9 However, checking that "the translated code is equal to what we expect" does not give us confidence in correctness here - it is only checking the rewrite pattern has done its job. It is not checking that the lowering is correct, which is desirable too for the pass. I don't think the linked unit-test, by itself, gives us much confidence that the bug has even been fixed: we'd need to run the code (mentally, or via interpreter) on the problem input described in the bug report to be sure.

I believe it's clearly desirable to test the functionality of the lowering, though. And I don't see how this can be done without something to the effect of an integration test.

@banach-space the ones referenced in https://github.com/llvm/llvm-project/issues/100121 would be all the ops I intend to test in this work!

joker-eph commented 2 months ago

However, checking that "the translated code is equal to what we expect" does not give us confidence in correctness here - it is only checking the rewrite pattern has done its job.

I don't quite follow the difference you're trying to make.

I believe it's clearly desirable to test the functionality of the lowering, though. And I don't see how this can be done without something to the effect of an integration test.

I feel we're running in circles: we're checking this because the IR you generate has a spec, and you manually check the equivalence by inspecting the IR ("running mentally" is what you wrote). Again, virtually the entirety of transformations in LLVM have been developed that way ~ forever; we could change our testing approach of course, but IMO that's an RFC for LLVM Discourse.

In the meantime, tests like this are out-of-scope of what I expect to see here.

pingshiyu commented 2 months ago

@joker-eph

I don't quite follow the difference you're trying to make.

We want unit tests to protect against previous bugs, and also to serve as documentation of the expected behaviour.

For MLIR's lowering passes, bugs aren't caused by the lowering rewrite not being applied properly, but by the rewrite pattern itself being incorrect.

A unit test that applies a pattern and checks the resulting IR doesn't check that the pattern itself is correct. This is what an integration/execution test would do.

Going back to the example of floordiv, the unit test goes something like:

// CHECK:   %[[QUOTIENT:.*]] = arith.divsi %arg0, %arg1 : i32
// CHECK:   %[[PRODUCT:.*]] = arith.muli %[[QUOTIENT]], %arg1 : i32
// ... etc

i.e. it just describes the pattern itself - and this is, really, the only test that can be written for the lowering pattern.

If this is all the testing we have for the lowering, then the issues I see are:

An integration test would resolve the above issues for lowering passes.
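
For illustration, such a test for the floordivsi example could be as small as the following (a sketch only; the function name and inputs are made up, and it assumes the same runner pipeline as the other integration tests in this PR):

```mlir
// Checks the behaviour the lowering must preserve (rounding toward negative
// infinity), rather than the shape of the IR it emits.
func.func @floordivsi_check() {
  %lhs = arith.constant -7 : i32
  %rhs = arith.constant 2 : i32
  %fdiv = arith.floordivsi %lhs, %rhs : i32
  %tdiv = arith.divsi %lhs, %rhs : i32
  // floordivsi rounds down: -7 floordiv 2 == -4 ...
  // CHECK: -4
  vector.print %fdiv : i32
  // ... while divsi truncates toward zero: -7 div 2 == -3.
  // CHECK: -3
  vector.print %tdiv : i32
  return
}
```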

I'm not proposing to change how LLVM transformations are developed. But for MLIR's lowering passes, there's a real need for integration tests like this - not to replace how unit testing is done for all transformations.

There may be a way to do what I've described above as a unit test, in which case please let me know! Also, let me know if I wasn't clear anywhere :)

joker-eph commented 2 months ago

I'm not proposing to change how LLVM transformations are developed.

Not developed, just tested.

But for MLIR's lowering passes,

I don't see any difference right now between the example at hand here for MLIR and most transformations in LLVM.

pingshiyu commented 2 months ago

@joker-eph

I don't see any difference right now between the example at hand here for MLIR and most transformations in LLVM.

The difference is that for the majority of cases, there is only one unit test that can be written for the lowering pattern: basically the test is the lowering pattern itself (i.e. the example shown). This is a problem for lowering passes only and not for other transformations.

Because of this, it results in insufficient testing for the lowering passes imo.

joker-eph commented 2 months ago

This is a problem for lowering passes only and not for other transformations.

Why isn't it the case for a canonicalization pattern?

Also what is different between LLVM instruction selection and a MLIR lowering pattern?

pingshiyu commented 2 months ago

Why isn't it the case for a canonicalization pattern?

You're right, unit tests don't test the correctness of any patterns - but integration tests would.

Pattern correctness can also be a problem for canonicalization patterns; I remember quite a few bugs due to incorrect canonicalizers.

But lowerings are easier to test in an integration test - it's easy to control when they're applied and they're applied in isolation.

OTOH multiple canonicalization patterns may apply and they are also applied greedily, so IMO it's harder to isolate them and write integration tests for them (correct me if I'm wrong!).

Also what is different between LLVM instruction selection and a MLIR lowering pattern?

I'm not very familiar with how they're tested so I cannot compare the two, sorry!

banach-space commented 2 months ago

In the meantime, tests like this are out-of-scope of what I expect to see here.

Thanks for this feedback, Mehdi!

I feel that we are missing some guidelines on what the expected scope of integration tests should be. My high-level suggestion (based on this discussion) would be that integration tests should be reserved for cases much more complex than e.g. (extracted from comparison.mlir that Mehdi referred to upthread):

Admittedly, that's not particularly involved. However, examples from expand-ops.mlir (including arith.floordivsi that @pingshiyu mentioned) seem to be more complex and perhaps we should port those as e2e tests instead?

I am happy to work on a proposal, but will be OoO next two weeks (perhaps somebody else is free to volunteer?). In the meantime, we should probably pause this effort.

A unit test that applies a pattern and checks the resulting IR doesn't check that the pattern itself is correct.

I don't follow this comment. Unit tests are there specifically to verify that the corresponding pattern is correct. In addition, every pattern is verified through code-review.

checking that "the translated code is equal to what we expect" does not give us confidence in correctness here

IMHO, in the example that you referred to, one "magic" formula for expanding floordivsi was replaced with another. IIUC, the only thing that was incorrect was the original formula. Perhaps the real issue here is that we are missing some formal spec for these things?
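
For reference, one standard identity that any such expansion has to realise (stated here independently of the exact formula used in the patch; tdiv and mod denote C-style truncating division and remainder, with b non-zero):

$$
\left\lfloor \tfrac{a}{b} \right\rfloor =
\begin{cases}
\operatorname{tdiv}(a, b) - 1 & \text{if } a \bmod b \neq 0 \ \text{and}\ \operatorname{sign}(a) \neq \operatorname{sign}(b),\\
\operatorname{tdiv}(a, b) & \text{otherwise.}
\end{cases}
$$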

An integration test would resolve the above issues for lowering passes.

I am not convinced. I can see how an integration test would prevent us from hitting #83079, but we will never have an e2e for every possible scenario. That's just not feasible and against the grain of MLIR and LLVM. As for #83079, IMHO, we were (and still are) missing a formal spec for expanding floordivsi. An e2e would/will help, yes, but should be considered as complementary to a spec rather than a replacement.

While it sounds like we probably should revert some of the newly added e2e tests, I feel that overall we are making important progress in terms of test coverage for Arith. Thank you @pingshiyu !

pingshiyu commented 2 months ago

@banach-space thanks for chiming in!

I don't follow this comment. Unit tests are there specifically to verify that the corresponding pattern is correct. In addition, every pattern is verified through code-review.

As far as I can see, unit tests don't verify that a pattern is correct: I mean correct in the sense that for a pattern source -> target, source and target have the same semantics, in all contexts. Unit tests only check that the pattern is applied. Integration tests can act as tests for the pattern's correctness though (of course they're not a proof of the pattern's correctness, but they serve as unit tests for patterns).

An e2e would/will help, yes, but should be considered as complementary to a spec rather than a replacement.

Oh yes absolutely! It was never the intention to replace formal semantics with e2e tests :) While working on this PR and through the conversation in this thread, I realised that integration tests could fill in some testing gaps in the correctness of rewrite patterns that are not covered by unit tests.

IIUC, the only thing that was incorrect was the original formula. Perhaps the real issue here is that we are missing some formal spec for these things?

Indeed the formula (or, the correctness of the lowering pattern) is the part that couldn't be tested by a unit test - and an e2e test is able to provide test coverage for it. In the absence of formal semantics, and for potentially complex lowerings where formal semantics might be absent/arrive slowly, I think e2e tests could, like you said, be complementary to unit tests (they test orthogonal things) :)

I am actually currently developing semi-formal semantics in Haskell for some MLIR operations. The integration tests in this suite were all "contentious" cases that were bugs on either the Haskell side during development or the MLIR side - an indication that these cases are "complex" and have the potential to go wrong. The sources of these bugs are mostly unsound optimisation patterns and lowerings - parts which aren't covered by MLIR unit testing.

banach-space commented 2 months ago

As far as I can see, unit tests don't verify that a pattern is correct: I mean correct in the sense that for a pattern source -> target, source and target have the same semantics, in all contexts.

Thanks for clarifying, now I see what you meant. Still, I think that one of Mehdi's previous comments is quite fitting here:

Again, virtually the entirety of transformations in LLVM have been developed that way ~ forever; we could change our testing approach of course, but IMO that's an RFC for LLVM Discourse.

I will paraphrase - you are proposing a different approach towards testing compared to what's normally done in LLVM/MLIR. We could change that, sure, but it's a discussion beyond Arith and MLIR, and more suitable for LLVM Discourse.

Indeed the formula (or, the correctness of the lowering pattern) is the part that wasn't able to be tested by a unit test - and an e2e test would be able to provide test coverage for it.

Right, but I don't believe that we should be using MLIR integration tests to verify that a particular mathematical formula is correct. Instead, we should select a formula that is known to be correct (surely there are some good references) and implement that. Testing in MLIR should focus on making sure that the implementation itself is correct, not the formula. Does that make sense?

Having said that, IMHO, it would be totally fine to have e2e for things like floordivsi. That's a good example of "this is complex and let's complement unit tests with integration tests".

I am actually currently developing semi-formal semantics in Haskell for some MLIR operations.

Exciting! It would be interesting to see how you make sure that the semantics in Haskell match MLIR (and whether there's scope for writing a test-suite). That's a completely different thread though 😅

pingshiyu commented 2 months ago

@banach-space @joker-eph

We could change that, sure, but it's a discussion beyond Arith and MLIR, and more suitable for LLVM Discourse.

Yes, thanks for highlighting this. I do understand now that this is a deviation from what people are used to. I'm not very familiar with tests on the LLVM side.

Maybe we could work together on the integration testing guidelines proposal, as we seem to want changes in similar directions :) What other ways can I contact you to discuss this further? I am on Discord as well and my email is on my profile.

I have concerns about raising this with the whole LLVM community because it is quite a substantial scope expansion compared to what I had in mind: I thought integration tests would be useful for rewrites in MLIR, and this PR is initiating that for Arith.

There's a focus on lowering patterns here as they are easier to test using integration tests (hard to control what canonicalization patterns get applied) - and the extensive use of lowerings is characteristic of MLIR rather than LLVM.

we should select a formula that is known to be correct (surely there are some good references) and implement that.

I don't quite agree with this in the context of general testing approaches for MLIR. We can't expect to always have a well-known, or verified formula for all rewrite patterns. Even then, all rewrite patterns are implementations of some formula - and implementations can have bugs, which would benefit from validation using integration tests (since bugs in the patterns cannot be detected by unit tests).

(Note by validation I mean a unit-testing-like approach rather than an exhaustive test suite for all cases: AFAIK patterns don't have any unit testing at all currently, and the question of whether they are correct, or correctly implement some formula, hinges almost entirely upon code reviews)

how you make sure that the semantics in Haskell match MLIR (and whether there's scope for writing a test-suite).

It's funny you mention this! xD The way I am making sure the semantics match is by concurrently developing a fuzzer to generate MLIR programs (e.g. lots of random integration tests) and comparing the semantics against the MLIR compilation results.

This very test suite in this PR actually comes from this process. Every test case here is a bug in either MLIR or in the Haskell semantics - a place where they disagreed at some point during development :)

Finally, for this particular series of PRs (integration tests for Arith), I think the following are the value add:

But we should probably return to this once we get some feedback from the wider MLIR community on testing approaches :)

banach-space commented 2 months ago

Just a couple of clarifications.

Maybe we could work together on the integration testing guidelines proposal, as we seem to want changes in similar directions :)

What I had in mind in my earlier post was merely updating the MLIR docs to clarify what the expected scope of e2e tests should be.

I think that you want to expand the scope of such tests. I am not convinced that that's the right approach.

I don't quite agree with this in the context of general testing approaches for MLIR. We can't expect to always have a well-known, or verified formula for all rewrite patterns.

I was referring specifically to floordivsi as your motivating example and this formula/expansion.

banach-space commented 3 weeks ago

@joker-eph and @pingshiyu, sorry for the delay.

What I had in mind in my earlier post was merely updating the MLIR docs to clarify what the expected scope of e2e tests should be.

Please take a look: https://github.com/llvm/mlir-www/pull/203. Once the docs are updated, we should review the newly added e2e tests and remove cases already tested through other means.