google / benchmark

A microbenchmark support library
Apache License 2.0

Simplify assembly tests #1574

Closed: HFTrader closed this issue 1 year ago

HFTrader commented 1 year ago

I would like to propose the following changes to the assembly tests, with the goal of simplifying their maintenance.

The changes are local and should not affect any other tests or code.

The first change under this proposal is to move the underlying tooling from COMPILER+PYTHON to OBJDUMP+SED; a minimal sketch of the resulting pipeline follows the list below. The advantages are:

  1. objdump's generated assembly is standard no matter which compiler is used, because it comes back from the object file. This will simplify support for any future compiler we decide to adopt.

  2. the "cleaning" step is no longer necessary since we do not have to deal with each compiler's idiosyncrasies. Cleaning objdump's disassembly comes down to a couple of simple regex rules that strip addresses and offsets.

  3. the previous way of generating assembly, injecting an "-S" flag at the end of the compiler's command line, is supported on clang and gcc only. The combination of "-S" and "-o" on NVC++ generated IR source, not assembly. Granted, clang/gcc are the only officially supported compilers, but under this proposal we do not need to touch the compiler's command line at all. By using objdump we future-proof the process against both new compiler versions and new compilers.

  4. with sed we can aggregate all the regex rules into one small script, tools/strip_asm.sed, which is easy to understand and maintain.
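For illustration, here is a minimal sketch of what the OBJDUMP+SED flow could look like. The file names are placeholders and the actual CMake wiring lives in the PR itself; only tools/strip_asm.sed is the script referenced above.

```sh
# Disassemble the already-built object file (compiler-agnostic output), then
# normalize it with the sed script before handing the result to FileCheck.
# File names here are illustrative placeholders.
objdump -d --no-show-raw-insn donotoptimize_assembly_test.o \
  | sed -f tools/strip_asm.sed > donotoptimize_assembly_test.stripped.s
```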

The second change is to normalize register names and memory operands: instead of matching %eax we match REG, and instead of matching -0xc(%rip) we match OFFSET(RIP). This normalization is done cleanly in the sed script at tools/strip_asm.sed.
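As a rough illustration of the kind of rules involved (these are not the literal contents of tools/strip_asm.sed, just a sketch of the idea):

```sh
# Sketch only: rewrite RIP-relative operands first, then collapse any
# remaining register name, so CHECK lines need not care about register
# allocation or exact offsets.
sed -e 's/-\{0,1\}0x[0-9a-f]\{1,\}(%rip)/OFFSET(RIP)/g' \
    -e 's/%[a-z][a-z0-9]*/REG/g' \
    disassembly.s
```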

Keep in mind that, in the context of these tests, we are not interested in the exact assembly, which is the original requirement FileCheck was created to serve in the LLVM project. Most of the time we only want to know that a given assembly instruction was generated. The specific register used, or whether the operand lives in memory, in a register, or is addressed indirectly, is irrelevant in our scenarios.

This normalization comes in handy as we unlock another level of complexity. For example, GCC generates inc %eax while clang generates add 1,%eax and NVIDIA HPC writes add 1,0xc(%rdi). We can match all three versions with a single rule, CHECK: {{inc |add 1,}}{{REG|OFFSET(REG)}}, instead of creating a separate set of rules for each compiler. This is seen in test/donotoptimize_assembly_test.cc:179.

The third change is to add levels of detail to the check rules. We can now check across all compilers with CHECK:, check a specific compiler with CHECK-CLANG: or CHECK-GNU:, check a major version with CHECK-CLANG-12:, and even check a specific major and minor release with CHECK-CLANG-5_0:. This allows flexibility in creating exceptions to a rule.
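To make the prefix levels concrete, a hypothetical FileCheck invocation for clang 12.0 could enable every applicable prefix at once via --check-prefixes (how the list is assembled from the detected compiler and version is up to the build system):

```sh
# Hypothetical sketch: all prefixes that apply to the current toolchain are
# enabled together, from the generic CHECK down to the exact release.
FileCheck --check-prefixes=CHECK,CHECK-CLANG,CHECK-CLANG-12,CHECK-CLANG-12_0 \
  --input-file=donotoptimize_assembly_test.stripped.s \
  test/donotoptimize_assembly_test.cc
```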

The fourth change is that this commit fully supports the NVIDIA HPC compiler nvc++. The tests should pass on clang++, g++ and nvc++.

As this is a proof of concept at this point, I have modified only donotoptimize_assembly_test.cc; the rules in clobber_memory_assembly_test.cc and state_assembly_test.cc still need to be rewritten.

Thank you!

dmah42 commented 1 year ago

this seems like a very sensible step toward making these more maintainable, which would be great. they don't fail often, but when they do it's an issue we need to understand.

dmah42 commented 1 year ago

i approved, but i assume you want to go back and fix the remaining tests?

HFTrader commented 1 year ago

> i approved, but i assume you want to go back and fix the remaining tests?

Absolutely. Let me work on that.

dmah42 commented 1 year ago

> > i approved, but i assume you want to go back and fix the remaining tests?
>
> Absolutely. Let me work on that.

ready? :)

HFTrader commented 1 year ago

> > > i approved, but i assume you want to go back and fix the remaining tests?
> >
> > Absolutely. Let me work on that.
>
> ready? :)

Yes - to some extent. The newer compilers (all three of them, actually) do not follow the pattern in the "standard" tests at all. It looks like there is a massive change in behavior.

Look at the second function in state_assembly_test.cc. You can hardly guess where the internal loop is.

0000000000000080 <test_while_loop>:
  80:             push   %rbx
  81:             sub    $0x10,%rsp
  85:         /-- callq  8a <test_while_loop+0xa>
  8a:         \-> mov    %rax,%rbx
  8d:             movl   $0x2a,0xc(%rsp)
  95:         /-- jmp    a6 <test_while_loop+0x26>
  97:         |   nopw   0x0(%rax,%rax,1)
  a0:   /-----|-> dec    %rax
  a3:   |     |   mov    %rax,(%rbx)
  a6:   |     \-> mov    (%rbx),%rax
  a9:   |         test   %rax,%rax
  ac:   +-------- jg     a0 <test_while_loop+0x20>
  ae:   |         cmpb   $0x0,0x18(%rbx)
  b2:   |  /----- jne    ca <test_while_loop+0x4a>
  b4:   |  |      mov    %rbx,%rdi
  b7:   |  |  /-- callq  bc <test_while_loop+0x3c>
  bc:   |  |  \-> cmpb   $0x0,0x1a(%rbx)
  c0:   |  +----- jne    ca <test_while_loop+0x4a>
  c2:   |  |      mov    (%rbx),%rax
  c5:   |  |      test   %rax,%rax
  c8:   \--|----- jg     a0 <test_while_loop+0x20>
  ca:      \----> mov    %rbx,%rdi
  cd:         /-- callq  d2 <test_while_loop+0x52>
  d2:         \-> mov    $0x65,%eax
  d7:             add    $0x10,%rsp
  db:             pop    %rbx
  dc:             retq   

So while I understand the use of assembly tests in donotoptimize_assembly_test.cc - namely to see that the instruction does not get optimized away - I am having a hard time comprehending what the goal of the state test is.

dmah42 commented 1 year ago

> Yes - to some extent. The newer compilers (all three of them, actually) do not follow the pattern in the "standard" tests at all. [...]
>
> So while I understand the use of assembly tests in donotoptimize_assembly_test.cc - namely to see that the instruction does not get optimized away - I am having a hard time comprehending what the goal of the state test is.

the goal of the test was, for a fixed toolchain, to identify any changes that were introduced that would cause the compiler to generate a slower set of instructions. of course, the toolchains themselves can introduce changes that would cause slower sets of instructions (though you'd hope faster), so maybe the correct way to do this is to allow the toolchains to move and to assess if a new toolchain requires us to write the code differently.

all of this was triggered by a pathological case where a "simple" change to State was made (i don't remember exactly what) that caused the inner loop to get much slower as some memory access got hoisted into the inner loop instead of being in a register (iirc).

so i guess i have two questions:

  1. is there a different way that we can protect against the above concern?
  2. is the current behaviour you're seeing slower or faster than the existing "expectations" in the test?
HFTrader commented 1 year ago

> so i guess i have two questions:
>
>   1. is there a different way that we can protect against the above concern?

My first reaction is to have a loop with something of a known cost - zero is a valid cost for that purpose, perhaps just a DoNotOptimize on a static variable - and then assert on the resulting number of cycles. Cycles are more stable than nanoseconds since you take the CPU frequency out of the equation.

>   2. is the current behaviour you're seeing slower or faster than the existing "expectations" in the test?

That is an intriguing question - I have not timed it. It looks slower because it is looping with call/ret pairs! I'm actually still at a loss with that behavior - I don't know if it results from the code or if it is actually a compiler technique. I am leaning towards the latter because all three compilers produced the same pattern.

I can try to check out an old branch and see what's brewing. I actually had a branch comparison project ready some time ago to address that reviewer's concerns (on the perfcounter group leader change), but it turned out it wasn't necessary. I can use this tool - I will post the results here.

HFTrader commented 1 year ago

It is obvious to me that any such test needs to correlate the new value with a historical average. We could (1) maintain the average somewhere in a database, but that would require these tests to be run on the exact same machine under the exact same conditions. Or (2) we could check out all the versions and compute the historical average on the fly. The second is more time- and disk-consuming, but it will be more reliable since the results will, by definition, be computed under exactly the same conditions.
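To sketch option (2) under some assumptions (the tag names, build target, and binary path are placeholders; tools/compare.py is the comparison script already shipped in the repo, and it compares two runs rather than computing an average):

```sh
# Check out each version of interest, run a benchmark, and keep the JSON
# output so a historical baseline could be derived from the collected runs.
for tag in v1.7.0 v1.7.1 v1.8.0; do            # illustrative tags
  git checkout "$tag"
  cmake -S . -B "build-$tag" -DCMAKE_BUILD_TYPE=Release
  cmake --build "build-$tag" --target basic_test   # placeholder target
  "./build-$tag/test/basic_test" --benchmark_format=json > "results-$tag.json"
done
# Compare two of the runs with the repository's own tool.
tools/compare.py benchmarks results-v1.7.0.json results-v1.8.0.json
```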

I don't see much of a regression in any of these tests. [image: comparison results]

Have a look here to see if this is something that fits. If so, I can push it into benchmark.

dmah42 commented 1 year ago

any test that relies on timing is always going to be vulnerable to the system used to run the test (and the state of that system). this is why we determined that static analysis of the generated code was the more appropriate methodology.

it would be reasonable to have a set of "expectations" for benchmarks that we run on CI, and to flag the PR if it is significantly slower than the known good. i'm not sure how reproducible this is on github workers though, and i fear the noise may drown out the signal.

perhaps this is useful as a "pre-release" check. running this before a release would flag (late) if a release is about to significantly impact expected results. i think this makes for a very nice tool.

i'm internally debating whether, if we had this, we could also drop the assembly tests altogether. i'd want @EricWF to weigh in.

HFTrader commented 1 year ago

> i'm internally debating whether, if we had this, we could also drop the assembly tests altogether. i'd want @EricWF to weigh in.

I believe the donotoptimize assembly tests are great, as is the clobber one. The "state" test is the one that's off the grid.

Please let me know what you guys decide. I am here just to help!

HFTrader commented 1 year ago

I'm going to pull this back for now; I think it needs more thought.