Closed HFTrader closed 1 year ago
this seems like a very sensible step to making these more maintainable, which would be great. they don't fail often, but when they do it's an issue we need to understand.
i approved, but i assume you want to go back and fix the remaining tests?
i approved, but i assume you want to go back and fix the remaining tests?
Absolutely. Let me work on that.
i approved, but i assume you want to go back and fix the remaining tests?
Absolutely. Let me work on that.
ready? :)
i approved, but i assume you want to go back and fix the remaining tests?
Absolutely. Let me work on that.
ready? :)
Yes - to some extent. The newer compilers (all three of them actually) do not follow at all the pattern that was in the "standard" tests. It looks like there is a massive change in behavior.
Look at the second function in state_assembly_test.cc. You can hardly guess where the internal loop is.
0000000000000080 <test_while_loop>:
80: push %rbx
81: sub $0x10,%rsp
85: /-- callq 8a <test_while_loop+0xa>
8a: \-> mov %rax,%rbx
8d: movl $0x2a,0xc(%rsp)
95: /-- jmp a6 <test_while_loop+0x26>
97: | nopw 0x0(%rax,%rax,1)
a0: /-----|-> dec %rax
a3: | | mov %rax,(%rbx)
a6: | \-> mov (%rbx),%rax
a9: | test %rax,%rax
ac: +-------- jg a0 <test_while_loop+0x20>
ae: | cmpb $0x0,0x18(%rbx)
b2: | /----- jne ca <test_while_loop+0x4a>
b4: | | mov %rbx,%rdi
b7: | | /-- callq bc <test_while_loop+0x3c>
bc: | | \-> cmpb $0x0,0x1a(%rbx)
c0: | +----- jne ca <test_while_loop+0x4a>
c2: | | mov (%rbx),%rax
c5: | | test %rax,%rax
c8: \--|----- jg a0 <test_while_loop+0x20>
ca: \----> mov %rbx,%rdi
cd: /-- callq d2 <test_while_loop+0x52>
d2: \-> mov $0x65,%eax
d7: add $0x10,%rsp
db: pop %rbx
dc: retq
So while I understand the use of assembly tests in donotoptimize_assembly_test.cc
- namely to see if the instruction does not get optimized away - I am having a hard time to comprehend what is the goal with the state test.
i approved, but i assume you want to go back and fix the remaining tests?
Absolutely. Let me work on that.
ready? :)
Yes - to some extent. The newer compilers (all three of them actually) do not follow at all the pattern that was in the "standard" tests. It looks like there is a massive change in behavior.
Look at the second function in state_assembly_test.cc. You can hardly guess where the internal loop is.
0000000000000080 <test_while_loop>: 80: push %rbx 81: sub $0x10,%rsp 85: /-- callq 8a <test_while_loop+0xa> 8a: \-> mov %rax,%rbx 8d: movl $0x2a,0xc(%rsp) 95: /-- jmp a6 <test_while_loop+0x26> 97: | nopw 0x0(%rax,%rax,1) a0: /-----|-> dec %rax a3: | | mov %rax,(%rbx) a6: | \-> mov (%rbx),%rax a9: | test %rax,%rax ac: +-------- jg a0 <test_while_loop+0x20> ae: | cmpb $0x0,0x18(%rbx) b2: | /----- jne ca <test_while_loop+0x4a> b4: | | mov %rbx,%rdi b7: | | /-- callq bc <test_while_loop+0x3c> bc: | | \-> cmpb $0x0,0x1a(%rbx) c0: | +----- jne ca <test_while_loop+0x4a> c2: | | mov (%rbx),%rax c5: | | test %rax,%rax c8: \--|----- jg a0 <test_while_loop+0x20> ca: \----> mov %rbx,%rdi cd: /-- callq d2 <test_while_loop+0x52> d2: \-> mov $0x65,%eax d7: add $0x10,%rsp db: pop %rbx dc: retq
So while I understand the use of assembly tests in
donotoptimize_assembly_test.cc
- namely to see if the instruction does not get optimized away - I am having a hard time to comprehend what is the goal with the state test.
the goal of the test was, for a fixed toolchain, to identify any changes that were introduced that would cause the compiler to generate a slower set of instructions. of course, the toolchains themselves can introduce changes that would cause slower sets of instructions (though you'd hope faster), so maybe the correct way to do this is to allow the toolchains to move and to assess if a new toolchain requires us to write the code differently.
all of this was triggered by a pathological case where a "simple" change to State
was made (i don't remember exactly what) that caused the inner loop to get much slower as some memory access got hoisted into the inner loop instead of being in a register (iirc).
so i guess i have two questions:
so i guess i have two questions:
- is there a different way that we can protect against the above concern?
My first reaction is to have a loop with something of a known cost - zero is a cost for that purpose, perhaps just a donotoptimize with a static variable - and then assert the resulting number of cycles. Cycles is more stable than nanoseconds as you take away the cpu frequency from the equation.
- is the current behaviour you're seeing slower or faster than the existing "expectations" in the test?
That is an intriguing question - I have not timed it. It looks slower because it is looping with call/ret pairs! I'm actually still at a loss with that behavior - I dont know if it is resulting from the code or it's actually a compiler technique. I am leaning towards the second because all three compilers presented the same.
I can try to checkout an old branch and see what's breeding. I actually had a branch comparison project ready some time ago to quench that reviewer's concerns (on the perfcounter group leader change) but it ends up it wasn't necessary. I can use this tool - I will post the results here.
It is obvious to me that any such test needs to correlate the new value with the historical average. We could (1) maintain the average somewhere in a database but that will require that these tests be run in the exact same machine under the exact same conditions. Or (2) we could checkout all the versions and compute the historical average on the fly. The second is more time and disk consuming but it will be more reliable since the results by definition will be computed on exactly the same conditions.
I dont see much of a regression in any of these tests.
Have a look here to see if this is something that fits. If so, I can push into benchmark.
any test that relies on timing is always going to be vulnerable to the system used to run the test (and the state of that system). this is why we determined that static analysis of the generated code was the more appropriate methodology.
it would be reasonable to have a set of "expectations" for benchmarks that we run on CI and if the PR is significantly slower than the known good that we flag it. i'm not sure how reproducible this is on github workers though, and i fear the noise may drown out the signal.
perhaps this is useful as a "pre-release" check. running this before a release would flag (late) if a release is about to significantly impact expected results. i think this makes for a very nice tool.
i'm internally debating if we had this if we could also drop assembly tests altogether. i'd want @EricWF to weigh in.
i'm internally debating if we had this if we could also drop assembly tests altogether. i'd want @EricWF to weigh in.
I believe the donotoptimize assembly tests are great, as well as the clobber. The "state" tests is the one that's off the grid.
Please let me know what you guys decide. I am here just to help!
I'm going to pull this off for now, I think this needs more thinking.
I would like to propose the following changes in the assembly tests with the goal to simplify their maintenance.
The changes are local and should not affect any other tests or code.
The first change under this proposal is moving the underlying tooling from COMPILER+PYTHON to OBJDUMP+SED. The advantages are:
objdump's generated assembly is standard no matter what compiler is being used - because it is coming back from object file. This will simplify support to any future compiler we decide to adopt.
the "cleaning" step is not necessary anymore since we do not have to deal with each compiler's idiosyncrasies. From objdump's disassembly it is pretty much a couple of simple regex rules to clean addresses and offsets.
the way we were generating assembly code by injecting an "-S" flag to the end of the compiler's command line is supported on clang and gcc only. This combination of "-S" and "-o" on NVC++ generated IR language source, not assembly. Granted, clang/gcc are the only officially supported compilers but in the current proposal we do not need to touch on the compiler's command line at all. By using objdump we will be future-proofing the process both in terms of new versions and new compilers.
By using SED we can aggregate all the regex rules under a very neat SED script under
/tools/strip_asm.sed
, which is very easy to understand and maintain.The second change is to normalize all instruction and register names. Instead of matching
%eax
we will be matching REG. Instead of matching -0xc(%rip) we will be matching OFFSET(RIP). This normalization is done cleanly in the SED script located in tools/strip_asm.sed.Keep in mind that in the context of these tests we are not interested in the exact assembly as the original requirement that FileCheck was created for in the LLVM project. Most of the time we are interested in knowing that a given assembly instruction was generated. The specific register used or if it uses memory, a register or indirect addressing is irrelevant in our scenarios.
This normalization comes in handy as we unlock another level of complexity. For example, the GCC compiler generates
inc %eax
while clang generatesadd 1,%eax
and nvidia hpc writesadd 1,0xc(%rdi)
. We can match all these three versions with a single ruleCHECK: {{inc |add 1,}}{{REG|OFFSET(REG}}}
instead of creating three separate sets of rules for each compiler. This is seen intest/donotoptimize_assembly_test.cc:179
.The third change is to add levels of detail to the check rules. We can now check for all the compilers with
CHECK:
, check for a specific compiler withCHECK-CLANG:
orCHECK-GNU:
, check for a major version withCHECK-CLANG-12:
and even check for major and minor releases asCHECK-CLANG-5_0
. This allows flexibility in creating exceptions to the rule.The fourth change is that this commit fully supports the NVIDIA HPC compiler nvc++. The tests should pass on clang++, g++ and nvc++.
As this is a proof of concept at this point, I have modified only the donotoptimize_assembly_test.cc test, there is still clobber_memory_assembly_test.cc and state_assembly_test.cc to rewrite the rules.
Thank you!