Open mvanthoor opened 4 years ago
@mvanthoor There's a tool for bisecting nightly releases of rustc by running arbitrary scripts on them - https://crates.io/crates/cargo-bisect-rustc. If you could run it with your benchmark as a script and return "false" when the slowdown is detected, then we could determine the source of the regression very precisely.
Why is there even any sort of performance difference based on which linker is used? Shouldn't Rust be generating the same code regardless?
@retep998 when LTO is enabled linker will apply few optimisations on it's own.
I'm unable to reproduce on 2700X CPU using RUSTFLAGS="-C target-cpu=skylake" cargo run --release
and windows-gnu targets:
@petrochenkov : Thanks, tomorrow I'll look into this tool. I've determined the last 1.44.0 nightly (good) and the last 1.45 nightly (which is bad). I'll run the bisect with start and end for those.
@mati865 : Thanks for testing. In your case, it actually seems that current rust is faster than 1.44.1. (It also seems that an LLD link is a tiny bit faster than an LD link). I'll have to try and link with LLD again on Windows when using GNU. Last time I tried, the engine crashed. LLD only worked on the MSVC toolchain, and there, it was faster than the default MSVC linker. Now, LLD is actually slower on MSVC.
Also, your testing results may be skewed: BMI2 execution on AMD-cpu's is poor on AMD compared to Intel, especially in the case of chess engines, as they can lean heavily on BMI2 instructions (speed-up is about 6% in my engine, compared to no BMI2).
See here: https://www.chessprogramming.org/BMI2
If the problem is in generating the BMI2 instructions, running the engine on an Intel CPU could possibly incur a bigger impact than when it's running on AMD.
@mvanthoor FYI that's how I'm enabling LLD with windows-gnu
:
$ cat ~/.cargo/config
[target.i686-pc-windows-gnu]
rustflags=["-Clink-arg=-fuse-ld=lld"]
[target.x86_64-pc-windows-gnu]
rustflags=["-Clink-arg=-fuse-ld=lld"]
You can simply add that to RUSTFLAGS
without modifying Cargo config.
This issue sounds like LLVM could be potential cause here but it's mystery for me why only windows-gnu
was affected in your case.
@petrochenkov
Had a lot of issues with Cargo Bisect; many errors. Also if it does work, it seems to only compile the project and when it succeeds, it's OK. I'll have to look into this more. I'd appreciate some assistance. For now, I did the bisection by hand.
Start: nightly-2020-04-23-x86_64-pc-windows-gnu (last 1.44.0 nightly) GOOD 2020-05-14: GOOD 2020-05-18: GOOD 2020-05-20: GOOD 2020-05-21: GOOD 2020-05-22: BAD <== End: nightly-2020-06-05-x86_64-pc-windows-gnu (last 1.45.0 nightly) BAD
The 1.45.0 nightly from 2020-05-21 still gives the same results as the 1.44.0 nightly from 2020-04-23. The next one, from 2020-05-22, is the first to have this performance degradation. (I've quickly checked the PR's, and it seems this version is the one that includes the merge for LLVM10.)
I'd appreciate some assitance in using cargo-bisect to be able to find the PR that caused this slowdown.
@mvanthoor: Could you write a script that determines if the performance is 'good' or 'bad'? We can help you integrate that with cargo-bisect
@Aaron1011
Of course.
If the speed difference between 1.44.1 and 1.45 had been 1-2 seconds, I wouldn't even have posted an issue. I've often seen fluctuations of 1-2 seconds up or down between Rust versions. Heck, I've seen improvements or degradations of 4-5 seconds due to refactoring of the code.
This slowdown of about 17 seconds is just too much to not report.
When compiled with GNU toolchain 1.44.1 and cpu-target=skylake, the current latest version of the program (in the "refactoring" branch in the repository) always yields this output:
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (102 ms, 40054931 leaves/sec) Perft 5: 193690690 (4714 ms, 41088394 leaves/sec) Perft 6: 8031647685 (200430 ms, 40072083 leaves/sec) Total time spent: 205248 ms Execution speed: 40095513 leaves/second
The bold italic line for Perft 6 is the interesting one. Shorter perfts are too short, longer ones are not practical. The time taken for Perft 6 fluctuates by +/- 0.5 seconds. (More if I touch the computer, and start a program, obviously, but that shouldn't be done.)
The script is as follows:
#!/bin/bash
MAX_TIME=200
SPEED=$(./target/release/rustic.exe | grep -i 'perft 6' | cut -b22-24)
[[ "$SPEED" -le "$MAX_TIME" ]] && echo 'good' || echo 'bad'
It cuts out the time in seconds: Perft 6: 8031647685 (200430 ms, 40072083 leaves/sec)
If this is 200 or less, performance is 'good'. Otherwise, it's 'bad'. For the purpose of this test, MAX_TIME could be set to 201 or 202 to eliminate some more fluctuations. Obviously, this script only runs the engine, and then reports 'good' or 'bad' based on the time taken.
Hope this helps. Please ask anything you need to know, or advice where needed, if this script needs to do anything else to get cargo-bisect to run correctly.
I haven't gotten cargo-bisect to run correctly; the farthest I have come was to have it compile the program for all nightly builds, and because it doesn't break during compiling, there's no problem... when I try to run cargo-bisect between two commits as described in the tutorial, it either crashes with a panic, or it immediately stops because 'start commit already has the regression'.
Thanks for your help in advance.
Assigning P-medium
as discussed as part of the Prioritization Working Group procedure and removing I-prioritize
.
Hi,
I've consolidated the repository a bit to make this easier to check.
The branch where I started to refactor is "add_king_square". This refactor turned out to be very slow in Rust 1.45.x, as described in the above thread. Therefore I started again from "add_king_square", doing an alternative refactor; roughly the same, but not entirely.
In the end I've removed intermediate branches and did some renames on the branches. The current situation is:
Branch "refactoring" is the latest current state of the program Branch "speed degradation" contains the exact version for which I started this thread.
While refactoring, I've seen slowdowns creep in, but it's not possible to pinpoint which refactoring pass causes the slowdown, because the results always fluctuate a bit. The version in the branch "refactoring" does not yet have all the refactors that the one in "speed_degradation" has (but in the end it will, just because the code is more readable and maintainable), but I'm already seeing a measurable and steady slowdown:
Branch "refactoring" in Rust 1.44.1: 197.x seconds Branch "refactoring" in Rust 1.45.x: 200.x seconds
The branch "refactoring" is consistently 2.5 seconds slower In Rust 1.45.x, than in 1.44.1. So, the slowdown is not as dramatic as in the "speed degradation" branch, but there's already something there. I wonder if it's going to get slower and and slower during refactoring and be at 217 seconds again in Rust 1.45 when this branch has all the refactors that "speed_degradation" has.
Tonight or tomorrow I'll test if it's the same version of Rust (the version that switched to LLVM 10) that causes the slowdown.
Is this issue still relevant?
Hello,
For some time now I've been writing a chess engine in Rust. I develop on Windows at this time, but because I wanted to use only open source software for this engine, the GNU toolchain has been my primary target. I've been using the MSVC toolchain for profiling. Generally, This was the speed ranking:
In version 1.45.x, I've seen a great performance drop for the GNU toolchain, when running the compiled program.
Steps to reproduce:
Clone this repository: https://github.com/mvanthoor/rustic Install these toolchains: stable-x86_64-pc-windows-gnu stable-x86_64-pc-windows-msvc (default) 1.44.1-x86_64-pc-windows-gnu 1.44.1-x86_64-pc-windows-msvc* Compile the engine; make sure you have RUSTFLAGS set for cpu-type=skylake Run the engine "rustic.exe".
What it will do is a so called "perft": a performance test. It calculates moves in a given position to a given depth, and outputs the found move counts and speed per depth to the screen. The point of "perft" is to match the numbers known to be correct values, and to make it as fast as possible. You do this for a few hundred positions: if all the numbers match up to the known correct ones, move generation in your engine can be considered bug-free.
As said, in version 1.45, with the GNU toolchain, I have observed a great performance drop. The results on my computer are below. The values given are those for Perft 6, which in this case, is the most interesting.
The reason is that everything below perft 6 is only a few seconds or fractions of seconds, so any deviation is within the margin of error on a modern CPU, while estimated run time for perft 7 in the position I'm using would be about 2.75 hours. The numbers for Perft 5 and 6 are always consistent, with a margin of +/- 0.5 seconds on my system.
As you can see, a compile made with the GNU 1.45.x toolchain ran in 217 seconds... slowest of the bunch, down from a class-leading performance of 200 seconds in version 1.44.x. The MSVC toolchain with the normal MS-linker did get a nice speed boost in version 1.45.
Therefore, to obtain maximum speed, I would either need to stick with version GNU 1.44.1. Even if I switch to version MSVC version 1.45.2 to be able to upgrade, this is still slower than GNU 1.44.1.
(Somewhere along the line during the last few versions, the MSVC/LLD combo first lost speed to 212 seconds, and then gained speed again to 207 seconds. I have seen compiles with this toolchain running at 203 seconds for Perft 6. From all the toolchains across versions, this is the least consistent.)
If you have any questions or need something tested, please ask. In the next Rust version, I'd very much like to see the GNU toolchain regain the speed it had with 1.44.1 and earlier.
===============================================================================
Summary
Perft 6 results:
===============================================================================
1.44.1-x86_64-pc-windows-gnu RUSTFLAGS=-C target-cpu=skylake
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (101 ms, 40451514 leaves/sec) Perft 5: 193690690 (4706 ms, 41158242 leaves/sec) Perft 6: 8031647685 (200227 ms, 40112710 leaves/sec) Total time spent: 205036 ms Execution speed: 40136970 leaves/second
===============================================================================
stable-x86_64-pc-windows-gnu RUSTFLAGS=-C target-cpu=skylake
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (101 ms, 40451514 leaves/sec) Perft 5: 193690690 (4668 ms, 41493292 leaves/sec) Perft 6: 8031647685 (217176 ms, 36982206 leaves/sec) Total time spent: 221947 ms Execution speed: 37078779 leaves/second
===============================================================================
1.44.1-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (105 ms, 38910504 leaves/sec) Perft 5: 193690690 (4827 ms, 40126515 leaves/sec) Perft 6: 8031647685 (206698 ms, 38856920 leaves/sec) Total time spent: 211632 ms Execution speed: 38886009 leaves/second
===============================================================================
stable-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (102 ms, 40054931 leaves/sec) Perft 5: 193690690 (4728 ms, 40966728 leaves/sec) Perft 6: 8031647685 (202875 ms, 39589144 leaves/sec) Total time spent: 207707 ms Execution speed: 39620830 leaves/second
===============================================================================
1.44.1-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake Using lld-link.exe through MSYS2
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (105 ms, 38910504 leaves/sec) Perft 5: 193690690 (4839 ms, 40027007 leaves/sec) Perft 6: 8031647685 (207272 ms, 38749313 leaves/sec) Total time spent: 212218 ms Execution speed: 38778632 leaves/second
===============================================================================
stable-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake Using lld-link.exe through MSYS2
Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (102 ms, 40054931 leaves/sec) Perft 5: 193690690 (4729 ms, 40958065 leaves/sec) Perft 6: 8031647685 (202895 ms, 39585242 leaves/sec) Total time spent: 207728 ms Execution speed: 39616825 leaves/second
===============================================================================
Had a lot of issues with Cargo Bisect; many errors. Also if it does work, it seems to only compile the project and when it succeeds, it's OK. I'll have to look into this more. I'd appreciate some assistance. For now, I did the bisection by hand.
Start: nightly-2020-04-23-x86_64-pc-windows-gnu (last 1.44.0 nightly) GOOD 2020-05-14: GOOD 2020-05-18: GOOD 2020-05-20: GOOD 2020-05-21: GOOD 2020-05-22: BAD <== End: nightly-2020-06-05-x86_64-pc-windows-gnu (last 1.45.0 nightly) BAD
The 1.45.0 nightly from 2020-05-21 still gives the same results as the 1.44.0 nightly from 2020-04-23. The next one, from 2020-05-22, is the first to have this performance degradation.
I'd appreciate some assitance in using cargo-bisect to be able to find the PR that caused this slowdown.