8.5% runtime performance degradation for GNU-toolchain 1.45.x vs. 1.44.x in Windows

mvanthoor commented 4 years ago

Hello,

For some time now I've been writing a chess engine in Rust. I develop on Windows at this time, but because I wanted to use only open source software for this engine, the GNU toolchain has been my primary target. I've been using the MSVC toolchain for profiling. Generally, This was the speed ranking:

GNU toolchain
MSVC toolchain using the LLD linker (installed on MSYS2)
MSVC toolchain with it's normal MS linker

In version 1.45.x, I've seen a great performance drop for the GNU toolchain, when running the compiled program.

Steps to reproduce:

Clone this repository: https://github.com/mvanthoor/rustic Install these toolchains: stable-x86_64-pc-windows-gnu stable-x86_64-pc-windows-msvc (default) 1.44.1-x86_64-pc-windows-gnu 1.44.1-x86_64-pc-windows-msvc* Compile the engine; make sure you have RUSTFLAGS set for cpu-type=skylake Run the engine "rustic.exe".

What it will do is a so called "perft": a performance test. It calculates moves in a given position to a given depth, and outputs the found move counts and speed per depth to the screen. The point of "perft" is to match the numbers known to be correct values, and to make it as fast as possible. You do this for a few hundred positions: if all the numbers match up to the known correct ones, move generation in your engine can be considered bug-free.

As said, in version 1.45, with the GNU toolchain, I have observed a great performance drop. The results on my computer are below. The values given are those for Perft 6, which in this case, is the most interesting.

The reason is that everything below perft 6 is only a few seconds or fractions of seconds, so any deviation is within the margin of error on a modern CPU, while estimated run time for perft 7 in the position I'm using would be about 2.75 hours. The numbers for Perft 5 and 6 are always consistent, with a margin of +/- 0.5 seconds on my system.

As you can see, a compile made with the GNU 1.45.x toolchain ran in 217 seconds... slowest of the bunch, down from a class-leading performance of 200 seconds in version 1.44.x. The MSVC toolchain with the normal MS-linker did get a nice speed boost in version 1.45.

Therefore, to obtain maximum speed, I would either need to stick with version GNU 1.44.1. Even if I switch to version MSVC version 1.45.2 to be able to upgrade, this is still slower than GNU 1.44.1.

(Somewhere along the line during the last few versions, the MSVC/LLD combo first lost speed to 212 seconds, and then gained speed again to 207 seconds. I have seen compiles with this toolchain running at 203 seconds for Perft 6. From all the toolchains across versions, this is the least consistent.)

If you have any questions or need something tested, please ask. In the next Rust version, I'd very much like to see the GNU toolchain regain the speed it had with 1.44.1 and earlier.

===============================================================================

Summary

Perft 6 results:

1.44.1 GNU : 200.23 seconds
1.45.2 MSVC : 202.88 seconds
1.44.1 MSVC : 206.70 seconds
1.45.2 MSVC/LLD : 207.73 seconds
1.44.1 MSVC/LLD : 212.22 seconds
1.45.2 GNU : 217.18 seconds

===============================================================================

1.44.1-x86_64-pc-windows-gnu RUSTFLAGS=-C target-cpu=skylake

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (101 ms, 40451514 leaves/sec) Perft 5: 193690690 (4706 ms, 41158242 leaves/sec) Perft 6: 8031647685 (200227 ms, 40112710 leaves/sec) Total time spent: 205036 ms Execution speed: 40136970 leaves/second

===============================================================================

stable-x86_64-pc-windows-gnu RUSTFLAGS=-C target-cpu=skylake

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (101 ms, 40451514 leaves/sec) Perft 5: 193690690 (4668 ms, 41493292 leaves/sec) Perft 6: 8031647685 (217176 ms, 36982206 leaves/sec) Total time spent: 221947 ms Execution speed: 37078779 leaves/second

===============================================================================

1.44.1-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (105 ms, 38910504 leaves/sec) Perft 5: 193690690 (4827 ms, 40126515 leaves/sec) Perft 6: 8031647685 (206698 ms, 38856920 leaves/sec) Total time spent: 211632 ms Execution speed: 38886009 leaves/second

===============================================================================

stable-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (102 ms, 40054931 leaves/sec) Perft 5: 193690690 (4728 ms, 40966728 leaves/sec) Perft 6: 8031647685 (202875 ms, 39589144 leaves/sec) Total time spent: 207707 ms Execution speed: 39620830 leaves/second

===============================================================================

1.44.1-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake Using lld-link.exe through MSYS2

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (105 ms, 38910504 leaves/sec) Perft 5: 193690690 (4839 ms, 40027007 leaves/sec) Perft 6: 8031647685 (207272 ms, 38749313 leaves/sec) Total time spent: 212218 ms Execution speed: 38778632 leaves/second

===============================================================================

stable-x86_64-pc-windows-msvc RUSTFLAGS=-C target-cpu=skylake Using lld-link.exe through MSYS2

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (102 ms, 40054931 leaves/sec) Perft 5: 193690690 (4729 ms, 40958065 leaves/sec) Perft 6: 8031647685 (202895 ms, 39585242 leaves/sec) Total time spent: 207728 ms Execution speed: 39616825 leaves/second

===============================================================================

Had a lot of issues with Cargo Bisect; many errors. Also if it does work, it seems to only compile the project and when it succeeds, it's OK. I'll have to look into this more. I'd appreciate some assistance. For now, I did the bisection by hand.

Start: nightly-2020-04-23-x86_64-pc-windows-gnu (last 1.44.0 nightly) GOOD 2020-05-14: GOOD 2020-05-18: GOOD 2020-05-20: GOOD 2020-05-21: GOOD 2020-05-22: BAD <== End: nightly-2020-06-05-x86_64-pc-windows-gnu (last 1.45.0 nightly) BAD

The 1.45.0 nightly from 2020-05-21 still gives the same results as the 1.44.0 nightly from 2020-04-23. The next one, from 2020-05-22, is the first to have this performance degradation.

I'd appreciate some assitance in using cargo-bisect to be able to find the PR that caused this slowdown.

petrochenkov commented 4 years ago

@mvanthoor There's a tool for bisecting nightly releases of rustc by running arbitrary scripts on them - https://crates.io/crates/cargo-bisect-rustc. If you could run it with your benchmark as a script and return "false" when the slowdown is detected, then we could determine the source of the regression very precisely.

retep998 commented 4 years ago

Why is there even any sort of performance difference based on which linker is used? Shouldn't Rust be generating the same code regardless?

mati865 commented 4 years ago

@retep998 when LTO is enabled linker will apply few optimisations on it's own.

I'm unable to reproduce on 2700X CPU using RUSTFLAGS="-C target-cpu=skylake" cargo run --release and windows-gnu targets:

LLD:
- nightly 220.7s
- stable 220.6s
- 1.44.1 223.1s
LD (BFD)
- nightly 219.8s
- stable 219.7s
- 1.44.1 221.9s

mvanthoor commented 4 years ago

@petrochenkov : Thanks, tomorrow I'll look into this tool. I've determined the last 1.44.0 nightly (good) and the last 1.45 nightly (which is bad). I'll run the bisect with start and end for those.

@mati865 : Thanks for testing. In your case, it actually seems that current rust is faster than 1.44.1. (It also seems that an LLD link is a tiny bit faster than an LD link). I'll have to try and link with LLD again on Windows when using GNU. Last time I tried, the engine crashed. LLD only worked on the MSVC toolchain, and there, it was faster than the default MSVC linker. Now, LLD is actually slower on MSVC.

Also, your testing results may be skewed: BMI2 execution on AMD-cpu's is poor on AMD compared to Intel, especially in the case of chess engines, as they can lean heavily on BMI2 instructions (speed-up is about 6% in my engine, compared to no BMI2).

See here: https://www.chessprogramming.org/BMI2

If the problem is in generating the BMI2 instructions, running the engine on an Intel CPU could possibly incur a bigger impact than when it's running on AMD.

mati865 commented 4 years ago

@mvanthoor FYI that's how I'm enabling LLD with windows-gnu:

$ cat ~/.cargo/config
[target.i686-pc-windows-gnu]
rustflags=["-Clink-arg=-fuse-ld=lld"]
[target.x86_64-pc-windows-gnu]
rustflags=["-Clink-arg=-fuse-ld=lld"]

You can simply add that to RUSTFLAGS without modifying Cargo config.

This issue sounds like LLVM could be potential cause here but it's mystery for me why only windows-gnu was affected in your case.

mvanthoor commented 4 years ago

@petrochenkov

Had a lot of issues with Cargo Bisect; many errors. Also if it does work, it seems to only compile the project and when it succeeds, it's OK. I'll have to look into this more. I'd appreciate some assistance. For now, I did the bisection by hand.

Start: nightly-2020-04-23-x86_64-pc-windows-gnu (last 1.44.0 nightly) GOOD 2020-05-14: GOOD 2020-05-18: GOOD 2020-05-20: GOOD 2020-05-21: GOOD 2020-05-22: BAD <== End: nightly-2020-06-05-x86_64-pc-windows-gnu (last 1.45.0 nightly) BAD

The 1.45.0 nightly from 2020-05-21 still gives the same results as the 1.44.0 nightly from 2020-04-23. The next one, from 2020-05-22, is the first to have this performance degradation. (I've quickly checked the PR's, and it seems this version is the one that includes the merge for LLVM10.)

I'd appreciate some assitance in using cargo-bisect to be able to find the PR that caused this slowdown.

Aaron1011 commented 4 years ago

@mvanthoor: Could you write a script that determines if the performance is 'good' or 'bad'? We can help you integrate that with cargo-bisect

mvanthoor commented 4 years ago

@Aaron1011

Of course.

If the speed difference between 1.44.1 and 1.45 had been 1-2 seconds, I wouldn't even have posted an issue. I've often seen fluctuations of 1-2 seconds up or down between Rust versions. Heck, I've seen improvements or degradations of 4-5 seconds due to refactoring of the code.

This slowdown of about 17 seconds is just too much to not report.

When compiled with GNU toolchain 1.44.1 and cpu-target=skylake, the current latest version of the program (in the "refactoring" branch in the repository) always yields this output:

Perft 1: 48 (0 ms, inf leaves/sec) Perft 2: 2039 (0 ms, inf leaves/sec) Perft 3: 97862 (2 ms, 48931000 leaves/sec) Perft 4: 4085603 (102 ms, 40054931 leaves/sec) Perft 5: 193690690 (4714 ms, 41088394 leaves/sec) Perft 6: 8031647685 (200430 ms, 40072083 leaves/sec) Total time spent: 205248 ms Execution speed: 40095513 leaves/second

The bold italic line for Perft 6 is the interesting one. Shorter perfts are too short, longer ones are not practical. The time taken for Perft 6 fluctuates by +/- 0.5 seconds. (More if I touch the computer, and start a program, obviously, but that shouldn't be done.)

The script is as follows:

#!/bin/bash

MAX_TIME=200
SPEED=$(./target/release/rustic.exe | grep -i 'perft 6' | cut -b22-24)

[[ "$SPEED" -le "$MAX_TIME" ]] && echo 'good' || echo 'bad'

It cuts out the time in seconds: Perft 6: 8031647685 (200430 ms, 40072083 leaves/sec)

If this is 200 or less, performance is 'good'. Otherwise, it's 'bad'. For the purpose of this test, MAX_TIME could be set to 201 or 202 to eliminate some more fluctuations. Obviously, this script only runs the engine, and then reports 'good' or 'bad' based on the time taken.

Hope this helps. Please ask anything you need to know, or advice where needed, if this script needs to do anything else to get cargo-bisect to run correctly.

I haven't gotten cargo-bisect to run correctly; the farthest I have come was to have it compile the program for all nightly builds, and because it doesn't break during compiling, there's no problem... when I try to run cargo-bisect between two commits as described in the tutorial, it either crashes with a panic, or it immediately stops because 'start commit already has the regression'.

Thanks for your help in advance.

spastorino commented 4 years ago

Assigning P-medium as discussed as part of the Prioritization Working Group procedure and removing I-prioritize.

mvanthoor commented 4 years ago

Hi,

I've consolidated the repository a bit to make this easier to check.

The branch where I started to refactor is "add_king_square". This refactor turned out to be very slow in Rust 1.45.x, as described in the above thread. Therefore I started again from "add_king_square", doing an alternative refactor; roughly the same, but not entirely.

In the end I've removed intermediate branches and did some renames on the branches. The current situation is:

Branch "refactoring" is the latest current state of the program Branch "speed degradation" contains the exact version for which I started this thread.

While refactoring, I've seen slowdowns creep in, but it's not possible to pinpoint which refactoring pass causes the slowdown, because the results always fluctuate a bit. The version in the branch "refactoring" does not yet have all the refactors that the one in "speed_degradation" has (but in the end it will, just because the code is more readable and maintainable), but I'm already seeing a measurable and steady slowdown:

Branch "refactoring" in Rust 1.44.1: 197.x seconds Branch "refactoring" in Rust 1.45.x: 200.x seconds

The branch "refactoring" is consistently 2.5 seconds slower In Rust 1.45.x, than in 1.44.1. So, the slowdown is not as dramatic as in the "speed degradation" branch, but there's already something there. I wonder if it's going to get slower and and slower during refactoring and be at 217 seconds again in Rust 1.45 when this branch has all the refactors that "speed_degradation" has.

Tonight or tomorrow I'll test if it's the same version of Rust (the version that switched to LLVM 10) that causes the slowdown.

mati865 commented 2 years ago

Is this issue still relevant?

rust-lang / rust

8.5% runtime performance degradation for GNU-toolchain 1.45.x vs. 1.44.x in Windows #75374