Open KOROBYAKA opened 8 months ago
Just tested on a slightly different CPU target: with line setting n=3 commented out:
Check_dumb n=3 returned None
Time taken 915.227343ms
Check with simd n=3 returned None
Time taken 4.634040885s
Wow 4.6 seconds is slower than javascript! Same but with line present:
Check_dumb n=3 returned None
Time taken 907.446979ms
Check with simd n=3 returned None
Time taken 502.535157ms
CPU used was 13th Gen Intel(R) Core(TM) i5-1335U, optimization was set for native:
[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]
Diving a bit into the assembly of check_smart, it seems that failure to figure out an appropriate const value for n results in pow function being splattered around the code of the hot loop, killing any sort of hope for performance. It is very odd that x.pow(n) is not moved out of the loop over z though, in which case it should not matter how it is implemented.
Strangely, this only seems to happen with target-cpu=native
(even on godbolt which is Zen 3), not if you use the baseline or explicitly specify target-cpu=raptorlake
(for the i5-1335U) or any other specific CPU I've tried (haswell, core-avx2, znver3, etc.): https://godbolt.org/z/jqrs7bP5W.
(I was expecting this to be the same root cause as https://github.com/rust-lang/rust/issues/112478#issuecomment-1586332367, where printing a value prevents LLVM from seeing that it's constant, since a pointer escapes into the printing machinery that could be used to mutate it, but it seems weirder than that.)
That is mighty odd indeed. On my machine the bug persists with raptorlake setting:
$$ ~/fermat (master)> cargo clean && RUSTFLAGS="-Ctarget-cpu=raptorlake" cargo run --release
Removed 11 files, 844.0KiB total
Compiling fermat v0.1.0 (/home/headhunter/fermat)
Finished release [optimized] target(s) in 0.26s
Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 932.948488ms
Check with simd n=3 returned None
Time taken 3.440539885s
LLVM complains about "failed to hoist load with loop-invariant address because load is conditionally executed" near calls to pow(n), but I'm not sure how to extract more info from its optimization remarks...
I've also tried targeting some other architectures that my machines could run with same result:
$$ ~/fermat (master)> cargo clean && RUSTFLAGS="-Ctarget-cpu=cascadelake" cargo run --release
Removed 11 files, 844.0KiB total
Compiling fermat v0.1.0 (/home/headhunter/fermat)
Finished release [optimized] target(s) in 0.25s
Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 910.116577ms
Check with simd n=3 returned None
Time taken 4.317862738s
$$ ~/fermat (master) [101]> cargo clean && RUSTFLAGS="-Ctarget-cpu=znver2" cargo run --release
Removed 7 files, 3.3KiB total
Compiling fermat v0.1.0 (/home/headhunter/fermat)
Finished release [optimized] target(s) in 0.25s
Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 924.990783ms
Check with simd n=3 returned None
Time taken 4.279846815s
$$ ~/fermat (master)> cargo clean && RUSTFLAGS="-Ctarget-cpu=znver3" cargo run --release
Removed 11 files, 844.0KiB total
Compiling fermat v0.1.0 (/home/headhunter/fermat)
Finished release [optimized] target(s) in 0.29s
Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 954.212518ms
Check with simd n=3 returned None
Time taken 3.432712025s
Also I've checked your hypothesis about printing killing optimization, and it seems to hold up. Modifying the callsite for check_smart to print n just before call to the function results in failure to optimize.
{
let n = 3;
println!("Value of n is {}", n.clone()); //removing clone() here will break optimization no matter how dumb that is!
let s = std::time::Instant::now();
println!("Check with simd n={} returned {:?}",n, check_smart(n, 1000));
println!("Time taken {:?}", s.elapsed());
}
The deeper question now is as follows: why would it even matter which value n takes? It is entirely irrelevant to the optimizations within the inner loop over z (as both x and y are constant there, and thus x.pow(n) can be safely hoisted out of the loop over z). Even if n could take some entirely bizarre value that would panic the call to pow, there is no point in keeping it within the inner loop, or am I missing something here? Does it actually matter where exactly panic is generated? Edit: u32.pow() does not panic, even on overflow. So it should be always safe to hoist calls to it, no matter value for n. Edit2: with new code structure, optimizer remark changed to "failed to hoist load with loop-invariant address because load is conditionally executed", but resulting asm is pretty much the same garbage. And the remark seems to have nothing to do with the actual problem at hand.
Managed to reproduce in Godbolt. Works reliably with different targets also (not just native). The difference in perf on AMD is quite dramatic, from 400ms to 9.5 seconds is crazy slow.
Tested again on latest rustc 1.78.0-nightly (c67326b06 2024-03-15) , no changes so far in either direction.
Tested again on 1.80, the issue no longer presents. Whatever made it go away is unclear. rustc 1.80.0-beta.3 (105fc5ccc 2024-06-14) - does not have issue rustc 1.79.0 (129f3b996 2024-06-10) - still has the issue
I'll keep watching if it reemerges.
Latest nightly rustc 1.82.0-nightly (92c6c0380 2024-07-21) has the problem again with the exact same symptoms as before. There is a clear regression here.
We have observed unexpected loss of vectorization effort by the compiler when compiling the code below. Target was amd64 machine with AVX2 instruction set supported.
I expected to see this happen: the function check_smart should leverage vectorization and SIMD instructions, while function check_dumb should be unable to do so. I should not have to specify n=3 twice in main(), as value of n from outer scope should be propagated into function check_smart. The vectorization in check_smart should happen for any value of n (even when n is not known ahead of time).
Instead, this happened: the ability of the compiler to actually perform SIMD vectorization of check_smart seems to depend on whether x.pow(n) can panic. When value of n is not propagated into the function, the optimizer fails to perform vectorization resulting in massive loss of performance.
Meta
rustc --version --verbose
:Nightly compiler exhibits identical behavior.
CPU is Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz