rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Significant application performance degradation when making a module public #128730

Open day01 opened 3 months ago

day01 commented 3 months ago

I've encountered an unexpected and significant performance degradation in my application when changing a module's visibility from private to public. This behavior seems counterintuitive and potentially indicates a compiler optimization issue or an unexpected interaction between module visibility and performance.

my code:

mod black:

source: https://github.com/hayden4r4/blackscholes-rust/blob/master/src/lets_be_rational/mod.rs#L8

benches: https://github.com/hayden4r4/blackscholes-rust/blob/master/benches/black.rs

Current behavior

When the black module is made public, the overall application performance degrades by approximately 50%.

Expected

Changing a module's visibility should not have a significant impact on the application's overall performance. We would expect minimal to no performance change when modifying module visibility.

Environment

Rust version: 1.80.0 (051478957 2024-07-21)
Cargo version: 1.80.0 (376290515 2024-07-16)
OS: macOS, M1 Max

Does anyone know where the bug or problem might be?

Noratrieb commented 3 months ago

Does this reproduce when setting codegen-units to 1 in Cargo.toml?

day01 commented 3 months ago

I already have codegen-units set to 1: https://github.com/hayden4r4/blackscholes-rust/blob/master/Cargo.toml#L14

The benches run in release mode, so with codegen-units = 1 and fat LTO.

The results above were measured with this configuration.

Noratrieb commented 3 months ago

Visibility has an impact on many codegen-related decisions like codegen-unit partitioning and function instantiation, so it's not necessarily surprising that you got different behavior. It would be useful to have a minimal reproduction here, including assembly, that shows exactly which function mattered. I assume you have pub functions in the module; make each of them private in turn to test. Then extract something minimal out of your project that reproduces the regression. Just different assembly is good enough for a minimal reproduction; it doesn't necessarily need benchmarks. Then we can figure out why exactly the different optimization decision was made and whether there's anything we can do to improve it.

@rustbot label E-needs-mcve C-bug T-compiler A-codegen
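As a rough illustration of the shape such a minimal reproduction could take, here is a sketch of a library crate with one module whose visibility is toggled. The names `black`, `helper`, and `entry` and the arithmetic are hypothetical stand-ins, not code from blackscholes-rust, and there is no guarantee that this particular snippet shows any difference between the two variants.

```rust
// Hypothetical stand-in for the real crate: `black`, `helper`, and `entry`
// are illustrative names, and the arithmetic is arbitrary.

// Variant A: private module. Change this line to `pub mod black {` for
// variant B and diff the assembly of the two builds.
mod black {
    // `pub` on the function only exposes it outside the crate once the
    // enclosing module is itself `pub`.
    pub fn helper(x: f64) -> f64 {
        // Some floating-point work so the optimizer has a decision to make.
        (x * x + 1.0).sqrt() - x
    }
}

// Public entry point that keeps `helper` reachable in both variants.
pub fn entry(xs: &[f64]) -> f64 {
    xs.iter().map(|&x| black::helper(x)).sum()
}
```

Building each variant with `cargo rustc --release -- --emit=asm` and diffing the resulting `.s` files is one way to see whether the visibility change altered the generated code.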

saethlin commented 3 months ago

The visibility of this function is what matters: https://github.com/hayden4r4/blackscholes-rust/blob/59330ae06d31baea02fb4c5af18451e48c85da0f/src/lets_be_rational/black.rs#L65

I have no idea why. After some brief profiling with perf I don't see a difference in the fast and slow versions.

The usual hypothesis at this point is that somehow code alignment or some cache or branch predictor collision matters. The effect size is about right, but I don't know how to test that hypothesis.

I did this investigation on x86_64. The fact that this reproduces so exactly across architectures makes me doubt that this is a microarchitecture issue... But I'm not sure what else it could be.

day01 commented 3 months ago

@Noratrieb I tried to build a minimal reproduction but could not :) That is why I linked the whole project, with a description of how to trigger the performance degradation. @saethlin thanks for reproducing it on x86_64 and for pinpointing the function.

Additionally, I checked it with `#[inline(always)]` on all functions, and on randomly chosen ones, starting from the function at L65. The blackscholes project is relatively small and the problem spot is marked directly, so I hope it can serve as an example.
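For readers unfamiliar with the attributes involved, the following is a minimal sketch of how inlining hints are applied. The functions `scale` and `kernel` are hypothetical and are not taken from the project. Note that `#[inline(always)]` is a strong hint rather than a guarantee.

```rust
// Hypothetical functions; names and bodies are illustrative only.

// Strongly suggest inlining this function at its call sites.
#[inline(always)]
fn scale(x: f64) -> f64 {
    2.0 * x
}

// Suggest keeping this as a separate symbol, which makes it easier to
// locate in emitted assembly and in profiler output.
#[inline(never)]
pub fn kernel(x: f64) -> f64 {
    scale(x) + 1.0
}
```

Marking the suspect function `#[inline(never)]` in both the fast and the slow build may make the two versions easier to compare side by side, since the function then tends to appear under its own symbol.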

day01 commented 2 months ago

Any idea what this is or how to fix it?