`cargo test` performance regression on Windows on version 1.75.0

Jefffrey commented 5 months ago

Code

DataFusion issue with complete details: https://github.com/apache/arrow-datafusion/issues/8696

arrow-datafusion runs cargo test on Windows runner as part of CI. When on Rust version 1.74.1 (and below), the check takes under 30 minutes. After upgrading to Rust version 1.75.0, it now takes over 3 hours, with no other change in code on our side. This seems to only take effect on Windows, as Linux/Mac tests didn't seem to be affected.

After debugging, I found the regression occurs between toolchains nightly-2023-10-29 (rust e5cfc5547) and nightly-2023-10-30 (rust 608e9682f).

We are running on GitHub actions runner windows-latest.

So expected run is here, on toolchain nightly-2023-10-29: https://github.com/apache/arrow-datafusion/actions/runs/7394674719/job/20116418078

Commit: https://github.com/apache/arrow-datafusion/commit/c85e67c5a77d5166bd11e9b31db3b9189c4d0325
Duration of 23m 43s

When I bump to toolchain nightly-2023-10-30, with no other code changes: https://github.com/apache/arrow-datafusion/actions/runs/7394848426/job/20116910586

Commit: https://github.com/apache/arrow-datafusion/commit/2e99e2151e1bb34dd16c40c364a8e1363346d422
Duration of 3h 22m 44s

Slow tests

The slowness occurs primarily in two tests.

tpcds_planning

Test code: https://github.com/apache/arrow-datafusion/blob/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/core/tests/tpcds_planning.rs

On the good run (before regression):

Log: https://github.com/apache/arrow-datafusion/actions/runs/7394674719/job/20116418078#step:5:1886
Duration: 8m 15s

On the bad run (after regression):

Log: https://github.com/apache/arrow-datafusion/actions/runs/7394848426/job/20116910586#step:5:1888
Duration: 1h 24m 59s

sqllogictest

Test code: https://github.com/apache/arrow-datafusion/tree/1179a76567892b259c88f08243ee01f05c4c3d5c/datafusion/sqllogictest

On the good run (before regression):

Log: https://github.com/apache/arrow-datafusion/actions/runs/7394674719/job/20116418078#step:5:5691
Duration: 6m 3s (calculated from the timestamps from start of run, until start of next test run)

On the bad run (after regression):

Log: https://github.com/apache/arrow-datafusion/actions/runs/7394848426/job/20116910586#step:5:5717
Duration: 1h 46m

These two tests are the only ones with a significant delta, the rest don't seem affected by the upgrade.

Version it worked on

Ran fast with Rust 1.74.1 (and nightly-2023-10-29)

Version with regression

Ran slow on Rust 1.75.0 (and nightly-2023-10-30)

Additional context

Apologies if the example is too large to easily determine where the issue is. I'll try to reduce this to a smaller MRE, as I don't have a Windows machine to locally test on, so have had to check via CI.

comphead commented 5 months ago

Maybe this also helps a bit to investigate, if we run datafusion windows CI with

RUSTFLAGS: "-C debuginfo=0 -C opt-level=1 -C target-feature=+crt-static -C incremental=false -C codegen-units=256"

it helps to reduce timing from 190m to 60m, but it still way longer than before

ChrisDenton commented 5 months ago

It might help to look at the raw log and narrow down which steps specifically are taking a long time.

So the issue is maybe around the generation of debuginfo?

Jefffrey commented 5 months ago

Just a quick update, narrowed down the testing:

Limit to running only tpcds_physical_q44 with backtrace feature:

cargo test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
Where our backtrace feature will enable backtrace from std:
- https://github.com/apache/arrow-datafusion/blob/ca260d99f17ef667b7f06d2da4a67255d27c94a9/datafusion/common/src/error.rs#L18-L20
- https://github.com/apache/arrow-datafusion/blob/ca260d99f17ef667b7f06d2da4a67255d27c94a9/datafusion/common/src/error.rs#L429-L451

good (2023-10-29 nightly) = 70 seconds

bad (2023-10-30 nightly) = 692 seconds

Without backtrace:

cargo test -p datafusion --test tpcds_planning tpcds_physical_q44

good (2023-10-29 nightly) = 1.25 seconds

bad (2023-10-30 nightly) = 1.26 seconds

Edit:

Also just to be clear, we are running with env vars:

RUSTFLAGS: "-C debuginfo=0"
RUST_BACKTRACE: "1"

For both good & bad runs.

Maybe this also helps a bit to investigate, if we run datafusion windows CI with
RUSTFLAGS: "-C debuginfo=0 -C opt-level=1 -C target-feature=+crt-static -C incremental=false -C codegen-units=256"
it helps to reduce timing from 190m to 60m, but it still way longer than before

The speedup seems to occur due to adding -C opt-level=1 -C target-feature=+crt-static -C incremental=false -C codegen-units=256 flags as the debuginfo flag was already set previously

saethlin commented 5 months ago

https://github.com/rust-lang/rust/pull/117089 was merged in the right window, and the PR description notes that it contains https://github.com/rust-lang/backtrace-rs/pull/569 which sounds relevant.

apiraino commented 5 months ago

cc @wesleywiser who authored https://github.com/rust-lang/rust/pull/117089 and changes to the backtrace-rs submodule (IIUC)

wesleywiser commented 5 months ago

I'm able to repro locally:

> $env:RUSTFLAGS="-Cdebuginfo=0"
>  cargo +nightly-2023-10-29 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
test tpcds_physical_q44 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 197 filtered out; finished in 36.23s

> cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
running 1 test
test tpcds_physical_q44 has been running for over 60 seconds

Looking at profiling data, the vast majority of time in the regressed case is spent in dbghelp!SymAddrIncludeInlineTrace which is a necessary call I introduced in rust-lang/backtrace-rs#569 when symbolizing the backtrace. I'm investigating if there is some other way to avoid calling that function but haven't had any luck.

One thing that's very interesting is that if do not set RUSTFLAGS="-Cdebuginfo=0", the regression disappears. This suggests to me we're hitting a pathological issue in dbghelp when there isn't any debug info present. On Windows, debug info is basically required to get any kind of meaningful backtrace so setting both -Cdebuginfo=0 and --feature backtrace in the test case doesn't make much sense to me.

@Jefffrey do you know why -Cdebuginfo=0 is being set? My assumption is that the intent is to make builds faster while still being able to test --feature backtrace. If that's the case, I would suggest actually setting -Cdebuginfo=line-tables-only which generates the minimal amount of information to give proper backtraces and doesn't have much impact on compile times (especially in comparison to -Cdebuginfo=2 which is the default in this case):

> $env:RUSTFLAGS="-Cdebuginfo=0"
>  cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
    Finished test [unoptimized + debuginfo] target(s) in 1m 25s

> $env:RUSTFLAGS="-Cdebuginfo=line-tables-only"
> cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
    Finished test [unoptimized + debuginfo] target(s) in 1m 32s

> $env:RUSTFLAGS="-Cdebuginfo=2"
> cargo +nightly-2023-10-30 test -p datafusion --test tpcds_planning tpcds_physical_q44 --features backtrace
...
    Finished test [unoptimized + debuginfo] target(s) in 2m 26s

Jefffrey commented 5 months ago

Thank you for the detailed breakdown! @wesleywiser

@Jefffrey do you know why -Cdebuginfo=0 is being set? My assumption is that the intent is to make builds faster while still being able to test --feature backtrace. If that's the case, I would suggest actually setting -Cdebuginfo=line-tables-only which generates the minimal amount of information to give proper backtraces and doesn't have much impact on compile times (especially in comparison to -Cdebuginfo=2 which is the default in this case):

The reasoning you provide makes sense, I'm not sure why both debuginfo=0 was set along with backtrace (probably had debuginfo=0 set first then backtrace feature was added as a separate feature so needed to be added to test).

There was a comment stating do not produce debug symbols to keep memory usage down for the debuginfo=0 setting fwiw, so I guess that's the main reason

I'll experiment with setting debuginfo=line-tables-only and will see if that helps speed things back up :+1:

Jefffrey commented 5 months ago

Thanks for the config fix suggested @wesleywiser

This does eliminate the performance regression (or at least bring the times within a more reasonable range of each other).

Even though it was caused by some weird settings on our end (disabling debuginfo but still using backtrace, on Windows), it was still a significant regression in performance. Let me know if this is the expected behaviour now and I can close this issue, or to keep this open if you're planning to investigate this weird edge case further :+1: (or move it to backtrace repo?)

ChrisDenton commented 4 months ago

This was discussed in the libs meeting. While this is an edge case and a likely bug in dbghlp, it was suggested that maybe backtrace could workaround it by somehow detecting if there is no debug info present? Even if it can only tell after trying, it could perhaps cache this knowledge so it doesn't try again the next time?

rust-lang / rust