rust-lang / rust


Slow compilation of async fn that creates a vector with many Strings created from string literals, even with no .await #115327

Open shepmaster opened 1 year ago

shepmaster commented 1 year ago
```rust
fn main() {
    make_menagerie();
}

async fn make_menagerie() -> Vec<String> {
    vec![
        "monster".to_string(), // <=== repeat this line 1000 to 10000 times.
    ]
}
```

Build with `cargo build`. I'm using Rust 1.72 on `aarch64-apple-darwin`.

In addition to the slow compilation time, I also see large memory usage. With 2500 lines, my OS reported rustc as consuming 20GB.

Exploration

Different versions

There was no obvious correlation.

| Version | Time (sec) |
|---------|------------|
| 1.63 | 18.984 |
| 1.64 | 19.224 |
| 1.65 | 18.797 |
| 1.66 | 18.993 |
| 1.67 | 18.917 |
| 1.68 | 17.994 |
| 1.69 | 18.349 |
| 1.70 | 21.379 |
| 1.71 | 20.655 |
| 1.72 | 22.587 |

Different numbers of lines

This appears to show time is proportional to (number of lines)^2

| Lines | Time (sec) |
|-------|------------|
| 500 | 0.133 |
| 1000 | 2.831 |
| 1500 | 6.512 |
| 2000 | 13.545 |
| 2500 | 22.404 |
| 3000 | 35.894 |
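
One way to eyeball the quadratic claim (my own arithmetic on the table above, not from the issue): if time is proportional to lines², then time / lines² should be roughly constant. For these measurements it stays between about 2.8e-6 and 4.0e-6 from 1000 lines upward; the 500-line run is presumably dominated by fixed overhead.

```rust
// Ratio time / lines^2 for the measurements above (seconds per line^2).
// If compilation time grows quadratically, this ratio is roughly constant:
// 1000 -> 2.83e-6, 1500 -> 2.89e-6, 2000 -> 3.39e-6,
// 2500 -> 3.58e-6, 3000 -> 3.99e-6.
fn per_line_squared(lines: f64, secs: f64) -> f64 {
    secs / (lines * lines)
}
```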

String slices

Switching the return type to Vec<&'static str> instead of Vec<String> resulted in fast compilation (<250 ms).
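
For reference, a sketch of this variant (the exact shape is my assumption; the issue only describes the change):

```rust
// Borrowing the string literals instead of allocating Strings;
// this variant was reported to compile in <250 ms.
async fn make_menagerie() -> Vec<&'static str> {
    vec![
        "monster", // repeat this line 1000 to 10000 times
    ]
}
```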

Using a function

Creating a helper function and calling it inside of vec![] resulted in fast compilation (<250 ms).

```rust
fn make_string() -> String { "monster".to_string() }
```
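
Putting the helper together with a call site (my reconstruction of the fast variant; <250 ms reported):

```rust
// The helper moves the String construction out of the async fn's body,
// so each vec! element is just a function call.
fn make_string() -> String {
    "monster".to_string()
}

async fn make_menagerie() -> Vec<String> {
    vec![
        make_string(), // repeat this line 1000 to 10000 times
    ]
}
```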

Using another method

Switching to returning a usize instead of a String reduced compilation time, but it was still surprisingly slow (2.5 sec).

```rust
vec![
    "monster".len(),
]
```

Avoiding async fn

Removing `async` from `make_menagerie` reduced compilation time (<750 ms).
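
The non-async variant (my assumption: identical to the original except `async` is removed):

```rust
// Same body, but a plain fn. As noted later in the thread, the non-async
// version gets a single drop chain rather than a quadratic number of
// cleanup blocks.
fn make_menagerie() -> Vec<String> {
    vec![
        "monster".to_string(), // repeat this line 1000 to 10000 times
    ]
}
```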

Test case generation

For trying different things, I used two files

main.rs

```rust
fn main() {
    make_menagerie();
}

async fn make_menagerie() -> Vec<String> {
    include!("repeat.rs")
}
```

And to generate 2500 lines:

```sh
(echo 'vec!['; yes '"monster".to_string(),' | head -n 2500; echo ']') > src/repeat.rs
```

nnethercote commented 1 year ago

I profiled this a bit and a lot of the time is spent in MIR dataflow analysis: `MaybeUninitializedPlaces`, `EverInitializedPlaces`, `MaybeInitializedPlaces`, `iterate_to_fixpoint`, etc. Plus lots of `ChunkedBitSet` manipulations within these.

But the real problem is a quadratic amount of unwinding code generated. I looked at the MIR generated with 100 lines in the vector. One key part is this:

```
    let mut _9: std::string::String;
    let _10: &str;
    let mut _11: std::string::String;
    let _12: &str;
    ...
    let mut _205: std::string::String;
    let _206: &str;
    let mut _207: std::string::String;
    let _208: &str;
```

which makes sense. There are also vast numbers of blocks like this:

```
    bb104 (cleanup): {
        drop(_205) -> [return: bb105, unwind terminate(cleanup)];
    }

    bb105 (cleanup): {
        drop(_203) -> [return: bb106, unwind terminate(cleanup)];
    }

    bb106 (cleanup): {
        drop(_201) -> [return: bb107, unwind terminate(cleanup)];
    }
```

In fact, a quadratic number of blocks: 99 for _9, 98 for _11, ..., 1 for _205, 0 for _207.
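
As a quick check of that count (my own arithmetic, not from the thread): with n elements, element i's unwind path must drop the i-1 already-initialized Strings, so the cleanup blocks total 0 + 1 + ... + (n-1) = n(n-1)/2, i.e. 4950 blocks for n = 100.

```rust
// Total cleanup blocks for n vector elements: the per-element chain
// lengths 0 + 1 + ... + (n - 1), which equals n * (n - 1) / 2.
fn cleanup_blocks(n: u64) -> u64 {
    (0..n).sum()
}
```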

Because they are in chains, it seems possible for these separate chains to be merged into a single chain that drops `_205`, `_203`, ..., `_11`, `_9`, with the other chains handled by just jumping into that single chain at different blocks as necessary. How hard that is to implement, I have no idea.
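
A sketch of that merged shape (my own illustration in MIR-style pseudocode, not compiler output): one shared chain, which each statement's unwind edge enters at the depth matching how many locals are live at that point.

```
bb_drop_205 (cleanup): { drop(_205) -> [return: bb_drop_203, unwind terminate(cleanup)]; }
bb_drop_203 (cleanup): { drop(_203) -> [return: bb_drop_201, unwind terminate(cleanup)]; }
    ...
bb_drop_11  (cleanup): { drop(_11)  -> [return: bb_drop_9,   unwind terminate(cleanup)]; }
bb_drop_9   (cleanup): { drop(_9)   -> [return: resume,      unwind terminate(cleanup)]; }
```

This keeps the number of cleanup blocks linear in the number of locals, because each block is shared by every unwind path that reaches it.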

shepmaster commented 1 year ago

> Because they are in chains

Interesting!

It looks like if you extract a function[^1] and call that function in the `vec!` macro, then there's only one chain created, with the properties you want. The single drop chain also appears if you make it into a non-async function.

This makes it seem like there's already some piece of code that does the drop chain optimization, but perhaps it's not advanced enough to handle whatever the async transform does to the code.

[^1]: `fn make() -> String { "monster".to_string() }`

nnethercote commented 1 year ago

I just grepped for `drop.*chain` in the source code and found this, which seems awfully relevant: `scope.rs`

lqd commented 1 year ago

The 1.69 to 1.70 30% bump bisects to #109458.

(Also, possibly interesting for its future stabilization path, `-Zdrop-tracking-mir` makes things much worse. This is admittedly a stress test, but at N=1250 it's like 22% on a debug build and 85% on a check build.)

tmandry commented 1 year ago

> I profiled this a bit and a lot of the time is spent in MIR dataflow analysis: `MaybeUninitializedPlaces`, `EverInitializedPlaces`, `MaybeInitializedPlaces`, `iterate_to_fixpoint`, etc. Plus lots of `ChunkedBitSet` manipulations within these.

This part sounds specific to async; the generator capture analysis runs these.

> But the real problem is a quadratic amount of unwinding code generated.

This should not be specific to async and points to general improvements we can make in MIR building.

nnethercote commented 1 year ago

> This should not be specific to async and points to general improvements we can make in MIR building.

But @shepmaster said that removing the async keyword greatly reduced compile time.