rust-lang / rustc_codegen_cranelift

Cranelift based backend for rustc
Apache License 2.0

Extremely long Naga and wgpu compile times in release mode on Windows. #1344

Closed · Elabajaba closed this 1 year ago

Elabajaba commented 1 year ago

Hi, I was testing this out to see if it would work for building Bevy on Windows and debug builds worked reasonably well (a bit faster build time for a bit slower runtime on Windows+MSVC).

Then I tried a release build, to see if it was possible to get better runtime performance at similar build times to regular llvm rustc debug builds, and it had some issues. It ended up taking 192.4s (on a 5900x with 32GB RAM), during which it spent 165.8s building wgpu, with 164.2s of that being codegen, and 119.8s building naga (115.1s of which was codegen).

I confirmed that this wasn't some weird issue due to building Bevy by cloning and building the naga repo, and it took a similar amount of time.

bjorn3 commented 1 year ago

That is quite a bad result. I can't reproduce it locally on my Linux laptop (2 cores + HT, 4GB RAM), where everything finished in 130s (after subtracting the time to update the registry and fetch crates), during which all cores were fully loaded the entire time. Wgpu took only 43s and naga 46s. I will need some more information to be able to pinpoint the issue. Could you please run cargo clean and then RUSTFLAGS="-Zself-profile=/path/to/your/project/self_profile_results" dist/cargo-clif build --release and upload the files starting with wgpu- and naga- in the self_profile_results directory generated in your project by that command? The self-profile output shows exactly what rustc (and by extension cg_clif) spent how much time on. Note that you won't be able to upload the whole self_profile_results directory, as it is over 1GB, which is way past the GitHub upload limit. For me, wgpu and naga combined were only 50MB (or 100MB when including wgpu_core, wgpu_hal and wgpu_types too).
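For reference, the requested steps as a copyable script (the profile output path is a placeholder to adjust for your project):

```shell
# Remove previous build artifacts so every crate is recompiled and profiled.
cargo clean

# Build with rustc's self-profiler enabled; cg_clif inherits the flag.
# The output directory is a placeholder -- point it anywhere writable.
RUSTFLAGS="-Zself-profile=/path/to/your/project/self_profile_results" \
    dist/cargo-clif build --release

# Afterwards, the files of interest start with "wgpu-" and "naga-".
ls self_profile_results/ | grep -E '^(wgpu|naga)-'
```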

lqd commented 1 year ago

(you mean -Zself-profile rather than -Ztime-passes ?)

bjorn3 commented 1 year ago

Yeah, fixed. Thanks!

Elabajaba commented 1 year ago

I had to split it into 2 zip files since github has a 25MB limit (and doesn't support anything better than .zip). self_profile_results-1.zip (naga+wgpu core) self_profile_results-2.zip (wgpu hal, wgpu types, and wgpu)

bjorn3 commented 1 year ago

For wgpu there is one codegen unit (on thread 5) taking 44s to compile, spread across a dozen functions, and another codegen unit (on thread 16) taking 2.2 min, where compiling a single function completely dwarfs all the other functions.

[screenshot: self-profile timeline for wgpu]

For naga there is a single function taking 1.8 min (on thread 4)

[screenshot: self-profile timeline for naga]

It would really help to know which functions are so slow to compile. I'll work on some code to show function names in the self-profile results tomorrow.

bjorn3 commented 1 year ago

(Recording timings for the various passes of cranelift using generic_activity_with_arg_recorder would also be very useful in this case.)

Elabajaba commented 1 year ago

> (Recording timings for the various passes of cranelift using generic_activity_with_arg_recorder would also be very useful in this case.)

Hopefully I did it right. It was quite a bit bigger, so I had to split them up even more this time. naga-0023456.mm_profdata.zip wgpu_core-0011120.mm_profdata.zip wgpu_hal_types.zip wgpu-0005296.mm_profdata.zip

Elabajaba commented 1 year ago

In case it helps, I was using the msvc artifact downloaded from https://github.com/bjorn3/rustc_codegen_cranelift/actions/runs/4065395449

Elabajaba commented 1 year ago

I just tested on Linux (same computer, Fedora 37) and it's similar but less bad (using the linux artifact from the same actions run). In total it took 89.7s to run cargo-clif build --example many_foxes --release, with wgpu taking 71.0s (69.9s codegen) and naga taking 66.3s (63.4s codegen).

Elabajaba commented 1 year ago

> In case it helps, I was using the msvc artifact downloaded from https://github.com/bjorn3/rustc_codegen_cranelift/actions/runs/4065395449

I just built my own from master (21bdff8bc015b557bfae85b968dc4c100be85caf, ./y.rs build with no options, except -Ctarget-cpu=native in RUSTFLAGS), and it doesn't have this issue. (cargo-clif build --example many_foxes --release only took 24.5s)

Elabajaba commented 1 year ago

> In case it helps, I was using the msvc artifact downloaded from https://github.com/bjorn3/rustc_codegen_cranelift/actions/runs/4065395449

Did my own build on Windows MSVC using the master branch, and it's also fine? (apart from linking being extremely slow if you don't use lld)

I also tried downloading a different MSVC CI artifact, and it also hung for a similar amount of time.

afonso360 commented 1 year ago

I can also confirm: I tried to reproduce this yesterday with a local build and didn't experience any major slowdowns.

Except for the linker thing, which should be fixed once https://github.com/bytecodealliance/wasmtime/pull/5550 is merged.

I'll try to reproduce it with CI artifacts.

Edit: Unfortunately I wasn't able to reproduce it with the CI artifacts, which is weird. Here's my cargo timing report.

bjorn3 commented 1 year ago

> (Recording timings for the various passes of cranelift using generic_activity_with_arg_recorder would also be very useful in this case.)

> Hopefully I did it right. It was quite a bit bigger, so I had to split them up even more this time. naga-0023456.mm_profdata.zip wgpu_core-0011120.mm_profdata.zip wgpu_hal_types.zip wgpu-0005296.mm_profdata.zip

Unfortunately these files seem to be corrupt: crox produces a near-empty JSON file, and summarize doesn't show anything.
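For anyone following along: crox and summarize are the analysis tools from the rust-lang/measureme repository for reading `.mm_profdata` self-profile files. A sketch of installing and running them (file name taken from the upload above):

```shell
# Install the self-profile analysis tools from the measureme repository.
cargo install --git https://github.com/rust-lang/measureme summarize crox

# Print a per-activity time summary for one crate's profile data.
summarize summarize naga-0023456.mm_profdata

# Convert the profile to Chrome trace format (writes chrome_profiler.json),
# viewable in chrome://tracing or Perfetto.
crox naga-0023456.mm_profdata
```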

bjorn3 commented 1 year ago

https://github.com/bjorn3/rustc_codegen_cranelift/pull/1346 added more details to the self profile output. Could you try RUSTFLAGS="-Zself-profile=/path/to/your/project/self_profile_results -Zself-profile-events=default,function-args" dist/cargo-clif build --release with the latest CI build? (https://github.com/bjorn3/rustc_codegen_cranelift/actions/runs/4086944682)

Elabajaba commented 1 year ago

Well, those flags added a lot to the build times (7m25s instead of ~3m10s).

self_profile_results-1.zip self_profile_results-2.zip

bjorn3 commented 1 year ago

Thanks! I found the issue. To catch more bugs in CI, I made the build system enable debug assertions for cg_clif itself when building on CI. One effect of this is enabling the regalloc verifier, which can be very slow. Because debug assertions are enabled when building in CI, the CI artifacts have debug assertions and thus the regalloc verifier enabled, but local builds don't, which is why you could only reproduce this issue with a cg_clif from CI and not with one you built yourself. I think I will need to create separate dist jobs for CI artifacts that don't enable debug assertions.
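As a sketch of the mechanism described above (the exact CI wiring in cg_clif's build system may differ), enabling debug assertions for a release build looks roughly like this in Cargo terms:

```toml
# Hypothetical Cargo.toml fragment: a release profile with debug
# assertions turned on, as the CI configuration effectively did.
# With debug assertions enabled, cg_clif also enables Cranelift's
# extra verification passes (including the slow regalloc verifier),
# which explains the large compile-time gap between CI artifacts
# and local builds.
[profile.release]
debug-assertions = true
```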

bjorn3 commented 1 year ago

Opened https://github.com/bjorn3/rustc_codegen_cranelift/pull/1347

Elabajaba commented 1 year ago

Fixed in #1347