These are LLVM passes. Is there any reason to believe this is a rustc bug and not an LLVM bug?
I wouldn't say so, but a Rust reproducer is the only one I have, and I couldn't find a way to open an issue on the rust-lang/llvm-project fork. Apologies if I missed it.
cc @jethrogb
cc @scottconstable @raoulstrackx
Are these crates being compiled with the MIP optimization plugin?
Probably not
Then I think the culprit is probably this issue in CodeGen/RDF that I reported back in April: http://lists.llvm.org/pipermail/llvm-dev/2020-April/141332.html. Looks like it hasn't been fixed yet. I'll open up a bug report and ping Krzysztof.
Bug report created: https://bugs.llvm.org/show_bug.cgi?id=46808. Hopefully this will address the issue reported by @jberci
I've reduced the example from the initial post to one file without external dependencies: main.rs
$ time rustc main.rs
real 0m2.223s
user 0m2.115s
sys 0m0.107s
$ time rustc -C llvm-args=--x86-experimental-lvi-inline-asm-hardening -C target-feature=+lvi-cfi,+lvi-load-hardening main.rs
real 1m47.199s
user 1m47.015s
sys 0m0.158s
@jethrogb Would it be possible to run that example with Linux perf to see which function is responsible for adding all those cycles?
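(The exact perf invocation isn't recorded in this thread, but assuming a stock Linux perf, something along these lines should produce a profile like the one below:)

$ perf record rustc -C llvm-args=--x86-experimental-lvi-inline-asm-hardening -C target-feature=+lvi-cfi,+lvi-load-hardening main.rs
$ perf report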
63.13% rustc libLLVM-10-rust-1.46.0-nightly.so [.] std::_Function_handler<void (llvm::ImmutableGraph<llvm::MachineInstr*, int>::Node const*, bool), (anonymous namespace)::X86LoadValueInjectionLoadHardeningPass::elimMitigatedEdgesAndNodes((anonymous namespace)::MachineGadgetGraph&, llvm::ImmutableGraph<llvm::MachineInstr*, int>::EdgeSet&, llvm::ImmutableGraph<llvm::MachineInstr*, int>::NodeSet&) const::$_3>::_M_invoke
7.65% rustc libLLVM-10-rust-1.46.0-nightly.so [.] std::_Rb_tree<std::pair<unsigned int, llvm::LaneBitmask>, std::pair<unsigned int, llvm::LaneBitmask>, std::_Identity<std::pair<unsigned int, llvm::LaneBitmask> >, std::less<std::pair<unsigned int, llvm::LaneBitmask> >, std::allocator<std::pair<unsigned int, llvm::LaneBitmask> > >::_M_insert_unique<std::pair<unsigned int, llvm::LaneBitmask> >
6.35% rustc libLLVM-10-rust-1.46.0-nightly.so [.] llvm::rdf::RegisterAggr::makeRegRef
5.97% rustc libLLVM-10-rust-1.46.0-nightly.so [.] llvm::rdf::Liveness::computePhiInfo
2.84% rustc rustc [.] malloc
2.55% rustc rustc [.] free
1.46% rustc libLLVM-10-rust-1.46.0-nightly.so [.] std::_Rb_tree_increment
1.42% rustc libLLVM-10-rust-1.46.0-nightly.so [.] llvm::rdf::RegisterAggr::insert
1.32% rustc libc-2.23.so [.] __memset_avx2
1.19% rustc libLLVM-10-rust-1.46.0-nightly.so [.] llvm::rdf::RegisterAggr::clearIn
0.96% rustc libLLVM-10-rust-1.46.0-nightly.so [.] (anonymous namespace)::X86LoadValueInjectionLoadHardeningPass::runOnMachineFunction
0.85% rustc librustc_driver-582725d49f41b219.so [.] rustc_data_structures::obligation_forest::ObligationForest<O>::process_obligations
0.61% rustc libLLVM-10-rust-1.46.0-nightly.so [.] (anonymous namespace)::X86LoadValueInjectionLoadHardeningPass::elimMitigatedEdgesAndNodes
So it looks like ~63% is due to the LVI load hardening pass (specifically, the greedy heuristic for inserting LFENCEs), and another ~24% comes from RDF. I take it that Rust probably does a lot of aggressive inlining and LTO, and the function being mitigated is just absolutely huge?
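To make the scaling problem concrete, here is a hypothetical cost sketch in Rust (emphatically not the actual LLVM implementation, which works on an ImmutableGraph of MachineInstrs): if every inserted LFENCE triggers a rescan of all remaining gadget edges, the total work grows roughly quadratically with the size of the function's gadget graph, which would explain why one enormous inlined function is so painful.

// Hypothetical cost sketch only -- not the actual LLVM pass. Models greedy
// LFENCE insertion that rescans the whole gadget graph after every cut.
use std::collections::HashSet;

type Node = u32;
type Edge = (Node, Node); // (gadget source, gadget sink)

// Stand-in for the real "best fence location" heuristic: pick any edge.
fn pick_cut(alive: &HashSet<Edge>) -> Option<Edge> {
    alive.iter().next().copied()
}

fn greedy_mitigate(mut alive: HashSet<Edge>) -> usize {
    let mut fences = 0;
    while let Some((src, _)) = pick_cut(&alive) {
        fences += 1;
        // The expensive step: each fence triggers a full pass over the
        // remaining edges to eliminate the ones it already mitigates, so a
        // function with E gadget edges costs O(E) per fence -- roughly
        // O(E^2) overall for a single huge function.
        alive.retain(|&(s, _)| s != src);
    }
    fences
}

fn main() {
    // One enormous inlined function corresponds to one very large edge set.
    let edges: HashSet<Edge> = (0..10_000u32).map(|i| (i / 4, i)).collect();
    println!("inserted {} fences", greedy_mitigate(edges));
}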
> I take it that Rust probably does a lot of aggressive inlining and LTO, and the function being mitigated is just absolutely huge?
Yes to all of the above. Most of the 5000-line file is one function. Is it unreasonable to expect such a case to ever be fast with LVI mitigations?
I think it should still be possible to have fast compilation for very large functions. I'm curious about @jberci's observation that a newer uint crate seems to solve the issue. Does the updated crate also contain a 5,000 LoC function? There might be something more subtle going on.
I created a patch that should fix the overhead caused by X86LoadValueInjectionLoadHardeningPass::elimMitigatedEdgesAndNodes(): https://reviews.llvm.org/D84471
A complete solution will also require a patch to address the complexity issues in RDF.
@jberci would you be able to test your build with Scott's patch?
Definitely. I built a plain rustc from commit 8ad7bc3f428 (mostly to check whether I'm even doing it right) and another one with the patch from D84471 applied on top of that.

For uint 0.2.1, the plain build took around half an hour to compile my sample, as before, and the one with D84471 applied took just under 4 minutes (disabling the LVI passes brings that down to 10s). With the newer 0.4 version of uint, both rustc builds behave about the same (~45s with LVI passes, ~25s without).

I'm still working on building the original project that surfaced this with the patched rustc, but based on what I've been able to test locally so far, the effect should be the same as for the sample here.
To summarize where we stand: we have committed a patch to address the slowdown in the LVI LFENCE insertion algorithm, and Krzysztof has committed several patches on the RDF side that reduce, but may not completely eliminate, the overhead of the live variables analysis (see https://bugs.llvm.org/show_bug.cgi?id=46808). It would be good to re-measure the compile time overhead at this point. If it remains too high, we may need a more involved reworking of some of the RDF algorithms to make them more efficient.
OK, we now have a Rust nightly that includes the patches from @scottconstable and Krzysztof. @jberci, are you happy with the compile times now?
As far as I can tell, this has been resolved, so closing the issue. Please let me know if this is not the case.
There seems to be a significant compile time regression when using the LVI mitigation passes on some crates. The smallest case I could come up with is:
TOML:
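(The TOML snippet itself did not survive in this copy of the issue. Judging from the versions discussed later in the thread, it was presumably a minimal manifest along these lines, with the package name here being a placeholder:)

[package]
name = "lvi-repro"   # hypothetical placeholder name
version = "0.1.0"

[dependencies]
uint = "=0.2.1"      # the old uint version that triggers the slowdown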
This happens with the x86_64-fortanix-unknown-sgx target, but also without it with the following codegen options (.cargo/config.toml):
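(The contents of the config were likewise lost here; a plausible reconstruction from the rustc flags quoted earlier in the thread would be:)

[build]
rustflags = [
    "-C", "target-feature=+lvi-cfi,+lvi-load-hardening",
    "-C", "llvm-args=--x86-experimental-lvi-inline-asm-hardening",
]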
Expected: Without the config.toml, on a vanilla rustup-installed nightly, a cargo clean && cargo build takes a second or two.

Instead: With LVI mitigations enabled, compile time explodes to about 35 minutes on my machine.
If I disable linking dead code, it drops to 2 minutes, although that's mostly because the sample above is simple. Using shorter uints (e.g. the ones commented out above) makes it drop further still, but compile time still noticeably depends on the uint length.
Using a newer version of the uint crate (e.g. 0.4.1) also seems to solve the issue. The main difference that I could find is that the older version uses inline assembly. Not sure if that's a red herring or not.

Meta
rustc --version --verbose: