matthiaskrgr opened this issue 6 years ago
I really appreciate the idea to improve documentation on this front. My current sources of information are
Maybe we can use this issue to collect all the places where information about optimizing Rust executables can be found. In the end, we could probably write a section (an appendix?) for TRPL.
We could also collect questions, e.g.
(I've actually long wanted to find an answer to these questions.)
After rethinking it, I don't believe a section in TRPL would be a good idea!
I think the important items (and their relations) to explain are:

| | LTO | ThinLTO |
|---|---|---|
| `codegen-units = 1` | | |
| `codegen-units = n` | | |
| `opt-level = {0-3, "s", "z"}` | | |
It's also notable that different releases (1.24, 1.25, 1.26) have different behaviours and bugs (for example, #48163 significantly sped up link time, i.e. compile time, for monolithic LTO).
From what I understand, monolithic LTO merges all the object files into one huge translation unit and then just pretends the entire program lives in a single file, running its optimizations on everything at once, sequentially.
While compiling, ThinLTO writes summary metadata for modules and functions (let's call these pieces "snippets") into an index. During link-time optimization, it optimizes snippets in parallel and only loads the metadata of related snippets into RAM, not everything. This makes it use less memory than monolithic LTO (there is no need to load everything at once) and makes it scalable (it can optimize N snippets at a time). In the future we might even get incremental ThinLTO, which would only reoptimize snippets that changed or whose dependencies changed (see #47660). During a talk on ThinLTO it was said that ThinLTO only performs a subset of the optimizations that monolithic LTO does; however, since it is lean in memory usage and optimizes in parallel, it can apply its optimizations more aggressively without a noticeable increase in compile time (or out-of-memory errors :P).
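To make the difference between the two modes more concrete, here is a toy model in Rust. It is only a sketch under heavy assumptions and not how LLVM implements either mode: `Module`, `monolithic_lto`, `build_index`, `imports_for`, `thin_lto` and `example_modules` are hypothetical names, a "module" is reduced to a call graph, and "optimizing" is reduced to deciding what a module has to look at. Monolithic LTO merges everything into one unit; the ThinLTO model first builds a small index of which module defines which function, then handles every module on its own thread, importing only the functions it actually calls.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::thread;

/// Toy "module" (hypothetical): each function it defines, mapped to its callees.
struct Module {
    name: &'static str,
    calls: BTreeMap<&'static str, Vec<&'static str>>,
}

/// Monolithic LTO, modelled as merging every module's functions into one
/// big unit that would then be optimized as a whole, sequentially.
fn monolithic_lto(modules: &[Module]) -> BTreeSet<&'static str> {
    let mut merged = BTreeSet::new();
    for m in modules {
        merged.extend(m.calls.keys().copied());
    }
    merged
}

/// ThinLTO step 1: a small global index recording only which module
/// defines each function (no function bodies).
fn build_index(modules: &[Module]) -> BTreeMap<&'static str, &'static str> {
    let mut index = BTreeMap::new();
    for m in modules {
        for &f in m.calls.keys() {
            index.insert(f, m.name);
        }
    }
    index
}

/// Functions defined in *other* modules that `m` calls: the only pieces
/// `m` needs to import to optimize across module boundaries.
fn imports_for(
    m: &Module,
    index: &BTreeMap<&'static str, &'static str>,
) -> BTreeSet<&'static str> {
    m.calls
        .values()
        .flatten()
        .copied()
        .filter(|callee| matches!(index.get(callee), Some(&def) if def != m.name))
        .collect()
}

/// ThinLTO step 2: each module is processed on its own thread, consulting
/// only the shared index instead of every other module's contents.
fn thin_lto(modules: &[Module]) -> BTreeMap<&'static str, BTreeSet<&'static str>> {
    let index = build_index(modules);
    let index = &index;
    thread::scope(|s| {
        let mut handles = Vec::new();
        for m in modules {
            handles.push(s.spawn(move || (m.name, imports_for(m, index))));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

/// Hypothetical example program split across three modules.
fn example_modules() -> Vec<Module> {
    vec![
        Module { name: "app", calls: BTreeMap::from([("main", vec!["parse", "render"])]) },
        Module {
            name: "parser",
            calls: BTreeMap::from([("parse", vec!["lex"]), ("lex", vec![])]),
        },
        Module {
            name: "ui",
            calls: BTreeMap::from([
                ("render", vec!["fmt_cell"]),
                ("fmt_cell", vec![]),
                ("unused_widget", vec![]),
            ]),
        },
    ]
}
```

With `example_modules()`, `monolithic_lto` puts all six functions into one unit, while `thin_lto` reports that only `app` needs to import anything (`parse` and `render`); `unused_widget` is never pulled into another module's work.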
By default, `cargo build --release` builds with `opt-level=3`; however, LTO may also be desirable: when we want very small binaries, we can combine `opt-level="z"` with `lto=true`.
We should probably mention this as well.
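```toml
[profile.release]
lto = true          # x: true for monolithic LTO, "thin" for ThinLTO
codegen-units = 1
opt-level = 3       # y: 3, "s" or "z" (one build per combination)
```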
Size of the cargo binary in bytes:

| | monolithic LTO | ThinLTO |
|---|---|---|
| `opt-level = 3` | 11709976 | 12637656 |
| `opt-level = "s"` | 10059336 | 11355552 |
| `opt-level = "z"` | 10315192 | 11501104 |
(IIRC, "z" should optimize for size even more aggressively than "s", so it looks like something is a bit weird here. :/ )
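To put numbers on these differences, a few lines of Rust can compute the relative sizes from the measurements above. `SIZES`, `percent_larger` and `thin_overhead` are just illustrative names; the data are the measured byte counts and nothing else is assumed.

```rust
/// Sizes of the cargo binary in bytes, copied from the measurements:
/// (opt-level, monolithic LTO, ThinLTO).
const SIZES: [(&str, u64, u64); 3] = [
    ("3", 11_709_976, 12_637_656),
    ("s", 10_059_336, 11_355_552),
    ("z", 10_315_192, 11_501_104),
];

/// How much larger `b` is than `a`, in percent.
fn percent_larger(a: u64, b: u64) -> f64 {
    (b as f64 - a as f64) / a as f64 * 100.0
}

/// For each opt-level, how much larger the ThinLTO binary is than the
/// monolithic-LTO binary.
fn thin_overhead() -> Vec<(&'static str, f64)> {
    SIZES
        .iter()
        .map(|&(level, fat, thin)| (level, percent_larger(fat, thin)))
        .collect()
}
```

By this arithmetic, the ThinLTO binary is about 7.9% larger at `opt-level = 3`, 12.9% at "s" and 11.5% at "z". `percent_larger(SIZES[1].1, SIZES[2].1)` shows "z" coming out about 2.5% larger than "s" with monolithic LTO (about 1.3% with ThinLTO), which is the oddity noted above.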
Last but not least, there are codegen units: a crate is split into parts that are compiled in parallel (before (thin) LTO runs). I guess `codegen-units = 1` is a bit like monolithic LTO and `codegen-units > 1` is a bit like ThinLTO. There are tickets out there which seem to indicate that multiple codegen units worsen performance (#47665, #47745, ...). Currently, it seems that by default every crate is split into several codegen units that are compiled in parallel, while several crates are also being compiled in parallel at the same time. (In my opinion this is somewhat unnecessary parallelism, but choosing a more reasonable number of codegen units based on the host machine's CPU core count would mean builds are only reproducible on machines with the same number of cores, which is also bad... :( )
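The codegen-units point can be sketched as a toy model in Rust as well. This is an illustration under explicit assumptions: `fnv1a`, `partition` and `compile_in_parallel` are hypothetical, and hashing function names into `n` buckets is not rustc's real partitioning scheme. It shows two things: for a fixed `n`, the grouping (and therefore what can be optimized together before LTO) does not depend on how the threads are scheduled; but if `n` were derived from the host's core count, two machines could produce different groupings, which is the reproducibility concern.

```rust
use std::thread;

/// 64-bit FNV-1a: a fixed, platform-independent hash, so the partitioning
/// below is deterministic for a given `n`.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in s.bytes() {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hypothetical partitioner: assign each function to one of `n` codegen units.
fn partition(functions: &[&str], n: usize) -> Vec<Vec<String>> {
    assert!(n > 0, "need at least one codegen unit");
    let mut units = vec![Vec::new(); n];
    for f in functions {
        let cgu = (fnv1a(f) % n as u64) as usize;
        units[cgu].push(f.to_string());
    }
    units
}

/// "Compile" every unit on its own thread. Results are collected in unit
/// order, so the output does not depend on which thread finishes first.
fn compile_in_parallel(units: &[Vec<String>]) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = units
            .iter()
            .map(|unit| s.spawn(move || format!("object file for [{}]", unit.join(", "))))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

For example, `partition(&["main", "parse", "lex"], 1)` yields a single unit holding all three functions, the closest this toy gets to `codegen-units = 1`. Calling `partition` twice with the same `n` always gives the same grouping, while changing `n` changes it.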
Please correct me if I'm wrong!!
Interesting links:
- ThinLTO (video): https://www.youtube.com/watch?v=p9nH2vZ2mNo
- ThinLTO (blog post): http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html
- LTO: https://llvm.org/docs/LinkTimeOptimization.html
The documentation team has been talking about creating a new guide for the `rustc` CLI, similar to the CLI section in the Rustdoc book. Tracking issue: https://github.com/rust-docs/team/issues/11. This might be a good place to talk about LTO and codegen units.
Update: this has now been merged, and lives here: https://github.com/rust-lang/rust/tree/master/src/doc/rustc
In fact, the `codegen-units` and `lto` options are already mentioned in the rustc book:
But after skimming through the previous comments in this issue, it seems like there's room to expand the descriptions.
There seems to be a lot of confusion about the performance implications of LTO, ThinLTO, codegen units and the default optimization settings of build targets. Maybe we can clarify this somehow.
Where would be the best place for this?