rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

clarify effects of lto, thinlto and codegen-units #48518

Open matthiaskrgr opened 6 years ago

matthiaskrgr commented 6 years ago

There seems to be a lot of confusion about the performance implications of lto, thinlto, codegen-units and the default optimizations of build targets. Maybe we can clarify this somehow.

Where would be the best place for this?

teiesti commented 6 years ago

I really appreciate the idea to improve documentation on this front. My current sources of information are

Maybe we can use this issue to collect all the places where information about optimizing Rust executables can be found. In the end we could probably write a section (an appendix?) for TRPL.

We could also collect questions, e.g.

  1. What is the difference between ThinLTO and LTO? How do they interact?
  2. Do I need to enable ThinLTO or is it done by default?
  3. How does ThinLTO work?

(I've actually long wanted to find an answer to these questions.)

Edit: I've just found the answers. See also.

teiesti commented 6 years ago

After rethinking it, I don't believe a section in TRPL would be a good idea!

matthiaskrgr commented 6 years ago

I think the important items (and their relations) to explain are

lto
thinlto
codegen-units = 1
codegen-units = n
opt-level = {0-3, ("s", "z")}

It's also notable that releases (1.24, 1.25, 1.26) have different behaviours/bugs (for example #48163 sped up monolithic lto link time (compiletime) significantly).

From what I understand, monolithic lto merges all the object files into one huge translation unit and then just pretends we have the entire program inlined into a single file, running its optimizations on everything at once, sequentially.

Thinlto, while compiling, writes interesting metadata for modules/functions (let's call them snippets) into an index. During link time optimization, it optimizes snippets in parallel and only loads the metadata of related snippets into RAM (not everything). This makes it use less memory than monolithic lto (no need to load everything at once) and makes it scalable (optimize N snippets at a time). In the future we might even get incremental thinlto (only reoptimize snippets that changed or whose dependencies changed, see #47660). During a talk on thinlto it was said that thinlto only performs a subset of the optimizations that monolithic lto does; however, since it is lean in memory usage and optimizes in parallel, it can do its optimizations more aggressively without a noticeable increase in compile time (or out of memory exceptions :P ).
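For anyone who wants to try the two modes side by side, a minimal Cargo.toml sketch could look like the following. It assumes a Cargo version that accepts the string values "fat" and "thin" for lto (older releases may only understand true/false), so treat it as an illustration rather than something guaranteed to work on every toolchain mentioned in this thread.

```toml
# Sketch only: enable exactly ONE of the lto lines below.
[profile.release]
opt-level = 3

# Monolithic ("fat") LTO over the whole dependency graph.
# In Cargo, lto = true is equivalent to "fat".
lto = "fat"

# ThinLTO: parallel, lower peak memory.
# lto = "thin"

# Note: per the current Cargo docs, leaving lto unset (false) still
# performs "thin local LTO" inside each crate across its codegen units.
```

Swapping which lto line is commented out and comparing build time, peak memory and binary size should make the trade-off described above measurable for a given project.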

By default cargo build --release builds with opt-level=3. However, lto may also be desired when we want very small binary sizes, in which case we can combine opt-level="z" with lto=true. We should probably mention this as well.

[profile.release]
lto=x
codegen-units=1
opt-level=y
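A concrete instance of the template above for the size-focused case might look like this. The values are just one plausible combination, not a recommendation; whether "s" or "z" produces the smaller binary seems to depend on the crate, so it is worth measuring both.

```toml
# Sketch of a size-focused release profile (values are illustrative).
[profile.release]
lto = true          # monolithic LTO
codegen-units = 1   # do not split crates into multiple codegen units
opt-level = "z"     # optimize for size; also try "s" and compare
```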

size of cargo binary in bytes

| opt-level | monolithic lto | thinlto  |
| 3         | 11709976       | 12637656 |
| "s"       | 10059336       | 11355552 |
| "z"       | 10315192       | 11501104 |

(iirc "z" should actually optimize for size even more aggressively than "s" so looks like something is a bit weird. :/ )

Last but not least we have the codegen units, which split up a crate into parts that are compiled in parallel (before (thin)lto). I guess codegen-units = 1 is a bit like monolithic lto and codegen-units > 1 is a bit like thinlto. There are tickets out there which seem to indicate that using several codegen-units worsens performance (#47665, #47745 ..). Currently it seems that by default we split up every crate into several codegen units and compile them in parallel while at the same time compiling several crates in parallel. (This is kind of unnecessary parallelism in my opinion, but choosing a more reasonable number of codegen-units by taking the host machine's CPU core count into account would mean we only have reproducible builds on machines with identical core counts, which is also bad.... :( )
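One way to keep the host's core count out of the picture is to pin the number of codegen units per profile, so every machine splits crates the same way. A small sketch, with numbers chosen purely for illustration:

```toml
# Sketch: fixed codegen-unit counts, independent of the build machine.
[profile.release]
codegen-units = 1    # no intra-crate splitting; slower build, more optimization scope

[profile.dev]
codegen-units = 16   # hypothetical value; favors build speed over runtime performance
```

Since the value is written into Cargo.toml rather than derived from the machine, two hosts with different core counts would use the same split. Parallelism across crates is still controlled separately by cargo's job count.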

Please correct me if I'm wrong!!

Interesting links:

thinlto: https://www.youtube.com/watch?v=p9nH2vZ2mNo
thinlto: http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html
lto: https://llvm.org/docs/LinkTimeOptimization.html

frewsxcv commented 6 years ago

The documentation team has been talking about creating a new guide for the rustc CLI, similar to the CLI section in the Rustdoc book. Tracking issue: https://github.com/rust-docs/team/issues/11. This might be a good place to talk about lto and codegen-units.

steveklabnik commented 6 years ago

Update: this has now been merged, and lives here: https://github.com/rust-lang/rust/tree/master/src/doc/rustc

frewsxcv commented 6 years ago

> Update: this has now been merged, and lives here: https://github.com/rust-lang/rust/tree/master/src/doc/rustc

In fact, the codegen-units and lto options are already mentioned in the rustc book:

But after skimming through the previous comments in this issue, it seems like there's room to expand the descriptions.