Add support for splitting linker invocation to a second execution of `rustc`

alexcrichton commented 5 years ago

This issue is intended to track support for splitting a rustc invocation that ends up invoking a system linker (e.g. cdylib, proc-macro, bin, dylib, and even staticlib in the sense that everything is assembled) into two different rustc invocations. There are a number of reasons to do this, including:

This can improved pipelined compilation support. The initial pass of pipelined compilation explicitly did not pipeline linkable compilations because the linking step needs to wait for codegen of all previous steps. By literally splitting it out build systems could then synchronize with previous codegen steps and only execute the link step once everything is finished.
This makes more artifacts cacheable with caching solutions like sccache. Anything involving the system linker cannot be cached by sccache because it pulls in too many system dependencies. The output of the first half of these linkable compilations, however, is effectively an rlib which can already be cached.
This can provide build systems which desire more control over the linker step with, well, more control over the linker step. We could presumably extend the second half here with more options eventually. This is a somewhat amorphous reason to do this, the previous two are the most compelling ones so far.

This is a relatively major feature of rustc, and as such this may even require an RFC. This issue is intended to get the conversation around this feature started and see if we can drum up support and/or more use cases. To give a bit of an idea about what I'm thinking, though, a strawman for this might be:

Add two new flags to rustc, --only-link and --do-not-link.
Cargo, for example, would first compile the bin crate type by passing the --do-not-link flag, passing all the flags it normally does today.
Cargo, afterwards, would then execute rustc again, only this time passing the --only-link flag.

These two flags would indicate to rustc what's happening, notably:

--do-not-link indicates that rustc should be creating a linkable artifact, such as a one of the ones mentioned above. This means that rustc should not actually perform the link phase of compilation, but rather it's skipped entirely. In lieu of this a temporary artifact is emitted in the output directory, such as *.rlink. Maybe this artifact is a folder of files? Unsure. (maybe it's just an rlib!)
The converse of --do-not-link, --only-link, is then passed to indicate that the compiler's normal phases should all be entirely skipped except for the link phase. Note that for performance this is crucial in that this does not rely on incremental compilation, nor does this rely on queries, or anything like that. Instead the compiler forcibly skips all this work and goes straight to linking. Anything the compiler needs as input for linking should either be in command line flags (which are reparsed and guaranteed to be the same as the --do-not-link invocation) or the input would be an output of the --do-not-link invocation. For example maybe the --do-not-link invocation emits an file that indicates where to find everything to link (or something like that).

The general gist is that --do-not-link says "prepare to emit the final crate type, like bin, but only do the crate-local stuff". This step can be pipelined, doesn't require upstream objects, and can be cached. This is also the longest step for most final compilations. The gist of --only-link is that it's execution time is 99% the linker. The compiler should do the absolute minimal amount of work to figure out how to invoke the linker, it then invokes the linker, and then exits. To reiterate again, this will not rely on incremental compilation because engaging all of the incremental infrastructure takes quite some time, and additionally the "inputs" to this phase are just object files, not source code.

In any case this is just a strawman, I think it'd be best to prototype this in rustc, learn some requirements, and then perhaps open an RFC asking for feedback on the implementation. This is a big enough change it'd want to get a good deal of buy-in! That being said I would believe (without data at this time, but have a strong hunch) that the improvements to both pipelining and the ability to use sccache would be quite significant and worthwhile pursuing.

alexcrichton commented 5 years ago

cc @cramertj and @jsgf, we talked about this at the RustConf and figured y'all would want to be aware of this

jsgf commented 5 years ago

Would the --only-link literally just invoke ld? If so, it might be useful to just be able to extract whatever extra libraries/options its adding to the link line, so we can independently regenerate the link line.

This would be useful for linking hybrid Rust/C++/(other) programs where the final executable is non-Rust. In other words, we could have C++ depending on Rust without needing to use staticlib/cdylib.

alexcrichton commented 5 years ago

I don't think it'd just be a thin wrapper around ld, no, but it would also prepare files to get passed to the linker. For example when producing a dylib rustc will unpack an rlib and make a temporary *.a without bytecode/metadata. Additionally if performing LTO this'd probably be the time we'd take out all the bytecode and process it. (maybe LTO throws a wrench in this whole thing). Overall though I don't think it's safe to assume that it'll just be ld.

jsgf commented 5 years ago

Firstly, we'd want the final linker doing LTO in order to get it cross-language, regardless of whatever language the final target is in and what mix of languages went into the target.

Secondly, since Buck has full dependency information, including Rust dependencies on C/C++, it will arrange for all the right libraries to be on the final link line. As a result we never want to use or honor #[link] directive, and our .rlibs don't contain anything to be unpacked.

(Even if that weren't true, at least on Unix systems, the .rlib is just a .a and could be used directly, except perhaps for the extension).

I like this proposal because it allows us to factor out the Rust-specific details from the language-independent ones. For example there's no reason for rustc to implement LTO if we're already having to solve that for other languages - especially when that solution pretty infrastructure-specific (distributed thin LTO, for example). There's also no real reason for us to use staticlib/cdylib if we can arrange for all the Rust linker parameters to be on the final link line, even if the final executable is C++, and it would be a significant code duplication reduction (unless LTO see it and eliminate it, but that's still a compile-time cost).

Ultimately, Rust lives in the world of linkable object files, and a final artifact is generated by calling the linker with a certain set of inputs. Since Rust doesn't have unusual requirements that make it incompatible with C/C++ linkage (eg special linkage requirements or elaborate linker scripts) then the final linker stage could be broadly language agnostic.

cramertj commented 5 years ago

+1 to wanting the ability to turn off / disable #[link].

michaelwoerister commented 5 years ago

I'm generally in favor of this. Some thoughts:

One of the computationally more heavy things that linking needs is the list of exported symbols (i.e. the linker script). Getting this list involves analyzing the HIR and reading upstream crate metadata. But it should be easy to serialize this information during the --do-not-link step and store it in the .rlink output.
Could we move all of LTO out of rustc? That would make things simpler for rustc but probably has some overhead. Also, as far as I know, llvm-ar does not support doing LTO, but for staticlibs one might want to have it (Firefox does this at least).

alexcrichton commented 5 years ago

I don't disagree that y'all's sorts of projects don't want to use the rustc-baked-in LTO, but I don't think we can remove it because many other projects do use it (and rightfully want to). Also this is still just a sort of high-level concept, but if a lot of feature requests are piled onto this it's unfortunately unlikely to happen.

0dvictor commented 4 years ago

Hi, my name is Victor and I'm workin with @tmandry. I am very interested in this Issue as it potentially enables many other great cool. Therefore, I plan to make a prototype and ultimately implement it. @alexcrichton , do you have any suggestions/tips to start?

alexcrichton commented 4 years ago

Great @0dvictor! The steps I'd recommend for doing this would probably look like:

Poke around rustc and the linking phase, and formulate a high-level plan of how you'd like to implement this feature.
Confirm with compiler team folks that the plan of action you've got is reasonable. The details probably wouldn't be fully fleshed out, but this is likely to be a large-ish change so the compiler folks will want to be onboard and it's best to start that early.
Iterate towards a working prototype (probably with compiler team help)
Start writing tests/etc
Evaluate with the compiler team at this point if the change needs an RFC or if it's good to land unstable in the compiler

As for the actual change itself I haven't looked too much into this, so I wouldn't know where best to start there.

tmandry commented 4 years ago

Some implementation notes:

Current state of things

The main run_compiler function calls Compiler::link, which ensures there is a codegen step running. It then dispatches to the trait method CodegenBackend::join_codegen_and_link.

For the LLVM backend (which is the only one right now), this method is implemented here. join_codegen_and_link joins all the threads running codegen, which save their results to an object file per thread (often named foo.bar.<hash>-cgu.<threadno>.rcgu.o on linux). The names of these object files are saved in the CodegenResults struct (specifically, look in CompiledModule).

Finally, it calls link_binary which has the main logic for invoking the linker.

Strategy

Obviously, we need to split apart all the code that assumes codegen and linking happen at the same time. This starts with the join_codegen_and_link trait method. Thankfully, it doesn't seem like there is too much code that assumes this, but there's still a question of what to do in the new code.

For the flags, we can start with unstable options (-Z no-link and -Z only-link) and later stabilize them via the RFC process. When the no-link flag is passed, we should not invoke link_binary anymore. When the only-link flag is passed, we need a way of recovering the information that was in our CodegenResults struct so we can call link_binary.

tmandry commented 4 years ago

In lieu of this a temporary artifact is emitted in the output directory, such as *.rlink. Maybe this artifact is a folder of files? Unsure. (maybe it's just an rlib!)

I don't much like using a directory as output, since some build systems might not support this.

Probably the best thing to do is to make an ar file (which I should note is what an rlib is today). I don't much care what the extension is (I like .rlink, since it probably gets handled differently than rlibs or staticlibs do today).

That said, the choice of extension should be up to whoever is invoking rustc, and we should use the -o flag to decide where our output goes.

I think there might be details we need to pay attention to regarding the linker's default behavior when linking object files vs archive files (like symbol visibility), but not sure what those details are. cc @petrhosek

bjorn3 commented 4 years ago

Bundling it together in a single ar file would do unnecessary work (both IO and CPU) Object files are always written to the disk. When building an ar file, they are copied then copied to the ar file and a symtab is created (ranlib). Creating a symtab can't be avoided if you dont want to unpack the ar file again before linking, as the linker requires a symtab to be present.

tmandry commented 4 years ago

I suppose that in build systems that need it, we can wrap invocations with a script that tars and untars the files, for example. I'm fine with having an output directory in that case. Ideally the compiler would support both, but that doesn't seem realistic for an initial implementation.

On Tue, Nov 19, 2019 at 10:58 AM bjorn3 notifications@github.com wrote:

Bundling it together in a single ar file would do unnecessary work (both IO and CPU) Object files are always written to the disk. When building an ar file, they are copied then copied to the ar file and a symtab is created (ranlib). Creating a symtab can't be avoided if you dont want to unpack the ar file again before linking, as the linker requires a symtab to be present.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/64191?email_source=notifications&email_token=AARMYYDGTTWB44PHNDV6OKTQUP5JBA5CNFSM4IUB7HWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEOPEEI#issuecomment-555545105, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARMYYFKOYAUILYEUDBUCB3QUP5JBANCNFSM4IUB7HWA .

0dvictor commented 4 years ago

I finally get a prototype working for the first part - generate a linkable object or bitcode file, as well as the linking command to invoke manually to finish linking. In addition, I also successfully ran LTO linking with a native lib.

While starting on the second stage, I found Rust is moving to LLD per https://github.com/rust-lang/rust/issues/39915. Can I make an assumption that I only need to support LLD or "clang/gcc -fuse-ld=lld"?

alexcrichton commented 4 years ago

While moving to LLD is nice, it's unlikely to happen any time soon, so it's best to not make the assumption of LLD.

tmandry commented 4 years ago

@0dvictor As a suggestion, you may want to file a PR that includes only the -Z no-link option while you work on finishing the implementation of -Z link-only. That way you can start getting feedback sooner and stay more in sync with mainline. But you may decide it's not worth the trouble.

rustc flags are in src/librustc/session/config.rs. See e.g. this PR for an example.

0dvictor commented 4 years ago

@0dvictor As a suggestion, you may want to file a PR that includes only the -Z no-link option w

Good idea, let me polish my changes and create a PR.

0dvictor commented 4 years ago

Sorry about the delay.

After studying the code and making some experiments, I found all linker arguments comes from the following four sources:

Target specs: e.g. linker, pre/post link args/objects, etc.
CLI of rustc: e.g. -L -C relocation-model, -C link-args, -Z pre-link-args, etc.
Compiled modules, include allocator and metadata when necessary;
Dependent libraries (rlib and native)
Crate source code: i.e. [link_args = "-foo -bar -baz"]

At linking stage, assume user always pass the required CLI arguments:

(1) and (2) can be constructed without any information from compiling stage.
(3) is the object or bitcode files from compiling state.
- i.e. --emit=object or --emit=llvm-bc
(4) is the information contained in a rmeta file
- Can be obtained via --emit=metadata
(5) is not part of any generated files from compiling stage
- Need somehow be passed to linking stage from compiling stage

Therefore, in my experiment, to compile without linking is basically: --emit=metadata,object` or `--emit=metadata,llvm-bc The user can choose to generate either an object file or LLVM bitcode file.

I have to make the following three changes to get it to work [PR #67195]:

Generate and save the .bc/.o/.ll/.s file of the allocator and metadata (if needed) when user requests --emit=llvm-bc/object/llvm-ir/asm
- [I feel this should be the expected behavior instead of completely ignoring allocator and metadata. I was very confused before knowing the .bc/.o/.ll/.s file does not contain all code compiled from the rust source code.]
Write the metadata to file when user requests --emit=metadata even if the OutputType is OutputType::Exe.
- [I also feel this should be the expected behavior.]
Skip linking.

To minimize the impact of existing code, all changes are guarded by -Z no-link.

I have not included (5) yet. My plan is to save it in either the rmeta file, or the bitcode/object file using LLVM’s !llvm.linker.options. I prefer the latter as we can get it for free for targets using LLD. (Of course we still have to generate corresponding linker args for targets do not use LLD)

If we want one single .rlink file, we can ar the .rmeta and .bc/o files generated in compiling stage.

0dvictor commented 4 years ago

Then for the linking stage, I plan to insert code here to read the .rmeta or .rlink file, resolve all dependencies then reconstruct the CodegenResults, so that we can create and execute the linker.

0dvictor commented 4 years ago

Finally, some thoughts on LTO: once this Issue finishes, we should be able to do LTO easily when we use LLD (either directly or via clang/gcc -fuse-ld=lld). I have successfully run LTO linking a rust crate and a llvm bitcode file generated by clang -flto -c. However, LLD only takes uncompressed bitcode files (either by itself or residing inside a .a file). Rust rlib only contains compressed ones, so that we would have to extract and uncompressed them before sending to LLD.

Out of curiosity, why does an rlib contains both native object file and LLVM bitcode file? Is it because the time cost of “LLVM bitcode => native object” is too expensive? Otherwise, we only need to save LLVM bitcode in an rlib and generate native objects when needed.

mati865 commented 4 years ago

Out of curiosity, why does an rlib contains both native object file and LLVM bitcode file? Is it because the time cost of “LLVM bitcode => native object” is too expensive? Otherwise, we only need to save LLVM bitcode in an rlib and generate native objects when needed.

https://github.com/rust-lang/rust/issues/66961

tmandry commented 4 years ago

Generate and save the .bc/.o/.ll/.s file of the allocator and metadata (if needed) when user requests --emit=llvm-bc/object/llvm-ir/asm

[I feel this should be the expected behavior instead of completely ignoring allocator and metadata. I was very confused before knowing the .bc/.o/.ll/.s file does not contain all code compiled from the rust source code.]

Write the metadata to file when user requests --emit=metadata even if the OutputType is OutputType::Exe.

[I also feel this should be the expected behavior.]

Skip linking.

To minimize the impact of existing code, all changes are guarded by -Z no-link.

Yeah, those should probably be the default. Would you mind opening an issue to track this?

0dvictor commented 4 years ago

Yeah, those should probably be the default. Would you mind opening an issue to track this?

Here they are: #67292 and #67293

I also created #67294 to track a different issue found while I am working on splitting compiling and linking.

bjorn3 commented 4 years ago

The rustc part of this has been merged. This is waiting on cargo support now.

luser commented 4 years ago

Bundling it together in a single ar file would do unnecessary work (both IO and CPU) Object files are always written to the disk. When building an ar file, they are copied then copied to the ar file and a symtab is created (ranlib). Creating a symtab can't be avoided if you dont want to unpack the ar file again before linking, as the linker requires a symtab to be present.

Note that for large projects this overhead adds up. GNU ar added a "thin archive" mode a while back which just stores the paths to the object files instead of copying them all in. We hand-rolled something like this for Firefox as well. It may not be as bad as the C/C++ case since rlibs don't have as many object files in them but it's worth looking into. From sccache's perspective it'd probably be fine if the rlink file just contained paths to the rlibs to use and maybe specific object file names or byte ranges or whatever.

Also I just wanted to clarify slightly this bit from the original comment:

Anything involving the system linker cannot be cached by sccache because it pulls in too many system dependencies.

The reason sccache can't cache compiles that invoke the system linker is because it doesn't have code to interrogate linkers about what their inputs and outputs for a specific link would be. sccache needs to be able to know all the inputs and outputs of compilation in order to generate correct cache keys and store the results in the cache. I looked into this a little bit a while ago and it's probably doable, just complex. (It'd probably wind up having to parse linker scripts from ld, for example.)

jsgf commented 4 years ago

it doesn't look like this emits an artifact notification for .rlink files. It might also be useful if it emitted notifications for all the object files as well - otherwise a build tool which is trying to keep track of all the emitted objects would need to parse the json (presumably not a stable format at this point) to work out what they all are.

tmandry commented 4 years ago

@0dvictor and I were discussing how to collect the input files. We decided the easiest way was in fact to archive them, and that we would profile the performance impact in Fuchsia to see if it was a problem.

This approach has the nice property that you know exactly how many output files there will be and what their names are, which makes integrating into existing systems easier. Given that this unblocks using Rust in our distributed build system, it should still be a win for performance.

That said, it sounds from @luser's comment like the overhead might be more than I had originally thought, and we might want support for N output files at some point. We may end up with a solution that supports both archived and non-archived outputs.

Emitting the names in artifact notifications was another possibility that came up. It seems correct to me that they would be emitted, but I haven't thought about it too carefully. We don't use these in our build system today, though we could probably teach it to use them with a bit of effort. Still, it will require more work to handle in our distributed build system, which is why we didn't jump for that option first.

jsgf commented 4 years ago

Yeah, a heavyweight ar would add a lot of unnecessary cost in extra copying. A thin archive would be fine though.

Is there a plan to stabilize/document the format of .rlink files? It looks like it has everything needed to directly generate a linker line, which might be preferable for us.

jsgf commented 4 years ago

rustc -Zlink-only thing.rlink deletes its input files as part of its execution. That seems... strange.

bjorn3 commented 4 years ago

-Zlink-only invokes the linker code like normal. Only when -Csave-temps is used, does the linker code not cleanup object files.

tmandry commented 4 years ago

I don't think we want to stabilize the rlink format anytime soon. It's basically a JSON representation of an internal Rust data structure.

Thin archives or json directives seem like a better bet. Or a distributed build worker can scan the output directory for any new files and send them to the client, rsync-style.

jsgf commented 4 years ago

@bjorn3 Sure, but they're not really internal tmp files in the no-link/link-only flow - they're outputs of one build step and inputs of the next, and build steps don't generally trash their inputs. You'd expect to be able to re-run link-only without needing to redo no-link.

Consider, for example, where the input files are paths into a shared cache where deleting them removes them from or corrupts the cache. The current behaviour would require them to be copied to avoid this.

jsgf commented 4 years ago

@tmandry Nope to scanning a dir, but a thin archive is a good starting point.

However, I was interested in the other info in that file, like the paths to other libraries which need to be linked in - though TBH I'm not sure that's the best way to receive that information.

bjorn3 commented 4 years ago

Sure, but they're not really internal tmp files in the no-link/link-only flow - they're outputs of one build step and inputs of the next, and build steps don't generally trash their inputs.

After the linking phase, when there is no shared cache, you will likely want to delete those intermediate files. Those intermediate files don't have predictable names, so cargo or other programs won't be able to remove them. Simply searching for .o files won't work, as other kinds of files are created and new kinds may be created in the future.

they're outputs of one build step and inputs of the next

It is very unlikely that you want to run either of the parts without the other. If any input file is changed, you need to run -Zno-link to handle the changes. If you run -Zno-link, you will also want to run -Zlink-only to perform the linking. If you didn't, you could just have used --emit metadata instead like cargo check does. The only reason -Zno-link and -Zlink-only are separated, is because you would otherwise need to notify rustc that all dependencies are fully built.

However, I was interested in the other info in that file, like the paths to other libraries which need to be linked in - though TBH I'm not sure that's the best way to receive that information.

Use -Zbinary-dep-depinfo to put all crate dependencies in the --emit dep-info file.

jsgf commented 4 years ago

@bjorn3 In a distributed build environment, the no-link and link-only steps are likely happening on different machines, with the intermediates being copied between them (or via some shared content store).

After the linking phase, when there is no shared cache, you will likely want to delete those intermediate files.

No, they're still useful for subsequent links if you're rebuilding the target executable because something else (eg non-Rust) changed.

You may just want to generate multiple versions of the same executable from the same inputs (eg built with different linker scripts, or static vs dynamic for system libraries).

bjorn3 commented 4 years ago

In a distributed build environment, the no-link and link-only steps are likely happening on different machines, with the intermediates being copied between them (or via some shared content store).

Many of the paths in the .rmeta file are absolute, so you will need to copy everything to the exact same location on the other machine. Also you don't know what files you have to copy. Even with -Zbinary-dep-depinfo the names of the intermediate files are not mentioned in neither the depinfo nor the artifact messages, and they are merely an implementation detail. In theory rustc could even put those intermediate files in /tmp. (For example if it detects that /tmp is a tmpfs, and the files fit in memory)

No, they're still useful for subsequent links if you're rebuilding the target executable because something else (eg non-Rust) changed.

You may just want to generate multiple versions of the same executable from the same inputs (eg built with different linker scripts, or static vs dynamic for system libraries).

When changing linker flags (this includes static vs dynamic for system libraries), you will need to call -Zno-link again to generate a new .rlink file with those linker flags.

luser commented 4 years ago

We decided the easiest way was in fact to archive them, and that we would profile the performance impact in Fuchsia to see if it was a problem.

I'm generally in favor of implementing the straightforward thing and profiling to see if it's actually a problem. I just wanted you to be aware of existing work in this area.

Many of the paths in the .rmeta file are absolute, so you will need to copy everything to the exact same location on the other machine. Also you don't know what files you have to copy. Even with -Zbinary-dep-depinfo the names of the intermediate files are not mentioned in neither the depinfo nor the artifact messages, and they are merely an implementation detail.

Given that sccache was mentioned in the initial comment here, if this is going to be cache-friendly we do need to have a way to know what the full set of inputs and outputs will be for a specific command otherwise we can't reliably store a result in cache that we can use later.

jsgf commented 4 years ago

@bjorn3 I see this work as a good start, but an intermediate step to:

the .rlink is a well-defined file with a documented schema
a correct linker line can be generated from the .rlink without requiring the use of rustc -Zlink-only

Step 1 is necessary because we need to be able to extract the path info, and rewrite it to remove/restructure absolute paths. The whole value in splitting linking is so we can run compilation of bin/dylib/procmacro crates and their linkage as separate build steps. "Splitting" them but then still requiring them to be run 1:1 on the same machine doesn't help all that much.

Step 2 is necessary so we can avoid using static and cdylib crate types as they're painful to use in hybrid Rust/C++ programs where main() is not Rust (ie, we want to put the path to libstd on the final link line, not link it in multiple times for each C++ -> Rust dependency)..

bjorn3 commented 4 years ago

the .rlink is a well-defined file with a documented schema

This will never happen. There are many implementation details written to .rlink. For example:

[...]
      "compiler_builtins" : 3,
[...]
      "missing_lang_items" : {
         "14" : [],
         "8" : [
            "EhPersonalityLangItem"
         ],
[...]
      },
      "panic_runtime" : 14,
      "profiler_runtime" : null,
[...]
   "metadata_module" : null,
   "allocator_module" : {
      "bytecode" : null,
      "name" : "33dyzt1ekirinwy8",
      "kind" : "Allocator",
      "object" : "rust_out.33dyzt1ekirinwy8.rcgu.o",
      "bytecode_compressed" : null
   },
[...]
   "modules" : [
      {
         "name" : "rust_out.7rcbfp3g-cgu.0",
         "bytecode" : null,
         "object" : "rust_out.rust_out.7rcbfp3g-cgu.0.rcgu.o",
         "bytecode_compressed" : null,
         "kind" : "Regular"
      },
[...]

compiler_builtins is an implementation detail and could technically be merged with libcore. Lang items are an implementation detail. panic_runtime and profiler_runtime are implementation details and could maybe in the future be replaced with normal deps. The exact things known about modules could change. For example bytecode_compressed may be removed once bytecode for LTO is no longer emitted alongside regular object files. metadata_module and allocator_module may in the future be folded into modules. ... (note: all things except for bytecode_compressed are purely hypothetical)

Step 1 is necessary because we need to be able to extract the path info, and rewrite it to remove/restructure absolute paths. The whole value in splitting linking is so we can run compilation of bin/dylib/procmacro crates and their linkage as separate build steps. "Splitting" them but then still requiring them to be run 1:1 on the same machine doesn't help all that much.

The splitting is mostly meant to be able to perform pipelined compilation on the final artifact, not to perform it codegen and linking on different machines. What is even the benefit of performing codegen and linking on different machines by the way? It doesn't increase parallelism, as the linking has to wait on codegen anyway.

Step 2 is necessary so we can avoid using static and cdylib crate types as they're painful to use in hybrid Rust/C++ programs where main() is not Rust (ie, we want to put the path to libstd on the final link line, not link it in multiple times for each C++ -> Rust dependency)..

That would require making the linker arguments stable, as changes could break such thing. Also why not create one rust library that depends on all rust dependencies and then depend on that rust library from C++?

jsgf commented 4 years ago

The splitting is mostly meant to be able to perform pipelined compilation on the final artifact, not to perform it codegen and linking on different machines. What is even the benefit of performing codegen and linking on different machines by the way? It doesn't increase parallelism, as the linking has to wait on codegen anyway.

Sure, but from our point of view we want to break the build up into atomic actions with well-defined inputs and outputs and then be able to freely schedule them across a build cluster. I don't want to have to treat Rust build + link as a special case - adding a constraint that they have to execute on the same machine would make it much harder to schedule.

But I think if we can use a thin ar to logically bundle them all up, it will be managable.

This will never happen. There are many implementation details written to .rlink. For example:

OK, so I think there's too much stuff in the .rlink file then. If we assume that the no-link and link-only invocations are passed an identical set of flags (aside from the -Z option itself), then all the other details are regeneratable and don't need to be in the .rlink. The .rgcu.o files are the only thing that's uniquely useful to pass from the first invocation to the second.

Also why not create one rust library that depends on all rust dependencies and then depend on that rust library from C++?

That doesn't scale. There could be hundreds of C++ libraries linked in, any of which could be using some combination of Rust libraries.

bjorn3 commented 4 years ago

then all the other details are regeneratable and don't need to be in the .rlink

No, link args can come from #[link] attributes on extern blocks. Because of proc-macros those can be generated non-deterministically.

jsgf commented 4 years ago

No, link args can come from #[link] attributes on extern blocks. Because of proc-macros those can be generated non-deterministically.

Fair enough - they can be encoded in the .rlink too. But as they won't work in our environment we'll want to ignore them or complain if they're present (that would effectively be a proc-macro making up a dependency the build system doesn't know about which is a big nono - and if the build system does know about it, we don't need the #[link]).

tmandry commented 4 years ago

What is even the benefit of performing codegen and linking on different machines by the way? It doesn't increase parallelism, as the linking has to wait on codegen anyway.

Linking, in theory, depends on a lot more artifacts than codegen does. Codegen should only require source code and the rmeta files from any crates you depend on. Linking requires all the generated code. In our sccache-like environment, this would mean uploading many rlib files and possibly system libraries to the worker. Network bandwidth becomes a bottleneck. So it's much better to send compile steps to the workers, hitting cache when possible, and do linking locally.

That would require making the linker arguments stable, as changes could break such thing.

Link args don't need to be stable, just the file format which contains them. I don't think the fact that the file contains references to implementation details like compiler-builtins is a problem, actually. As long as those details can change without changing the schema, a well-written tool should be able to consume them without breakage.

That said, stabilizing rlink seems more ambitious than having a rustc option which spits out the final linker line, allowing you to run it yourself.

0dvictor commented 4 years ago

Apologize for my such late reply.

After experimenting the archiving approach (using ar) to collect the linker's input files, I found there are several drawbacks, which made me believe ar is not the correct approach:

Generating a regular .a file requires extra storage, CPU, and IO resources
- Effectively doubles the peak storage and IO usages
- What is worse: when crate-type is a rlib or staticlib, the .a file must be extracted before passing to the linker
Creating "thin archive" does not solve the fundamental issue that a distributed build system has to deal with unknown number of files
Not all platform supports "thin archive"
- E.g. I don't think msvc's link.exe supports it

Similarly, simply archiving the files with tar or zip or even somehow embedding them into a .rlink file would not be ideal either.

Therefore, I am experimenting a different approach: after finishing LLVM's optimizations, linking all CGUs into a combined one using llvm::Linker; so that only one .o file is generated for the crate. (Of course, there will be potentially another two .o files: the allocator and the metadata). Therefore, the distributed build system only has to deal with a known number of .o files: one for the main crate, zero or one for the allocator, and zero or one for the metadata. If -Z human-readable-cgu-names is also enabled, these .o files can be easily identified. The main benefits comparing to the archiving approach would be:

Platform independent
Zero change required on linker side
No extra storage or IO cost
- Not reading/writing more data from/into disk
Unlikely to increase rustc's peak memory usage
- Linking with llvm::Linker is expected to use much less memory than optimizations; therefore, performing linking after optimizations is unlikely to exceed the memory footprint during optimization stage.
Bonus: also enables parallelism for --emit [llvm-bc|llvm-ir|asm|obj]

WDYT?

bjorn3 commented 4 years ago

Therefore, I am experimenting a different approach: after finishing LLVM's optimizations, linking all CGUs into a combined one using llvm::Linker

That won't work for non LLVM based backends.

Linking with llvm::Linker is expected to use much less memory than optimizations; therefore, performing linking after optimizations is unlikely to exceed the memory footprint during optimization stage.

What about a crate with many codegen units? During optimizations only a few codegen units are in memory at any time, while during linking with llvm::Linker, all of them have to be in memory. Also llvm::Linker links bitcode I believe, which means that all passes after the user facing LLVM-ir are forced to be done in a single thread after the linking. Especially in debug mode I expect this to be a significant performance regression.

0dvictor commented 4 years ago

That won't work for non LLVM based backends.

Correct, but we can implement similar, if not the same, way to combine the CGUs for that non-LLVM backend.

What about a crate with many codegen units? During optimizations only a few codegen units are in memory at any time, while during linking with llvm::Linker, all of them have to be in memory. Also llvm::Linker links bitcode I believe, which means that all passes after the user facing LLVM-ir are forced to be done in a single thread after the linking. Especially in debug mode I expect this to be a significant performance regression.

Great points. I should've been clear that this proposed feature would be guarded by an option, say -Z combine-cgus, and even if the feature is stablized, user should be able to turn if on/off; so that user is able to pick the best behavior that fits their usecases. In other words, the new feature should not introduce regressions to existing workloads.

Regardless of splitting out the linker invocation, Being able to generate a single or defined number of .o files eases many tasks. The distributed compiling tool is one example: the tool would not have to parse the .rlink and/or other files to figure out the undefined number of files need be copied from the server to the client. This also means rustc does not have to provide a stable .rlink file format. Other examples include tasks that run with --emit [llvm-bc|llvm-ir|asm|obj]. As these workloads require a single .[bc|ll|asm|o] output file, currently they are done in a single thread. Enabling parallelism for these tasks should yield a good performance boost.

tmandry commented 4 years ago

Can llvm::Linker not link both bitcode and object code? IIUC, it would indeed regress performance quite a bit in that case, since LLVM IR -> Object code is one of the slowest compilation steps.

@0dvictor, can we compare the build time using the flag in your working branch versus -Ccodegen-units=1, versus a baseline? That would help indicate if this flag is worth maintaining.

0dvictor commented 4 years ago

Can llvm::Linker not link both bitcode and object code? IIUC, it would indeed regress performance quite a bit in that case, since LLVM IR -> Object code is one of the slowest compilation steps.

Unfortunately, llvm:Linker is only for linking bitcode.

can we compare the build time using the flag in your working branch versus -Ccodegen-units=1, versus a baseline? That would help indicate if this flag is worth maintaining.

Good idea, let me do that.

adetaylor commented 4 years ago

@jsgf The project I'm working on seems to share similar properties to yours:

Using a non-Cargo build system based on static dependency resolution (and has 20000+ targets)
Final linking performed by an existing C++ toolchain
A few Rust .rlibs scattered throughout a very deep dependency tree, which may eventually roll up into one or multiple binaries

We can't:

Switch from our existing linker to rustc for final linking. C++ is the boss in our codebase; we're not ready to make the commitment to put Rust in charge of our final linking.
Create a Rust static library for each of our .rlibs. This works if we're using Rust in only one place. For any binary containing several Rust subsystems, there would be binary bloat and often violations of the one-definition-rule
Create a Rust static library for each of our output binaries. The build directives for the Rust .rlibs don't know what final binaries they'll end up in; and the build directives for the final binaries don't know what .rlibs they've pulled in from deep in their dependency tree. Our build system forbids that sort of global knowledge, or it would be too slow across so many targets.
Create a single Rust static library containing all our Rust .rlibs. That monster static library would depend on many C++ symbols, so each binary would become huge (and in fact not actually link properly)

Therefore we want to link rlibs directly into our final linker invocation. This is, in fact, what we're doing, but we have to add some magic:

The final C++ linker needs to pull in all the Rust stdlib .rlibs, which would be easy apart from the fact they contain the symbol metadata hash in their names.
We need to remap __rust_alloc to __rdl_alloc etc.
In future the final rustc-driven linker invocation might add extra magic.

Therefore one of the solutions you and @tmandry suggest would be awesome: either stabilizing .rlink (which sounds unlikely) or having some official way to build a linker command line which is not under the control of rustc (-Zbinary-dep-depinfo helps a bit).

I think, though, that this request is a little bit orthogonal to this issue. I wonder if we should submit a new issue? I think it's quite a big request.

tmandry commented 4 years ago

That would be useful and it's also orthogonal. A new issue sounds like a good idea.

On Fri, May 1, 2020 at 4:17 PM adetaylor notifications@github.com wrote:

@jsgf https://github.com/jsgf The project I'm working on seems to share similar properties to yours:

Using a non-Cargo build system based on static dependency resolution (and has 20000+ targets)

Final linking performed by an existing C++ toolchain

A few Rust .rlibs scattered throughout a very deep dependency tree, which may eventually roll up into one or multiple binaries

We can't:

Switch from our existing linker to rustc for final linking. C++ is the boss in our codebase; we're not ready to make the commitment to put Rust in charge of our final linking.

Create a Rust static library for each of our .rlibs. This works if we're using Rust in only one place. For any binary containing several Rust subsystems, there would be binary bloat and often violations of the one-definition-rule

Create a Rust static library for each of our output binaries. The build directives for the Rust .rlibs don't know what final binaries they'll end up in; and the build directives for the final binaries don't know what .rlibs they've pulled in from deep in their dependency tree. Our build system forbids that sort of global knowledge, or it would be too slow across so many targets.

Create a single Rust static library containing all our Rust .rlibs. That monster static library would depend on many C++ symbols, so each binary would become huge (and in fact not actually link properly)

Therefore we want to link rlibs directly into our final linker invocation. This is, in fact, what we're doing, but we have to add some magic:

The final C++ linker needs to pull in all the Rust stdlib .rlibs, which would be easy apart from the fact they contain the symbol metadata hash in their names.

We need to remap __rust_alloc to __rdl_alloc etc.

In future the final rustc-driven linker invocation might add extra magic.

Therefore one of the solutions you and @tmandry https://github.com/tmandry suggest would be awesome: either stabilizing .rlink (which sounds unlikely) or having some official way to build a linker command line which is not under the control of rustc ( -Zbinary-dep-depinfo helps a bit).

I think, though, that this request is a little bit orthogonal to this issue. I wonder if we should submit a new issue? I think it's quite a big request.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/64191#issuecomment-622607220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARMYYFU42EPAMD5YNWZ5XLRPNJ77ANCNFSM4IUB7HWA .

rust-lang / rust

Add support for splitting linker invocation to a second execution of `rustc` #64191

Recommended reading

Current state of things

Strategy