rui314 / mold

Mold: A Modern Linker 🦠
MIT License

Proposal: split debug symbols #1294

Open jzwinck opened 1 week ago

jzwinck commented 1 week ago

Making release builds with debug symbols (e.g. -O2 -g) is common and produces large executables. It is standard industry practice to split debug symbols into a separate file which does not need to be deployed to most machines. See https://github.com/GabrielMajeri/separate-symbols.

While the objective is well known and valuable, the process is arcane and inefficient. Linkers produce an executable file which is immediately read back from disk using separate tools and then written to the final files. Mold could do this better.
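For reference, the conventional post-link workflow can be sketched as three `objcopy` invocations. The helper below only builds the command lines and does not run them. The function name and the `.debug` suffix are illustrative choices, not anything defined by mold or binutils. Each step reopens the executable, which is the repeated disk traffic described above.

```python
def split_debug_commands(exe: str, debug_suffix: str = ".debug") -> list[list[str]]:
    """Return argv lists for the usual three-step debug split.

    Hypothetical helper for illustration only; it runs nothing.
    """
    debug_file = exe + debug_suffix  # naming convention is an assumption
    return [
        # 1. Copy the debug information into a separate file.
        ["objcopy", "--only-keep-debug", exe, debug_file],
        # 2. Remove debug sections from the executable, in place.
        ["objcopy", "--strip-debug", exe],
        # 3. Add a .gnu_debuglink section so debuggers can find the file.
        ["objcopy", f"--add-gnu-debuglink={debug_file}", exe],
    ]


if __name__ == "__main__":
    import shlex

    for argv in split_debug_commands("build/app"):
        print(shlex.join(argv))
```

A linker that writes both files directly would skip all three passes, since it already has every section in memory when it lays out the output.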

Benefits:

  1. Ease of use for people who currently find it troublesome or aren't even aware it's possible on Linux (it's much easier using MSVC).
  2. Speed. Avoids writing and reading the full executable on disk. Also, Mold is better at parallelism and reusing data in memory vs existing tools which are single-threaded and read (parts of) the executable multiple times.
  3. Mold gains a competitive advantage over other linkers (at least until they add this, which would only be a good thing).
  4. Reduced disk usage during the build by avoiding writing the large but ultimately unnecessary full executable.

Do you like this idea or see pitfalls? Feedback very welcome.

rui314 commented 1 week ago

I'm interested in this, not only because it's convenient but also because it could potentially improve the linker's overall speed for debug builds. Let me experiment with some ideas.

Ext3h commented 3 days ago

There is a slight pitfall in deciding which symbols and sections go into which file, and which are stripped entirely.

There's a difference between three cases. The first is a library intended for open source distribution: the full symbol file can be made available, and the binary can be stripped to the bare minimum. The second is a closed source library intended for embedding into a 3rd party process: the original developer needs a full symbol file to provide support, but minimal debugging information must also stay in the shipped binary so the 3rd party can walk stack frames. The third is a self-contained executable, which can be stripped entirely (including relocation information), with everything placed in the symbol file.

The referenced objcopy call sequence covers the first case only. The second case requires finer-grained control over which sections go into which file, and that also depends on the debug format actually used.

Long story short, you may need more than one output file for symbols, and you need fine-grained control over which sections go where, as well as over the naming convention used to set up the debug links / supplementary file links.
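To make the routing question concrete, here is a minimal sketch of a per-section decision. Everything in it is an assumption for illustration: the function, the `keep_symtab` switch, and the idea of routing by name alone. Real tools also look at section type and flags, not just names. The sketch assumes the second case wants the static symbol table (`.symtab` / `.strtab`) left in the binary so stack traces can still show function names.

```python
from enum import Enum


class Dest(Enum):
    BINARY = "binary"
    DEBUG_FILE = "debug file"


def route_section(name: str, keep_symtab: bool) -> Dest:
    """Pick an output file for one section (hypothetical policy)."""
    # DWARF sections are only needed by debuggers.
    if name.startswith(".debug_"):
        return Dest.DEBUG_FILE
    # The static symbol table is optional at run time; whether it ships
    # is a policy choice (assumed here to differ between the cases).
    if name in (".symtab", ".strtab"):
        return Dest.BINARY if keep_symtab else Dest.DEBUG_FILE
    # Everything else (.text, .dynsym, .eh_frame, ...) stays in the binary.
    return Dest.BINARY


if __name__ == "__main__":
    for sec in (".text", ".eh_frame", ".symtab", ".debug_info"):
        print(sec, "->", route_section(sec, keep_symtab=True).value)
```

Even this toy version needs one policy switch, which suggests how quickly the option surface grows once several output files and debug formats are involved.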

There is also the interaction with --compress-debug-sections to consider. You may need different compression algorithms / choices for the embedded and external symbols. E.g. sticking with widely supported LZMA for the embedded line info, but wanting ZSTD for the private, external symbol file.

rui314 commented 3 days ago

We don't need to support all of the use cases. For complex use cases, it would probably be better to stick with post-link editing tools such as objcopy. I'm interested in implementing this because it could speed up normal linking by writing debug info to a separate file.

jzwinck commented 2 days ago

@Ext3h Thank you for writing all that out. Personally I have only ever seen people using what you call the first case. I am not sure what the difference is between your third case and first case. I understand the motivation behind the second case, but it is more complex and it seems reasonable to implement the simpler case first.

@rui314 Thank you as well. I agree with your points, and it would be great if the normal linking process became faster because of this. That would be even better than what I expected, which was merely that the overall build time would decrease by removing post-processing.