proposal: using weak symbols instead of binary rewriting

emillon commented 8 months ago

Context

Some dune features like dune-build-info rely on a mechanism called artifact substitution: a string placeholder is put in the program, and at a later point it is replaced in the binary. At runtime, the program parses the string and determines if it is the placeholder, or extracts structured data from it.
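To make the runtime half of this concrete, here is a minimal sketch in C. The marker `%%PLACEHOLDER:` and the function `read_slot` are hypothetical names chosen for illustration; the placeholder format dune actually uses is different and is not reproduced here.

```c
#include <string.h>

/* Hypothetical marker, for illustration only: the real placeholder
   format used by dune is not shown here. */
#define PLACEHOLDER_PREFIX "%%PLACEHOLDER:"

/* Inspect the string stored in the binary.
   Returns NULL if it is still the untouched placeholder (no substitution
   happened), otherwise returns the substituted data. */
const char *read_slot(const char *slot)
{
    if (strncmp(slot, PLACEHOLDER_PREFIX, strlen(PLACEHOLDER_PREFIX)) == 0)
        return NULL;
    return slot;
}
```

In a real program the slot would be a string constant embedded in the executable, and the substitution step would overwrite its bytes in the copied file.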

Problems with the current approach

This approach works pretty well in practice, but has caused issues in the past on macOS (#5650, #6226, #8360, etc). The reason is that under System Integrity Protection, only executables that have been produced by a compiler can be executed. This is implemented by attaching a signature to the file. When we modify the executable, we break this: the signature becomes invalid and needs to be updated by running the codesign tool provided by macOS.

Weak symbols

ELF has the notion of weak symbols (Mach-O and MSVC have similar concepts). It allows defining a symbol as "weak" (by default, symbols are "strong"). At link time, if a symbol has only a weak definition, it is used; but if a symbol has both a weak and a strong definition, the strong one is used (it is an error to link with several strong definitions). This makes it possible to change the behavior of a program by linking an extra object.

While it is not possible to specify that an OCaml symbol should be exposed as weak, it is possible to use that concept through C stubs and (extra_objects). The following cram test demonstrates this.

  $ cat > dune-project << EOF
  > (lang dune 3.5)
  > EOF

  $ cat > dune << EOF
  > (library
  >  (name lib)
  >  (modules)
  >  (foreign_stubs
  >   (language c)
  >   (names stubs)))
  > 
  > (executable
  >  (name main)
  >  (modules main)
  >  (libraries lib))
  > 
  > (executable
  >  (name main2)
  >  (modules main2)
  >  (libraries lib)
  >  (extra_objects extra))
  > 
  > (rule
  >  (copy main.ml main2.ml))
  > 
  > (rule
  >  (target extra.o)
  >  (action
  >   (run %{cc} -o %{target} -c %{dep:extra.c})))
  > EOF

  $ cat > main.ml << EOF
  > external get_message : unit -> string = "get_message"
  > let () = print_endline (get_message ())
  > EOF

  $ cat > stubs.c << EOF
  > #include <caml/memory.h>
  > #include <caml/alloc.h>
  > 
  > #pragma weak message
  > const char* message = "default value";
  > 
  > value get_message(value unit)
  > {
  >     CAMLparam1(unit);
  >     CAMLreturn(caml_copy_string(message));
  > }
  > EOF

  $ cat > extra.c << EOF
  > const char* message = "overridden value";
  > EOF

  $ dune exec ./main.exe
  default value
  $ dune exec ./main2.exe
  overridden value

This mechanism can be used to implement dune-build-info: by default, a weak "no data" symbol would be linked in (returning None at runtime), but to get the final version of the executable, dune would generate the actual data (git describe etc) into a C file and relink the executable with the extra object.
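A minimal sketch of what the default "no data" side could look like, assuming GCC or Clang. The symbol `build_info_version` and the function `build_info_get_version` are hypothetical names, not part of dune-build-info.

```c
#include <stddef.h>

/* Hypothetical symbol name. This weak definition means "no build info".
   A strong definition in an extra object, for example
       const char *build_info_version = "1.2.3-4-gabcdef";
   would take precedence over it at link time. */
__attribute__((weak)) const char *build_info_version = NULL;

/* Returns 1 and stores the version in *out if a strong definition was
   linked in; returns 0 and leaves *out untouched otherwise (the analogue
   of returning None on the OCaml side). */
int build_info_get_version(const char **out)
{
    if (build_info_version == NULL)
        return 0;
    *out = build_info_version;
    return 1;
}
```

Linked on its own, this reports that no version is available; relinking with a generated object that defines the symbol strongly changes the result without touching the bytes of an existing executable.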

Benefits

The main benefit of this approach is that we don't need to edit opaque binary files. This would ensure that all executables that are run were linked by ld, which removes the need to invoke codesign by hand. macOS and other systems are likely to tighten their security requirements, so this avoids playing a cat-and-mouse game trying to circumvent them.

Limitations and problems

This relies on a non-portable feature of C compilers, but one that has been documented and supported by the most common toolchains for at least 20 years or so. And it is meant to replace a non-portable hack.
The feature is not completely the same on Linux, macOS and Windows, but since it's about replacing a whole object file instead of a single symbol it should work fine.
It requires re-linking the executable instead of just making a streaming copy, which could be slow for large executables (though the extra cost is only paid at promotion time).
This approach only works for native executables. For bytecode and JS, a search-and-replace solution is still required. In the case of bytecode, it is possible to imagine a smarter replacement tool that unpacks the bytecode executable, alters just the required symbol and repacks it.

Which brings us to the main challenges:

the semantics of "copying a file and editing at the same time" and "linking a different variant" are different since it is necessary to carry the context
we can't support only this strategy, so we'll have to implement different strategies (weak symbols for native, substitution for JS, smart editing or substitution for bytecode). So one difficulty is going to be finding the right abstraction on top of them. The resulting rules are likely to be more complicated than the current situation.

nojb commented 8 months ago

> For bytecode and JS, a search-and-replace solution is still required. In the case of bytecode, it is possible to imagine a smarter replacement tool that unpacks the bytecode executable, alters just the required symbol and repacks it.

To make sure I understand, when you say "JS" here you mean the non-JSOO backends, right?

> The feature is not completely the same on Linux, macOS and Windows

Out of curiosity, do you have any references for the analog mechanism under macOS and Windows?

nojb commented 8 months ago

> In the case of bytecode, it is possible to imagine a smarter replacement tool that unpacks the bytecode executable, alters just the required symbol and repacks it.

Are these two alternatives really any different?

emillon commented 8 months ago

> To make sure I understand, when you say "JS" here you mean the non-JSOO backends, right?

Yes, I wasn't very clear. jsoo would use whatever bytecode does.

> Out of curiosity, do you have any references for the analog mechanism under macOS and Windows?

macOS: Frameworks and Weak Linking. I believe that just using __attribute((weak)) will even work.
Windows: I need to experiment, but it seems that a .o can override something from a .a.

> Are these two alternatives really any different?

No. My point with that was that altering a binary file is problematic because you don't know what you're changing in the file. It's not possible (well, not easily possible) to edit a fully-linked native executable to change just a variable, but it's possible to do that with a bytecode executable. Actually, come to think of it, if bytecode executables become compressed, binary substitution is not going to work, but a smart replacement would continue to work.

rgrinberg commented 8 months ago

The proposal seems fine at least as an alternative to binary rewriting. We can determine if it's worth making it the default after we have an implementation and some experience with it in the "real world".

I would like the opinion of @anmonteiro on this proposal as a user of cross compilation and a maintainer of nix.

anmonteiro commented 8 months ago

The proposal looks fine to me. I didn't know about weak symbols either.

Perhaps this could even be used in the future to implement dune-sites too; IIRC that also does opaque binary rewriting.

emillon commented 8 months ago

We discussed the details of this with @dra27 and he suggested that since we are going to tinker with linking with that approach, we can make something even simpler by making the placeholder a string option reference that is set to the actual data when the final executable is linked. This requires the "initializer" module to be linked first but this is similar to what we do for the exit module in the stdlib. A challenge for all of these approaches is that when doing a binary substitution, we don't have to know in advance where the placeholders are; but when generating weak symbols or option references, we have to keep an inventory of these hook points.

ejgallego commented 8 months ago

Thanks for working on this. For my use cases this is much better than the current binary rewriting approach (which indeed prevents us from using any feature that would trigger this in Coq).

However I've always wondered why not just use a simple metadata file, like .bin.build-info which is expected next to the binary?

This approach has its own complications, but it seems to me that it has fewer than the other two.
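A minimal sketch of the metadata-file idea in C, assuming a hypothetical naming scheme where the data lives in a file named after the executable path with `.build-info` appended and holds a single line of text. The function name and file format are illustrative only, not something dune provides.

```c
#include <stdio.h>
#include <string.h>

/* Reads the first line of "<exe_path>.build-info" into buf (hypothetical
   naming scheme). Returns 1 on success, 0 if the file is missing, empty,
   or the path does not fit in the local buffer. */
int read_sidecar_build_info(const char *exe_path, char *buf, size_t buf_len)
{
    char path[4096];
    int n;
    FILE *f;
    int ok;

    if (buf_len == 0)
        return 0;

    n = snprintf(path, sizeof path, "%s.build-info", exe_path);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0;

    f = fopen(path, "r");
    if (f == NULL)
        return 0;   /* no metadata file next to the executable */

    ok = fgets(buf, (int)buf_len, f) != NULL;
    fclose(f);
    if (ok)
        buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline */
    return ok;
}
```

The lookup depends on the file travelling with the executable: if the binary is moved without it, the function simply reports that no build info is available.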

nojb commented 8 months ago

> However I've always wondered why not just use a simple metadata file, like .bin.build-info which is expected next to the binary?

Good point.

rgrinberg commented 8 months ago

I think the main disadvantage is that it makes the executables somewhat less relocatable.

hhugo commented 8 months ago

Base and ppx_inline_test rely on a similar model that seems to work across platforms and bytecode/native/jsoo. See https://github.com/janestreet/base/blob/master/src/am_testing.c

nojb commented 8 months ago

> I think the main disadvantage is that it makes the executables somewhat less relocatable.

In what sense? As long as the metadata file is moved along with the executable, everything should continue to work after a move, right? Am I missing something?

ejgallego commented 8 months ago

Indeed, whether it is a metadata file or a .so library, both need to go next to the executable, right?
