itanium-cxx-abi / cxx-abi

C++ ABI Summary

P1874: Dynamic Initialization Order of Non-Local Variables in Modules #99

Open urnathan opened 4 years ago

urnathan commented 4 years ago

P1874 has an ABI impact as it applies to named modules.

Header-units are affected, but as they have no distinct object file, existing ABI features support the requirements of 1874.

The dynamic initializers of a (transitively) imported module must be complete before the dynamic initializers of (entities declared after the import) in the importing TU are run. The simplest mechanism I was envisioning was for each module interface to provide an idempotent global function, with a symbol name derivable from the module name. Importers call that function in their own global initializer function. Presumably the function would also be called from .init, .init_array or .ctors as appropriate. Richard suggests a range of options in an email -- perhaps copy that here?

While functional, this does have the unfortunate property of a call sequence mirroring the shape of the import DAG. It'd be nice to avoid that and other sadness. Here's a feature list, let's start there:

  1. consumers of a module can independently compile the interface to get a CMI, using an already-provided object file. This implies the consumer doesn't know how clever the producer is in reducing dynamic initializers to static initializers.
  2. module interface authors can add dynamic initializations without the consumers needing to recompile. Implying that consumers never know whether an interface has dynamic initializations.
  3. No new (static or dynamic) linker technology. If new technology is available, we should be able to use it though.
  4. Consumers of a primary interface do not need to know about partitions. The primary interface's dynamic initializer needs to take care of them (possibly indirectly for the case of a partition only imported into other partitions).
  5. Interoperable across a shared-object boundary. The object code for an import could be in a shared object. Code in a shared object could have its import's code provided by the main executable.

IMHO all but no2 are hard requirements. I'm ambivalent on that one. Perhaps it could be achieved via no3 -- namely, if you have a to-be-invented linker, you get no2. Otherwise not.

Are there other features I've missed?

urnathan commented 4 years ago

Although not mentioned in 1874, I think we need the same semantics for global-scope thread local storage.

urnathan commented 4 years ago

Richard described several schemes, but settled on 2.

  1b) An initialization function symbol per module
  2b) An initialization function symbol per importable module unit

Use 1b for initializing dependency modules; call it unconditionally. Use 2b for initializing importable module units; allow calls to be dropped if they can be seen to be no-ops (always allow intra-module optimizations).

These match what I was envisioning, but have the nuance of distinguishing importation of the primary interface from importation of partitions (if I understand what 'dependency module' means). However, I think that distinction is essentially meaningless. The set of CMIs comprising a partitioned module needs to be built by a single compiler (and I think we're all intending on providing a mechanism by which importers of the primary interface are oblivious to any partition-ness). But that compiler need not be the compiler that builds the object files that get linked. So it seems to be an implementation detail as to exactly if or how intra-module initialization is optimized.

Each general initialization function would look like:

  module_Foo_init () {
    idempotency-check
    for DI in {direct imports}
      call module_${DI}_init
    ...
    do Foo's global inits
  }

If we broke that apart into 2 pieces, the first part calls the direct imports' general init fns. The second part is our specific init fn. Then a smart linker could analyse the initial set of calls to reconstruct the import graph. Then it merely needs to emit code to call the set of second parts in any DFS order.
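A minimal sketch of that flattening step, assuming the linker has already recovered each module's direct imports from the first part of its initializer. The names `ImportGraph`, `flatten` and `flat_init_order` are hypothetical and exist only for illustration; nothing here is part of any ABI or linker. It computes a post-order DFS, so every module's second part is scheduled after those of its imports, and each exactly once.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: module name -> names of its direct imports, as a
// linker might recover them from the first part of each initializer.
using ImportGraph = std::map<std::string, std::vector<std::string>>;

// Post-order DFS: a module is appended only after all of its imports.
void flatten(const ImportGraph& graph, const std::string& name,
             std::set<std::string>& seen, std::vector<std::string>& order) {
    if (!seen.insert(name).second)
        return;  // already scheduled, so no idempotency check is needed later
    auto it = graph.find(name);
    assert(it != graph.end() && "every import must be present in the graph");
    for (const std::string& import : it->second)
        flatten(graph, import, seen, order);
    order.push_back(name);
}

// Returns the flat sequence in which the "second parts" could be called.
std::vector<std::string> flat_init_order(const ImportGraph& graph,
                                         const std::vector<std::string>& roots) {
    std::set<std::string> seen;
    std::vector<std::string> order;
    for (const std::string& root : roots)
        flatten(graph, root, seen, order);
    return order;
}
```

With Quux importing Foo and Bar, and Bar importing Foo, this schedules Foo, then Bar, then Quux. Any other post-order traversal would be equally valid.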

It would be nice to share some of the machinery with .init_array, but I'm not quite sure how to achieve that without pessimizing the idempotency check. Thinking ...

urnathan commented 4 years ago

Here is Richard's suggested mangling scheme, so we don't lose that:

As a starting point for mangling these initialization functions (which should be of type void ()), how about:

(existing)

  ::= W + E
  ::= W * E

(new)

  ::= W + P + E
  ::= GI    # dynamic initialization for primary interface unit of module
  ::= GI    # dynamic initialization for specified module partition

where the P in the partition form denotes the position of the `:` token. We don't need to talk about substitutions for that form because it can only appear as the top-level encoding in a complete mangled-name, so there can never be any backreferences. (That said, we could allow the P... part in any module name so that we can include the partition name in internal linkage symbols appearing in module partitions, and if we did that, we'd want substitution rules for them. But I'm inclined to think that we shouldn't mangle module names into internal linkage symbol names at all, especially given that we could only do so for importable module units anyway, and I'd expect most internal linkage symbols in a module to be in regular implementation units.)

urnathan commented 4 years ago

It is only the static linker that needs to optimize the initialization sequence -- the cost of the dynamic linker optimizing the sequence and then executing it, is going to be greater than just executing it the once. That's not true for TLS initialization, but, if the static linker is optimizing, it can optimize both. Therefore again, not worth teaching the dynamic linker that.

urnathan commented 4 years ago

I think this exemplifies what I'm thinking. The block of calls to direct import initializers is demarcated with a special symbol. That way the static linker doesn't need to disassemble the idempotency preamble. Using direct calls is going to be better than an array of fn pointers, both for speed and space, as the calls will (probably) use 4 byte addresses, not 8 byte pointers.

The static linker can nop out calls it knows are unneeded. And we could give it license to alter the contents of the .init_array section, when that references at least one module global initializer function. Thereby just flattening the call graph.

// export module Quux;
// import Foo;
// import Bar;
// import Baz;
extern void ZGI_Foo ();
extern void ZGI_Bar ();
extern void ZGI_Baz ();

int init () { return 42; }

int myvar = init ();

static __attribute__ ((constructor)) void ZGI_Quux ()
{
  // no need to worry about concurrency.  We're a single thread
  static bool done;
  if (done) return;
  done = true;

  __asm ("_ZGD_Quux:"); // mark start of direct import init calls
  ZGI_Foo ();
  ZGI_Bar ();
  ZGI_Baz ();

  // Quux's inits
  myvar = init ();
}

urnathan commented 4 years ago

Perhaps better mangling prefixes are 'GI' for the initialization function, and 'Gi' for the sequence of calls.

I can't find thread_local support in the ABI. By inspection GCC emits per-global thread_local dynamic initialization functions, using a TH prefix. Users unsure of whether there is a global init ('extern thread_local T var;'), provide a weak declaration of the function, and do the call-if-non-zero dance. The initializer function maintains a per-thread guard variable. There's no guarantee that thread_locals declared earlier in a TU are dynamically initialized before a later variable is. Perhaps we can just punt on thread_local? Otherwise, it seems a quite expensive thread-startup cost, unless we can guarantee the static linker optimizes it. Global thread locals are expensive, but at least you only pay that cost per-use.
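A rough sketch of the per-thread guard pattern described above. This is not GCC's actual output: the names `var_tls_init`, `var_guard`, `var_storage` and `var_access` are hypothetical stand-ins for the mangled TH-prefixed function and its guard, and the weak-declaration check that importers would perform is not modelled.

```cpp
#include <cassert>

// Hypothetical stand-ins; a real compiler emits mangled symbols instead.
thread_local bool var_guard = false;  // per-thread "already initialized" flag
thread_local int var_storage = 0;     // storage of the thread_local variable
thread_local int init_runs = 0;       // illustration only: counts initializer runs

int compute_initial_value() { return 42; }

// Per-variable dynamic initializer: runs its body at most once per thread.
void var_tls_init() {
    if (var_guard)
        return;
    var_guard = true;
    ++init_runs;
    var_storage = compute_initial_value();
}

// Every use goes through a wrapper like this, which is where the
// pay-per-use cost mentioned above comes from.
int& var_access() {
    var_tls_init();
    assert(var_guard);
    return var_storage;
}
```

Because the guard is itself thread_local, each thread that touches the variable runs the initializer once, and threads that never use it pay nothing.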

urnathan commented 4 years ago

We have a vendor extension of constructor priorities, I suspect other implementations have that too. I do not propose doing anything to change that -- priorities are essentially an attempt at getting some kind of global constructor ordering, and it's somewhat brittle. It would be very expensive to apply the kind of ordering 1874 requests across each priority level. Let's not go there.

urnathan commented 3 years ago

While the above scheme works, we could do better. I disregarded this potential optimization as I thought it required collusion between the compiler generating the object file and the compiler generating the CMI. These are very often the same compilation, but that need not be the case: in GCC I added -fmodule-only to elide generation of the object file, at user request. Now that modules have landed in GCC, I have of course received a bug report about unnecessary startup goop.

Specifically, when compiling interface X, we know whether that object has dynamic initializers and/or imports modules that have them. So 1) we can elide calling a global-module-ctor for imports that have no dynamic initialization, and 2) if we also have no dynamic initializers, we do not need to emit the global module constructor. But doing so relies on us knowing that the compiler emitting the object has made the same determination about the existence of dynamic initializations. It might have made a different decision about promoting a dynamic initialization to a static one (a different optimization level?). If 'has dynamic initialization' is determined before any such promotion, then it is a property of the source, and we can rely on it.

Thus I propose: 1) you only emit calls to known-needed module-global-ctors (this can be indicated by some flag in the module's CMI), and 2) you only emit a module-global-ctor when it is known non-empty.
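One way to picture the proposal, as a sketch only: the `CmiInfo` record and its fields are invented for illustration and do not describe any real CMI format. A module needs a global ctor if it has source-level dynamic initialization or if any of its imports needs one, and an importer emits calls only for imports that need one.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical facts a CMI might record about a module.
struct CmiInfo {
    bool has_dynamic_init;             // decided from source, before any promotion
    std::vector<std::string> imports;  // direct imports
};
using Cmis = std::map<std::string, CmiInfo>;

// True if the module, or anything it transitively imports, has dynamic
// initialization. Recursion terminates because imports form a DAG. A real
// compiler would more likely store this result as a flag in the CMI.
bool needs_ctor(const Cmis& cmis, const std::string& name) {
    auto it = cmis.find(name);
    assert(it != cmis.end());
    if (it->second.has_dynamic_init)
        return true;
    for (const std::string& import : it->second.imports)
        if (needs_ctor(cmis, import))
            return true;
    return false;
}

// The calls an importer would emit under point 1 of the proposal.
std::vector<std::string> ctor_calls_to_emit(const Cmis& cmis,
                                            const std::string& name) {
    auto it = cmis.find(name);
    assert(it != cmis.end());
    std::vector<std::string> calls;
    for (const std::string& import : it->second.imports)
        if (needs_ctor(cmis, import))
            calls.push_back(import);
    return calls;
}
```

If only Foo has dynamic initializers, Bar has none, and Baz merely imports Foo, then a module importing all three would call the ctors of Foo and Baz and skip Bar. Under point 2 it would still emit its own ctor, since that ctor is non-empty.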

fweimer-rh commented 3 years ago

The dynamic linker also needs to know the number of global objects. At least in the glibc implementation, we cannot throw from an ELF constructor if there is a memory allocation failure while allocating the bookkeeping data needed to record the initialization order.

This is also a bug in the current implementation of C++ global data objects, and fixing it has ABI impact (but perhaps not at the Itanium ABI layer).

urnathan commented 3 years ago

Thanks Florian, that sounds like an orthogonal problem. There's a variant of point 2) of my proposal, in that we could emit an empty function (other than the minimal setup/teardown) in the cases where there are no dynamic inits. That'd leave point 1) to be an implementation detail, but with the ABI-visible behaviour of the dynamic-module-ctor not being a hook into which arbitrary code could be injected.