itanium-cxx-abi / cxx-abi

C++ ABI Summary

Understanding the role of demangling in the toolchain #72

Open brianthelion opened 5 years ago

brianthelion commented 5 years ago

Apologies for a brief digression here into the more sociological aspects of the ABI.

It's clear who consumes mangled names. But who consumes demangled names?

Naively, it would seem that demangling is provided just to make debugging easier. But then we see at least a couple of examples (1) (2) of semantic extraction from the mangling grammar. Are there existing tools that depend on being able to accurately invert the grammar? What are the ABI standard's responsibilities with respect to demangling support?

I'm interested from the perspective of someone who would find better demangling support helpful. In particular, code generation from demangled names could be possible if the grammar were less ambiguous about the namespace/class distinction. I'd love to understand where the maintainers and the broader community of ABI consumers stand on this sort of thing.

zygoloid commented 5 years ago

The mangled name grammar intends to provide a unique (and ideally as short as is reasonably possible) name for each external-linkage symbol, and to be (sufficiently-experienced-)human readable as well as sufficient to allow a demangler such as __cxa_demangle to produce a more pleasant human readable description of the symbol. But it's not designed to be such a sufficiently rigid representation of the original source as to allow full reconstruction of the original declaration, and I'd view that as scope creep that we should not succumb to (despite being the author of your (1)). For example, I think it would be reasonable to add a namespace/class distinction only if it provides benefits for one of the intended use cases.
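For readers unfamiliar with __cxa_demangle, here is a minimal sketch of calling it from C++. The helper name `demangle` is hypothetical, and the sketch assumes a toolchain that ships `<cxxabi.h>` (GCC's libstdc++ or LLVM's libc++abi).

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle
#include <cstdlib>    // std::free
#include <memory>
#include <string>

// Hypothetical helper: returns the demangled form of `mangled`, or the
// input unchanged if it is not a valid Itanium-mangled name.
std::string demangle(const char* mangled) {
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so it is freed
    // with std::free once it has been copied into the std::string.
    std::unique_ptr<char, decltype(&std::free)> out(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), &std::free);
    return (status == 0 && out) ? std::string(out.get())
                                : std::string(mangled);
}
```

Calling `demangle("_ZN3foo3barEv")` yields `foo::bar()`. Nothing in that output, or in the mangled name itself, says whether `foo` is a namespace or a class, which is the distinction discussed in this thread.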

brianthelion commented 5 years ago

@zygoloid Appreciate the context, thank you.

The main use-case that I'm interested in involves LD_PRELOAD. Given the importance of function interposition in the developer's toolkit, I don't think it's unreasonable to request that LD_PRELOAD use-cases be admitted to the "intended" category if they aren't there already. You may disagree, but let me elaborate:

99% of the time, effective function interposition through LD_PRELOAD has

  1. An explicit dependency on mangled names; and
  2. An implicit dependency on demangled names.

The explicit dependency on the mangled name is due to use of dlsym(...) as the primary mechanism of runtime interposition. Yes, some LD_PRELOAD use-cases may not involve interposition at all, but those are in the 1% as far as I can tell.
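As a rough illustration of that explicit dependency, the sketch below shows the usual shape of an interposition shim. Everything about the target is an assumption made for the example: the symbol `_ZN3foo3barEv`, its signature `void foo::bar()` with `foo` taken to be a namespace, and the helper name `next_symbol`. It relies on `RTLD_NEXT`, which glibc exposes under `_GNU_SOURCE` (defined by default by g++).

```cpp
#include <dlfcn.h>   // dlsym, RTLD_NEXT

// Assumed signature of the hypothetical target `void foo::bar()`.
using bar_fn = void (*)();

// Hypothetical helper: finds the next definition of a symbol, by its
// mangled name, in the objects searched after the one containing this code.
void* next_symbol(const char* mangled) {
    return dlsym(RTLD_NEXT, mangled);
}

// The shim exports the same mangled name (_ZN3foo3barEv), so when the
// shared library is listed in LD_PRELOAD the dynamic linker binds
// callers to this definition first.
namespace foo {
void bar() {
    // Resolve the real implementation once, on first call.
    static auto real = reinterpret_cast<bar_fn>(next_symbol("_ZN3foo3barEv"));
    // ... tracing or other interposed behavior would go here ...
    if (real) real();   // forward to the original, if one exists
}
}  // namespace foo
```

The lookup string has to be the mangled name, and the shim's own declaration has to mangle to exactly the same string. If `foo` were really a class and `bar` a non-static member, the real function would also expect a `this` argument that this sketch does not pass.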

The implicit dependency on the demangled name comes when trying to get LD_PRELOAD to correctly catch flow control in the first place. To do this, the author of the interposition shim has to correctly reverse-engineer the declaration for the callable that s/he wants to get in front of. The workflow there almost always starts with demangling. Due to the ambiguities in the mangling grammar, source code exporting the target symbol is required to achieve certainty about the callable declaration's precise syntax. This is problematic, especially when no source code is available.

On the whole, the argument is: (a) LD_PRELOAD is important and (b) critical LD_PRELOAD workflows depend on demangling, so (c) the ABI should provide enhanced support for those use-cases. Interested to hear your thoughts.

Cheers!

mglisse commented 5 years ago

> To do this, the author of the interposition shim has to correctly reverse-engineer the declaration for the callable that s/he wants to get in front of. The workflow there almost always starts with demangling. Due to the ambiguities in the mangling grammar, source code exporting the target symbol is required to achieve certainty about the callable declaration's precise syntax. This is problematic, especially when no source code is available.

I don't think it is true that reverse-engineering is the first step. The natural first step is reading the documentation and sources (possibly partial sources, like a header provided to compile plugins). If you do not have any sources, that means that whoever provided the object to you does not support your interposition, and they could have obfuscated it by renaming all the mangled symbols to just "a", "b", etc. Even if you do have the original mangled names, the class/namespace distinction is likely to be much less of an issue than finding the layout of classes, expected semantics, etc.

brianthelion commented 5 years ago

@mglisse Thanks for your thoughts here as well. I think you may be overlooking the most critical and popular use-case for interposition, namely, tracing. Given that context, I offer a rebuttal:

I look forward to your feedback.

Cheers!