Tracking issue for RFC 2603, "Rust Symbol Mangling (v0)"

Centril commented 5 years ago

This is a tracking issue for the RFC "Rust Symbol Mangling (v0)" (rust-lang/rfcs#2603).

Current status:

Since #90128, you can control the mangling scheme with -C symbol-mangling-version, which can be:

legacy: the older mangling version, still the default currently
- explicitly specifying this is unstable-only and also requires -Z unstable-options (to allow for eventual removal after v0 becomes the default)
v0: the new RFC mangling version, as implemented by #57967

(Before #90128, this flag was the nightly-only -Z symbol-mangling-version)

To test the new mangling, set RUSTFLAGS=-Csymbol-mangling-version=v0 (or change rustflags in .cargo/config.toml). Please note that only symbols from crates built with that flag will use the new mangling, and that tool support (e.g. debuggers) will be limited initially, until everything is upstreamed. However, RUST_BACKTRACE and rustfilt should work out of the box with either mangling version.
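Since only crates built with the flag use the new mangling, a tool that scans a binary may see both schemes side by side. The sketch below guesses the scheme from the shape of the symbol, assuming v0 symbols start with `_R` and legacy symbols look like `_ZN...17h<16 hex digits>E`. The names `Scheme`, `has_legacy_hash` and `guess_scheme` are hypothetical, and platform-specific variations (extra or missing leading underscores) are deliberately ignored.

```rust
/// Which mangling scheme a symbol appears to use (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub enum Scheme {
    V0,
    Legacy,
    Unknown,
}

/// Returns true if `symbol` ends in `17h<16 hex digits>E`, the hash
/// component that legacy-mangled Rust symbols carry.
pub fn has_legacy_hash(symbol: &str) -> bool {
    let body = match symbol.strip_suffix('E') {
        Some(body) => body.as_bytes(),
        None => return false,
    };
    if body.len() < 19 {
        return false;
    }
    let tail = &body[body.len() - 19..];
    tail.starts_with(b"17h") && tail[3..].iter().all(|b| b.is_ascii_hexdigit())
}

/// Heuristic only: assumes `_R` marks v0 and `_ZN` plus a trailing hash
/// marks legacy. Anything else (including plain C++ names) is `Unknown`.
pub fn guess_scheme(symbol: &str) -> Scheme {
    if symbol.starts_with("_R") {
        Scheme::V0
    } else if symbol.starts_with("_ZN") && has_legacy_hash(symbol) {
        Scheme::Legacy
    } else {
        Scheme::Unknown
    }
}
```

A real tool would hand the symbol to a demangler rather than rely on prefixes; this is only meant to show how the two schemes can be told apart at a glance.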

Steps:

[x] Implement the RFC (https://github.com/rust-lang/rust/pull/57967 + https://github.com/alexcrichton/rustc-demangle/pull/23)
[x] Upstream C implementation of the demangler to:
- [x] binutils/gdb (GNU libiberty)
  - [x] [PATCH] Move rust_{is_mangled,demangle_sym} to a private libiberty header. committed as https://github.com/gcc-mirror/gcc/commit/979526c9ce7bb79315f0f91fde0668a5ad8536df
  - [x] [PATCH] Simplify and generalize rust-demangle's unescaping logic. committed as https://github.com/gcc-mirror/gcc/commit/42bf58bb137992b876be37f8b2e683c49bc2abed
  - [x] [PATCH] Remove some restrictions from rust-demangle. committed as https://github.com/gcc-mirror/gcc/commit/e1cb00db670e4eb277f8315ecc1da65a5477298d
  - [x] [PATCH] Refactor rust-demangle to be independent of C++ demangling. (original submission) committed as https://github.com/gcc-mirror/gcc/commit/32fc3719e06899d43e2298ad6d0028efe5ec3024
  - [x] [PATCH] Support the new ("v0") mangling scheme in rust-demangle. (original submission) committed as https://github.com/gcc-mirror/gcc/commit/84096498a7bd788599d4a7ca63543fc7c297645e
- [x] Linux perf (through binutils 2.36 and/or libiberty 11.0, or later versions - may vary between distros)
- [x] valgrind
[x] Implement demangling support in LLVM, including lldb, lld, llvm-objdump, llvm-nm, llvm-symbolizer, llvm-cxxfilt
[x] Resolve issue around rustc generating invalid symbol names (https://github.com/rust-lang/rust/issues/83611)
[ ] Adjust documentation (see instructions on rustc-guide)
[ ] Stabilization PR (see instructions on rustc-guide)

Unresolved questions:

[x] Punycode vs UTF-8, some prior discussion in https://github.com/rust-lang/rust/issues/7539
[x] Encoding parameter types for function symbols

Desired availability of tooling:

Linux:

Tools: binutils, gdb, lldb, perf, valgrind

Distro	Has versions of all tools with support?
Debian (latest stable)	?
Arch	?
Ubuntu (latest release)	?
Ubuntu (latest LTS)	?
Fedora (latest release)	?
Alpine (latest release)	?

Windows:

Windows does not have support for demangling either legacy or v0 Rust symbols and requires debuginfo to load the appropriate function name. As such, no special support is required.

macOS:

More investigation is needed to determine to what extent macOS system tools already support Rust v0 mangling.

oli-obk commented 3 years ago

yea... just make sure that anything that uses destructure_const needs a feature gate, I'm not too sure it behaves soundly for const generics in all cases

michaelwoerister commented 3 years ago

@eddyb, I'm not sure we are talking about the same thing here. I'm proposing to make rustc emit <type> <const-data> (with <const-data> being a hash of the constant's value) just as a temporary solution until we have a proper grammar for ADT constants. Are you proposing to get rid of the <const> = <type> <const-data> production entirely?

eddyb commented 3 years ago

Are you proposing to get rid of the <const> = <type> <const-data> production entirely?

That production is ~~a lie~~ more general than what is implemented (which is only integer types, bool and char for the <type> part).

Right now demanglers will treat any other type in that position as an error, so using it for that purpose requires changing demanglers, just like adding a special form for the opaque hash-only leaves, but leaving less encoding space usable by any future ADT mangling.

michaelwoerister commented 3 years ago

@eddyb: OK, I'm all for solving this properly now -- iff we can get it done in a reasonable amount of time 😃

It sounds like you and @oli-obk have come up with an exhaustive list of things the grammar needs to support, right? But it also sounds like the implementation on the compiler side is somewhat complicated by the fact that destructure_const isn't quite reliable yet?

oli-obk commented 3 years ago

But it also sounds like the implementation on the compiler side is somewhat complicated by the fact that destructure_const isn't quite reliable yet?

it should work just fine for all normal aggregates like arrays, tuples, enums and structs, but there are certainly types it cannot handle or will handle weirdly. I'm fairly certain that all currently legal const generic types will work just fine, so that should be ok. It's just that extending the list of legal types is not trivially ok.

eddyb commented 3 years ago

@eddyb: OK, I'm all for solving this properly now -- iff we can get it done in a reasonable amount of time

Initial mangling implementation (and grammar) up at #87194 - I spent more time convincing myself that I couldn't cut certain corners in the grammar, writing the comments for it, and refactoring the current handling of placeholders (which is in a separate commit), than adding the new support.

That's mostly because deref_const and destructure_const already exist, and have been around for a few months if not longer, so we could've had this done before #85530 was opened - a lot of my PR is just copy-paste from ty::print::pretty, and adjusting the output to be the mangling we want (instead of user-facing).

Hopefully the extended constant mangling grammar doesn't end up being a bikeshed of its own.

joshtriplett commented 2 years ago

Would it be reasonable, before changing the default, to stabilize the option to change the symbol mangling format? That would allow people to opt into the v0 format, and in particular would unblock the usage of it in tools such as instrumentation.

I'd be happy to submit a patch stabilizing the option.

eddyb commented 2 years ago

Would it be reasonable, before changing the default, to stabilize the option to change the symbol mangling format? That would allow people to opt into the v0 format, and in particular would unblock the usage of it in tools such as instrumentation.

I'd be happy to submit a patch stabilizing the option.

That would result in us having to support the old mangling indefinitely, I think that differs from what was previously discussed.

We could start by making the default depend on nightly vs stable, so that it's only v0 on nightly where you can opt out of it with the unstable flag.

Unless, hmm, maybe you meant stabilizing the CLI flag but not (all) its values, i.e. -C symbol-mangling-version=legacy would require -Z unstable-options, but -C symbol-mangling-version=v0 wouldn't.

I think I could get behind that, guaranteeing only the versions that have gone through RFCs.

joshtriplett commented 2 years ago

@eddyb Right, I was proposing stabilizing the option but not guaranteeing that it supports any particular value. We could choose to stabilize the v0 value now, and then consider stabilizing the legacy value when we change the default (since there's no reason to pass it at all until the default changes).

joshtriplett commented 2 years ago

@eddyb I submitted https://github.com/rust-lang/rust/pull/90128 to stabilize -C symbol-mangling-version=v0.

nnethercote commented 2 years ago

An issue came up in https://github.com/rust-lang/rust/pull/89917#issuecomment-963755731 that is worth mentioning here: the compiler currently generates some v0 symbols that have a .llvm.<numbers> suffix, which violate the v0 spec (it doesn't allow '.' chars), and some v0 demangler implementations (libiberty and Valgrind) fail to demangle these symbols.

Either the compiler should be fixed to not append these suffixes (which may be hard, because it's LLVM that's adding them) or the v0 spec should be modified to permit these suffixes, and the libiberty/Valgrind implementations should be updated accordingly.

eddyb commented 2 years ago

Either the compiler should be fixed to not append these suffixes (which may be hard, because it's LLVM that's adding them) or the v0 spec should be modified to permit these suffixes, and the libiberty/Valgrind implementations should be updated accordingly.

I don't think either of them is correct - or at least rustc-demangle doesn't do either, and does handle those pesky suffixes.

Does C++ Itanium mangling allow for the .llvm. suffixes? AFAIK no, but you should get them if you use Clang with LTO.

What rustc-demangle does, and what these tools should probably also do, is limit the symbol to just before the suffix, before attempting to demangle at all with any mangling scheme.

IOW, I believe this current behavior is a bug:

$ c++filt _ZN3foo3barE
foo::bar
$ c++filt _ZN3foo3barE.llvm.123
_ZN3foo3barE.llvm.123

(at least in a world where LLVM's LTO exists - one could argue that they screwed up by doing this)

EDIT: some precedent for similar issues with compiler passes suffixing symbols (tho the fix seems to be way more high-level than I would want): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40831
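The "limit the symbol to just before the suffix" step described above could look roughly like the following. This is a minimal sketch, not rustc-demangle's actual code: the function name `strip_llvm_suffix` is hypothetical, and it only handles the `.llvm.<numbers>` form mentioned in this thread, assuming the part after the marker is purely decimal digits.

```rust
/// Returns `symbol` without a trailing `.llvm.<digits>` suffix, if present.
/// Assumption: only decimal digits follow the marker; anything else is
/// left untouched so that unrelated dots in a name are not cut off.
pub fn strip_llvm_suffix(symbol: &str) -> &str {
    const MARKER: &str = ".llvm.";
    if let Some(i) = symbol.find(MARKER) {
        let tail = &symbol[i + MARKER.len()..];
        if !tail.is_empty() && tail.bytes().all(|b| b.is_ascii_digit()) {
            return &symbol[..i];
        }
    }
    symbol
}
```

The stripped result is what would then be passed to whichever demangler applies, so `_ZN3foo3barE.llvm.123` would be demangled as `_ZN3foo3barE`.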

bjorn3 commented 2 years ago

Does C++ Itanium mangling allow for the .llvm. suffixes? AFAIK no, but you should get them if you use Clang with LTO.

https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling-general

Mangled names containing $ or . are reserved for private implementation use. Names produced using such extensions are inherently non-portable and should be given internal linkage where possible.

It does kind of allow them.

nbdd0121 commented 2 years ago

Does C++ Itanium mangling allow for the .llvm. suffixes? AFAIK no, but you should get them if you use Clang with LTO.

Yes.

   <mangled-name> ::= _Z <encoding>
                  ::= _Z <encoding> . <vendor-specific suffix>
   <encoding> ::= <function name> <bare-function-type>
         ::= <data name>
         ::= <special-name>
A <mangled-name> containing a period represents a vendor-specific version or portion of the entity named by the <encoding> prior to the first period. There is no restriction on the characters that may be used in the suffix following the period.

nnethercote commented 2 years ago

https://bugs.kde.org/show_bug.cgi?id=445916 has been filed for possibly updating Valgrind's v0 demangler to handle these suffixes, though it's still a bit unclear to me if that's the right thing to do.

comex commented 2 years ago

though it's still a bit unclear to me if that's the right thing to do.

Well, if it helps:

With LLVM, names like foo.llvm.3285396211802591752 and names like foo.5 both arise when a symbol that originally had linkage local to a single object file, and thus only needed a locally unique name, goes through LTO and suddenly needs a globally unique name.


Names like foo.llvm.3285396211802591752 come from ThinLTO when a local symbol ends up being referenced from a different object file, because some reference to it that was originally from the first object file was inlined into the second one. Since ThinLTO compiles each bitcode object file into its own native object file, such a symbol has to be changed to global linkage so that the native linker can resolve the reference; the suffix is preemptively attached to avoid clashes with any other symbols also named foo (which could be either a local symbol from another object file that was transformed in the same way, or a symbol that was already global).
Names like foo.5 come from full LTO. Full LTO combines everything into a single native object file, so symbols that were local in the input bitcode object files can stay local in the output native object file, but the space of "local" names now includes all symbols from all of the input files. Since full LTO is not incremental, the suffix is not attached preemptively, but only when there actually are two symbols with the same name from two different input files.

In both cases, the suffixed foo has the same semantics as the original foo, and the suffix is just to ensure a unique name. So it should be fine to ignore the suffix.

But there are also names like foo.cold.1, which represents a partial chunk of foo that was split off into its own function. In this case, foo.cold.1 does not have the same semantics as foo. Ignoring the suffix is still fine if you are just trying to symbolicate a backtrace. But for more obscure use cases it may not be fine. Suppose you want to log whenever a function is called; treating each call to foo.cold.1 as a call to foo would be misleading.
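The three suffix shapes described above can be told apart mechanically. The sketch below is only an illustration under stated assumptions: the names `SuffixKind`, `classify_suffix` and `same_semantics_as_base` are hypothetical, the base name is assumed to contain no `.` of its own (true for v0, not for legacy symbols), and only the exact forms `.llvm.N`, `.N` and `.cold.N` from this comment are recognized.

```rust
/// Kinds of LLVM-added suffix discussed in the thread (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SuffixKind {
    Absent,  // no '.' in the name at all
    ThinLto, // foo.llvm.3285396211802591752
    FullLto, // foo.5
    Cold,    // foo.cold.1 (hot-cold splitting)
    Other,   // anything else after the first '.'
}

fn is_number(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Classifies whatever follows the first '.' in `symbol`.
pub fn classify_suffix(symbol: &str) -> SuffixKind {
    let suffix = match symbol.find('.') {
        Some(i) => &symbol[i + 1..],
        None => return SuffixKind::Absent,
    };
    if is_number(suffix) {
        return SuffixKind::FullLto;
    }
    if let Some(n) = suffix.strip_prefix("llvm.") {
        if is_number(n) {
            return SuffixKind::ThinLto;
        }
    }
    if let Some(n) = suffix.strip_prefix("cold.") {
        if is_number(n) {
            return SuffixKind::Cold;
        }
    }
    SuffixKind::Other
}

/// True when the suffixed symbol denotes the same function as the base
/// name, i.e. the suffix exists only for uniqueness.
pub fn same_semantics_as_base(kind: &SuffixKind) -> bool {
    matches!(
        kind,
        SuffixKind::Absent | SuffixKind::ThinLto | SuffixKind::FullLto
    )
}
```

A backtrace symbolizer could ignore the result entirely, while a call logger might want to report `Cold` fragments separately, which is the distinction drawn above.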

comex commented 2 years ago

Also, regarding the idea of just getting rid of the suffixes:

For Rust code, it might theoretically be possible to get rid of suffixes that exist to ensure global uniqueness, under the assumption that Rust compilers will never actually produce two unrelated symbols with the same mangled name. But C and C++ compilers can and do produce such symbols (those languages expose local linkage directly in the form of static, so you just need static functions/variables with the same name in two different source files). So Valgrind's C++ demangler at least ought to be dealing with these suffixes, and in theory the same code could be reused for Rust.

On top of that, even in Rust code, suffixes like .cold.1 can't be avoided unless you want to get rid of the hot-cold splitting optimization altogether. When LLVM splits out a chunk of a function into its own function, it has to give the new function some name. It can't give it the same name as the original function since the original still exists. In lieu of that, naming it after the original function plus a suffix is clearly more useful than, say, giving it a random name.

str4d commented 2 years ago

I'm writing a Ghidra script for demangling Rust symbols, and I have a question about the <const> grammar:

<const> = <type> <const-data>
        | "p" // placeholder, shown as _
        | <backref>

The "p" and <backref> cases are duplicated, because <type> is defined as:

<type> = <basic-type>
       | (... elided ...)
       | <backref>

<basic-type> = "a"      // i8
             | (... elided ...)
             | "p"      // placeholder (e.g. for generic params), shown as _

So it seems to me that <const> can be parsed as either "p" or "p" <const-data>, and similarly either <backref> or <backref> <const-data>. Is this intentional (i.e. is <const-data> partially-optional)? It seems like I need to do the following to process <const>:

Parse input as <type>.
- If <type> succeeds, parse rest with <const-data>.
  - If <const-data> succeeds, return <type> <const-data>.
  - If <const-data> fails, backtrack and inspect <type>.
    - If <type> in ["p", <backref>], return <type>.
    - Else, <const> fails.
- If <type> fails, then we know that input was not either "p" or <backref> (otherwise it would have succeeded), so <const> fails.

But I'm also not sure how to interpret a <const> with an optional <const-data>.

As an aside, this feels ambiguous, because unlike other optional parts of the grammar, <const-data> has no distinguishing prefix:

<const-data> = ["n"] {<hex-digit>} "_"

I think it's actually unambiguous, but only indirectly, due to how sentinel characters of other parts of the grammar were selected. Say we have [[T; M]; N]:

"A" "A" <type> <const> <const>

Given the current grammar, that could potentially be either:

"A" "A" <type> "p" {<hex-digit>} "_" <const>

or:

"A" "A" <type> "p" <basic-type> <const-data>

and <hex-digit> can collide with <basic-type>.

However, there's nowhere else in the grammar that allows <basic-type> to be directly followed by "_", so if the data contained "p" <const-data> but the parser tried to interpret <const> as "p" first, the parser will eventually figure out the problem and can backtrack.

I believe (but have not analyzed) that a collision in the other direction should also be detected indirectly. This partial optionality seems somewhat brittle though, especially as the RFC leaves extending <const-data> as a future task.
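For reference, the `<const-data> = ["n"] {<hex-digit>} "_"` production quoted above is small enough to parse directly. The following is a sketch under assumptions, not any demangler's real code: `ConstData` and `parse_const_data` are hypothetical names, hex digits are assumed to be lowercase, an empty digit run is treated as zero, and values that do not fit in a `u64` are rejected for simplicity.

```rust
/// Parsed form of `<const-data>`: an optional "n" (negative) marker
/// followed by a hex magnitude (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct ConstData {
    pub negative: bool,
    pub magnitude: u64,
}

/// Parses `["n"] {<hex-digit>} "_"` from the front of `input` and returns
/// the parsed value together with the unconsumed remainder.
pub fn parse_const_data(input: &str) -> Option<(ConstData, &str)> {
    // Optional leading "n" marks a negative value.
    let (negative, rest) = match input.strip_prefix('n') {
        Some(r) => (true, r),
        None => (false, input),
    };
    // Everything up to the terminating "_" must be lowercase hex digits.
    let end = rest.find('_')?;
    let digits = &rest[..end];
    if !digits.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f')) {
        return None;
    }
    // Assumption: an empty digit run means zero; more than 64 bits fails.
    let magnitude = if digits.is_empty() {
        0
    } else {
        u64::from_str_radix(digits, 16).ok()?
    };
    Some((ConstData { negative, magnitude }, &rest[end + 1..]))
}
```

Because there is no distinguishing prefix, the parser can only fail late (on a non-hex character or a missing `_`), which is the brittleness the comment above points at.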

str4d commented 2 years ago

I had a look at how rustc implements const mangling: https://github.com/rust-lang/rust/blob/028c6f1454787c068ff5117e9000a1de4fd98374/compiler/rustc_symbol_mangling/src/v0.rs#L578-L733

I think this corresponds to the following grammar:

<const> = "p"
        | <backref>
        | <subset-of-type> <const-data>

This also seems to be the grammar that rustc-demangle uses.

So I now believe this is a bug in the RFC.
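Under the grammar inferred above, the first byte of a `<const>` is enough to pick the alternative, so no backtracking is needed. The sketch below illustrates that dispatch only; `ConstStart` and `classify_const` are hypothetical names, and the set of type tags (the RFC's letters for the integer types, `bool` and `char`) is an assumption based on the earlier statement that only those types are implemented in that position.

```rust
/// Which alternative of `<const>` an input starts with (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ConstStart {
    Placeholder, // "p"
    Backref,     // "B" <base-62-number>
    TypedData,   // <subset-of-type> <const-data>
    Invalid,
}

/// Looks only at the first byte; the rest of the input is not validated.
pub fn classify_const(input: &str) -> ConstStart {
    match input.as_bytes().first() {
        Some(b'p') => ConstStart::Placeholder,
        Some(b'B') => ConstStart::Backref,
        // Assumed tags: i8 i16 i32 i64 i128 isize, u8 u16 u32 u64 u128
        // usize, then bool and char.
        Some(
            b'a' | b's' | b'l' | b'x' | b'n' | b'i' | b'h' | b't' | b'm' | b'y' | b'o'
            | b'j' | b'b' | b'c',
        ) => ConstStart::TypedData,
        _ => ConstStart::Invalid,
    }
}
```

With `"p"` handled before any type tag is considered, the `"p" <const-data>` reading discussed earlier never arises.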

tmiasko commented 2 years ago

@str4d in production <const> = <type> <const-data>, <type> is never p. There are pending updates in https://github.com/rust-lang/rfcs/pull/3161 which make this explicit in the grammar.

str4d commented 2 years ago

Aha, yes that does indeed address my concern: <const> now never reaches <type> other than via "V" <path> <const-fields>, which has a separating prefix. Thanks!

nnethercote commented 2 years ago

https://bugs.kde.org/show_bug.cgi?id=445916 has details of some progress on the gcc/libiberty/valgrind side about handling the suffixes added by LLVM.

eddyb commented 2 years ago

@Amanieu According to the edit logs, almost exactly a year ago you checked the box for "Linux perf" ~~but nobody has changed this file at all for v0: torvalds/linux@fb71c86 / tools/perf/util/demangle-rust.c~~

I ran into lack of support while trying to get good symbol names with perf record -g.

However, looking closer at how it's driven, Rust v0 support seems to "just" require libbfd from binutils 2.36 (or later), so I'll add that to the checkbox, in case anyone else looks at it again (I happen to have binutils 2.35.2 instead). (EDIT: some distros seem to link binutils against libiberty from GCC sources, apparently ignoring binutils's vendored copy, so in that case libiberty 11.0 is minimum required)

If anyone is familiar with Linux kernel patches, these can be removed nowadays (since Rust legacy demangling has been working through libbfd for many years now AFAICT):

tools/perf/util/demangle-rust.c

the code invoking demangle-rust.c functions

Amanieu commented 2 years ago

I checked the box for Linux perf because no change to the kernel was required: the new demangler will automatically get picked up from the updated libiberty.

michaelwoerister commented 2 years ago

FYI, @lqd opened a PR that extends the RFC with "vendor-specific suffixes" like .llvm.123: https://github.com/rust-lang/rfcs/pull/3224

Please provide your feedback if you have any.

ojeda commented 2 years ago

If anyone is familiar with Linux kernel patches, these can be removed nowadays (since Rust legacy demangling has been working through libbfd for many years now AFAICT):

tools/perf/util/demangle-rust.c

the code invoking demangle-rust.c functions

There was https://lore.kernel.org/lkml/20220201185054.1041917-1-german.gomez@arm.com/, but German notes:

I have decided to drop this patch.

It turns out that even shipped versions of libbfd and libiberty don't demangle some of the symbols completely

For example:

(doesn't strip away the hash at the end) _ZN10rs_tracing8internal11TRACE_STATE17h41dcd282cd61069dE.0                 ==> rs_tracing::internal::TRACE_STATE::h41dcd282cd61069d
(doesn't demangle full symbol)           _ZN41_$LT$bool$u20$as$u20$core..fmt..Debug$GT$3fmt17h10f4b7b0094c3a75E.2262 ==> _$LT$bool$u20$as$u20$core..fmt..Debug$GT$::fmt::h10f4b7b0094c3a75

These are cleaned up afterwards by perf's demangler.

eddyb commented 2 years ago

It turns out that even shipped versions of libbfd and libiberty don't demangle some of the symbols completely

How is that possible? The code in demangle-rust.c is copy-pasted from what libiberty used to have for years (until I changed it when unifying it with the v0 demangler).

Also, I'm guessing the .0 and .2262 in those examples are stripped by perf before passing them off to libiberty? (since libiberty doesn't handle those suffixes correctly and just refuses to demangle entirely AFAIK - this is only now getting fixed)

Regarding the hash at the end, I think that's controlled by demangler flags (-i is --no-verbose):

$ c++filt --version
GNU c++filt (GNU Binutils) 2.35.2
Copyright (C) 2020 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) any later version.
This program has absolutely no warranty.
$ c++filt '_ZN10rs_tracing8internal11TRACE_STATE17h41dcd282cd61069dE'
rs_tracing::internal::TRACE_STATE::h41dcd282cd61069d
$ c++filt '_ZN41_$LT$bool$u20$as$u20$core..fmt..Debug$GT$3fmt17h10f4b7b0094c3a75E'
<bool as core::fmt::Debug>::fmt::h10f4b7b0094c3a75
$ c++filt -i '_ZN10rs_tracing8internal11TRACE_STATE17h41dcd282cd61069dE'
rs_tracing::internal::TRACE_STATE
$ c++filt -i '_ZN41_$LT$bool$u20$as$u20$core..fmt..Debug$GT$3fmt17h10f4b7b0094c3a75E'
<bool as core::fmt::Debug>::fmt

The lack of $ unescaping is really worrying however - I can only reproduce if I force C++ demangling:

$ c++filt --format gnu-v3 '_ZN41_$LT$bool$u20$as$u20$core..fmt..Debug$GT$3fmt17h10f4b7b0094c3a75E'
_$LT$bool$u20$as$u20$core..fmt..Debug$GT$::fmt::h10f4b7b0094c3a75

So maybe it's not the verbosity level, but "just" perf somehow forcing C++-only mode, instead of the automatic default?

Either that or really old libiberty versions, but the same code was added to both libiberty and perf around the same time.
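The difference between the verbose and `-i` outputs shown above is just the trailing `::h<hash>` path segment of a legacy symbol. A tool that receives the verbose form could drop it along these lines; this is a sketch with a hypothetical function name, assuming the hash is always `h` followed by exactly 16 hex digits as in the examples.

```rust
/// Removes a trailing `::h<16 hex digits>` segment from an already
/// demangled legacy name; other names are returned unchanged.
pub fn strip_legacy_hash(demangled: &str) -> &str {
    if let Some(i) = demangled.rfind("::h") {
        let hash = &demangled[i + 3..];
        if hash.len() == 16 && hash.bytes().all(|b| b.is_ascii_hexdigit()) {
            return &demangled[..i];
        }
    }
    demangled
}
```

The length check keeps ordinary items whose names merely start with `h` (such as a function called `hello`) from being cut off.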

nnethercote commented 2 years ago

https://bugs.kde.org/show_bug.cgi?id=445916 has been filed for possibly updating Valgrind's v0 demangler to handle these suffixes, though it's still a bit unclear to me if that's the right thing to do.

The suffixes are now handled by gcc/libiberty, and those changes have been imported into Valgrind, and this Valgrind bug has been closed.

bstrie commented 2 years ago

Now that RFC https://github.com/rust-lang/rfcs/pull/3224 has been merged to resolve the suffix question, are there any implementations that still need updating in order to handle . and $ suffixes? If not, shall we revive https://github.com/rust-lang/rust/pull/89917 for making v0 the default?

programmerjake commented 2 years ago

note that recent versions of rustc use -C symbol-mangling-version=v0 rather than the -Z flag. the top comment led me astray..

eddyb commented 2 years ago

note that recent versions of rustc use -C symbol-mangling-version=v0 rather than the -Z flag. the top comment led me astray..

@programmerjake Thanks for bringing it up (it got forgotten) - I just tried updating it, is the new version better?

programmerjake commented 2 years ago

note that recent versions of rustc use -C symbol-mangling-version=v0 rather than the -Z flag. the top comment led me astray..

@programmerjake Thanks for bringing it up (it got forgotten) - I just tried updating it, is the new version better?

yup, thx!

eddyb commented 2 years ago

So @Gankra was showing me some non-trivial v0 symbols and that got me pondering about richer presentation than just text. Even for plain text, I was able to prototype some stuff w/ jq, but that's a pile of hacks approximating "balanced <>/()" parsing.

Ideally we wouldn't be going the long way around through the "parse v0 mangling -> emit quasi-Rust (type) syntax -> parse quasi-Rust syntax -> pretty-print AST" pipeline.

I'm not sure what the status is on @michaelwoerister's v0->AST demangler (which predates my "direct"/"allocation-less" v0 demangler in rustc-demangle) but from a quick glance it should be mostly compatible already? Additional testing and/or consolidation with the rustc-demangle repo would not be hard, if anyone is interested to pick it back up. (Also, would it make sense to share code between them? Not trying to start a dozen bikesheds though, there should probably be dedicated issues for tracking anything that specific)

The trickier parts would be adding support for newer additions, e.g.:

https://github.com/rust-lang/rustc-demangle/pull/55
- OTOH, an AST demangler does have the advantage of being able to allocate e.g. the hex-encoded constant data payloads, and rely on standard APIs (String::from_utf8), instead of having to do everything on the fly like rustc-demangle does

(EDIT: @EFanZh pointed out that they also have a demangler to an AST, which I likely have seen before, so I'm really sorry I lost track of it - also, not only does it appear to have all the new consts, it also handles str constants the easy way like I was describing above, heh)

How would tools even use a demangled AST to provide a better experience? No one idea seems particularly strong on its own, but here's a few:

minimal pretty-printing should be doable with little effort
- e.g. try to fit each AST node on one line without going past terminal-width/80cols/some other limit, and once it fails for some node, all the "larger" nodes (that refer, directly or indirectly, to that one node) will be forced to a multi-line format (simply by e.g. building larger strings from smaller strings)
the presence of backrefs can be exploited to cheaply refer to the same type many times
- perhaps show "less important" occurrences in a more reduced form? (e.g. …::Foo<…> - literally using ellipses - instead of foo::Foo<Bar, Baz>)
any interactive UI can fill in details outside-in, focusing on providing the most relevant information first (without permanently hiding anything, merely requiring additional interactions, like hovering over, or clicking an …)
GUIs (and, to a lesser extent, TUIs) can vary color and other text aspects
- syntax highlighting might be the first instinct, but other concepts like LISP-style "rainbow delimiters" (in our case, for any <>/()/[]/{}) or "semantic highlighting" (typically coloring identifiers based on their name resolution results, AFAICT), are probably more relevant
I'm sure I forgot much cooler tricks, alas
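The "minimal pretty-printing" idea from the list above (keep a node on one line if it fits, otherwise force it and everything containing it to a multi-line form) can be sketched on a toy AST. Everything here is hypothetical: the `Node` type, its two variants and the 4-space indent are stand-ins for illustration, not the shape of any existing demangler's AST.

```rust
/// Toy demangled-name AST: a plain path, or a head with generic arguments.
pub enum Node {
    Leaf(String),
    Generic { head: String, args: Vec<Node> },
}

impl Node {
    /// Single-line rendering, e.g. `Foo<Bar, Baz>`.
    pub fn flat(&self) -> String {
        match self {
            Node::Leaf(s) => s.clone(),
            Node::Generic { head, args } => {
                let inner: Vec<String> = args.iter().map(Node::flat).collect();
                format!("{}<{}>", head, inner.join(", "))
            }
        }
    }

    /// Uses the flat form if it fits within `width` columns starting at
    /// column `indent`; otherwise puts each argument on its own line.
    pub fn pretty(&self, width: usize, indent: usize) -> String {
        let flat = self.flat();
        if indent + flat.len() <= width {
            return flat;
        }
        match self {
            // A leaf cannot be broken up any further.
            Node::Leaf(s) => s.clone(),
            Node::Generic { head, args } => {
                let pad = " ".repeat(indent + 4);
                let mut out = format!("{}<\n", head);
                for arg in args {
                    out.push_str(&pad);
                    out.push_str(&arg.pretty(width, indent + 4));
                    out.push_str(",\n");
                }
                out.push_str(&" ".repeat(indent));
                out.push('>');
                out
            }
        }
    }
}
```

Because a parent's flat form contains its children's flat forms, a child that does not fit automatically forces every enclosing node into the multi-line layout, which matches the behaviour described in the list item.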

Anyway, all that said, I'm going to duplicate some screenshots here as well:

HTML+CSS attempt at selective emphasis (using a manually annotated demangling, with - standing in for <small> and + for </small>)
cursed jq contraption being applied to a longer Rust symbol (this one's automated; also works for C++)

EFanZh commented 2 years ago

@eddyb I have written a demangler that can demangle the latest v0 syntax symbol into a structured AST: https://github.com/EFanZh/ast-demangle/.

eddyb commented 2 years ago

@eddyb I have written a demangler that can demangle the latest v0 syntax symbol into a structured AST: https://github.com/EFanZh/ast-demangle/.

Oh, my bad @EFanZh, now that I'm looking at it, I'm pretty sure I've seen it before and just forgot :(

pnkfelix commented 2 years ago

Discussed in T-compiler backlog bonanza

The v0 symbol mangling has been implemented. From https://github.com/rust-lang/rust/pull/89917 we have considered making v0 the default, but we have held off on doing so in order to give external tools time to add support. In PR #90054 we did make v0 the default for builds of rustc itself (but not object code generated by rustc on other programs).

We need to figure out what criteria we will use in this and other cases to decide that "it is time" to switch the defaults.

(We also considered opening a separate tracking issue for the question of "when to switch the default", but at this point I think we would only open such a tracking issue if we were ready to close this one, #60705, itself.)

@rustbot label: S-tracking-needs-to-bake

bstrie commented 1 year ago

we have held off on doing so in order to give external tools time to add support

We need to figure out what criteria we will use in this and other cases to decide that "it is time" to switch the defaults.

The first thing to do would be to produce a list of tools that people want to support. For each tool, we should determine whether it supports v0, and, if so, the date of the first public release that features v0 support. Once each tool supports v0, and once each has supported v0 for long enough (precise criteria TBD), then stabilization should be unblocked.

Obviously this list cannot guarantee that it will exhaustively mention every tool ever made, but the only alternative would be to never stabilize v0 for fear of overlooking some tool. In the meantime, we can use a blog post to put out a general call to tool developers to ask them to ensure that v0 works with their tools.

jyn514 commented 1 year ago

The first thing to do would be to produce a list of tools that people want to support. For each tool, we should determine whether it supports v0, and, if so, the date of the first public release that features v0 support. Once each tool supports v0, and once each has supported v0 for long enough (precise criteria TBD), then stabilization should be unblocked.

Obviously this list cannot guarantee that it will exhaustively mention every tool ever made, but the only alternative would be to never stabilize v0 for fear of overlooking some tool. In the meantime, we can use a blog post to put out a general call to tool developers to ask them to ensure that v0 works with their tools.

Nominating to hopefully act as a forcing function to create this list.

nnethercote commented 1 year ago

One problem with v0 mangling that hasn't been identified: it completely breaks the cargo llvm-lines tool. Here is example output with legacy mangling:

  Lines                 Copies              Function name
  -----                 ------              -------------
  134295                3225                (TOTAL)
    6102 (4.5%,  4.5%)    18 (0.6%,  0.6%)  alloc::raw_vec::RawVec<T,A>::grow_amortized
    2641 (2.0%,  6.5%)    64 (2.0%,  2.5%)  core::option::Option<T>::map
    2329 (1.7%,  8.2%)    17 (0.5%,  3.1%)  <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::next
    1716 (1.3%,  9.5%)    11 (0.3%,  3.4%)  alloc::raw_vec::RawVec<T,A>::allocate_in
    1694 (1.3%, 10.8%)    15 (0.5%,  3.9%)  alloc::alloc::box_free
    1476 (1.1%, 11.9%)    18 (0.6%,  4.4%)  alloc::raw_vec::RawVec<T,A>::current_memory
    1461 (1.1%, 13.0%)     3 (0.1%,  4.5%)  hashbrown::raw::RawTable<T,A>::reserve_rehash
    1456 (1.1%, 14.1%)    16 (0.5%,  5.0%)  core::slice::iter::Iter<T>::new
    1249 (0.9%, 15.0%)     8 (0.2%,  5.3%)  <T as alloc::slice::hack::ConvertVec>::to_vec
    1065 (0.8%, 15.8%)     5 (0.2%,  5.4%)  aho_corasick::automaton::Automaton::leftmost_find_at_no_state_imp

And with v0 mangling:

  Lines                 Copies              Function name
  -----                 ------              -------------
  134295                3225                (TOTAL)
     960 (0.7%,  0.7%)     1 (0.0%,  0.0%)  <regex[455e3194582446bb]::prog::Program as core[d1a89b04220dd38d]::fmt::Debug>::fmt
     722 (0.5%,  1.3%)     1 (0.0%,  0.1%)  <regex[455e3194582446bb]::exec::ExecBuilder>::build
     544 (0.4%,  1.7%)     1 (0.0%,  0.1%)  <regex[455e3194582446bb]::dfa::Fsm>::exec_at
     497 (0.4%,  2.0%)     1 (0.0%,  0.1%)  <regex[455e3194582446bb]::compile::Compiler>::compile_many
     494 (0.4%,  2.4%)     1 (0.0%,  0.2%)  <aho_corasick[afd2d59d996825a5]::nfa::NFA<u32> as core[d1a89b04220dd38d]::fmt::Debug>::fmt
     487 (0.4%,  2.8%)     1 (0.0%,  0.2%)  <hashbrown[18cdbe82094945b3]::raw::RawTable<(&usize, &alloc[c687d6376d1d0c58]::string::String)>>::reserve_rehash::<hashbrown[18cdbe82094945b3]::map::make_hasher<&usize, &usize, &alloc[c687d6376d1d0c58]::string::String, std[e45faeee946555a1]::collections::hash::map::RandomState>::{closure#0}>
     487 (0.4%,  3.1%)     1 (0.0%,  0.2%)  <hashbrown[18cdbe82094945b3]::raw::RawTable<(alloc[c687d6376d1d0c58]::string::String, usize)>>::reserve_rehash::<hashbrown[18cdbe82094945b3]::map::make_hasher<alloc[c687d6376d1d0c58]::string::String, alloc[c687d6376d1d0c58]::string::String, usize, std[e45faeee946555a1]::collections::hash::map::RandomState>::{closure#0}>
     487 (0.4%,  3.5%)     1 (0.0%,  0.2%)  <hashbrown[18cdbe82094945b3]::raw::RawTable<(regex[455e3194582446bb]::dfa::State, u32)>>::reserve_rehash::<hashbrown[18cdbe82094945b3]::map::make_hasher<regex[455e3194582446bb]::dfa::State, regex[455e3194582446bb]::dfa::State, u32, std[e45faeee946555a1]::collections::hash::map::RandomState>::{closure#0}>
     456 (0.3%,  3.8%)     1 (0.0%,  0.3%)  <regex[455e3194582446bb]::compile::Compiler>::c_alternate
     433 (0.3%,  4.1%)     1 (0.0%,  0.3%)  <alloc[c687d6376d1d0c58]::alloc::Global as core[d1a89b04220dd38d]::alloc::Allocator>::shrink

Note the difference in the copies column. cargo llvm-lines depends entirely on the type imprecision of legacy mangling. We go from having N different functions with the same name being combined, to every function being separate. E.g. with legacy mangling all the grow_amortized instances end up in the same bucket, while with v0 mangling they look like this:

     339 (0.3%,  6.8%)     1 (0.0%,  0.6%)  <alloc[c687d6376d1d0c58]::raw_vec::RawVec<(char, char)>>::grow_amortized
     339 (0.3%,  7.0%)     1 (0.0%,  0.6%)  <alloc[c687d6376d1d0c58]::raw_vec::RawVec<(u8, u32)>>::grow_amortized
     339 (0.3%,  7.3%)     1 (0.0%,  0.7%)  <alloc[c687d6376d1d0c58]::raw_vec::RawVec<(usize, usize)>>::grow_amortized

This is probably a case where cargo llvm-lines needs to change, rather than v0 mangling, but I thought it worth mentioning.
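One way a tool could recover the old bucketing from v0 names is to normalize each demangled name before counting. The sketch below is only an illustration, not how cargo llvm-lines works: `bucket_key` is a hypothetical helper that strips the `[hash]` crate disambiguators and replaces every generic-argument list with `<_>`. It assumes the demangled format shown in the table above, where a `<` following an identifier or `::` opens a generic-argument list.

```python
import re
from collections import Counter

# Crate disambiguators appear as lowercase hex in square brackets, e.g. alloc[c687d6376d1d0c58].
_DISAMBIGUATOR = re.compile(r"\[[0-9a-f]+\]")


def bucket_key(demangled: str) -> str:
    """Hypothetical helper: map a v0-demangled name to a type-imprecise bucket key."""
    name = _DISAMBIGUATOR.sub("", demangled)
    out = []
    depth = 0  # nesting depth inside the generic-argument list being erased
    for i, ch in enumerate(name):
        prev = name[i - 1] if i > 0 else " "
        if depth > 0:
            if ch == "<":
                depth += 1
            elif ch == ">" and prev != "-":  # ignore the '>' of '->' in fn types
                depth -= 1
                if depth == 0:
                    out.append("_>")
            continue
        # '<' after an identifier character or '::' starts generic arguments;
        # any other '<' is the wrapper of a qualified path such as <T as Trait>.
        if ch == "<" and (prev.isalnum() or prev in "_:"):
            depth = 1
            out.append("<")
        else:
            out.append(ch)
    return "".join(out)


names = [
    "<alloc[c687d6376d1d0c58]::raw_vec::RawVec<(char, char)>>::grow_amortized",
    "<alloc[c687d6376d1d0c58]::raw_vec::RawVec<(u8, u32)>>::grow_amortized",
    "<alloc[c687d6376d1d0c58]::raw_vec::RawVec<(usize, usize)>>::grow_amortized",
]
copies = Counter(bucket_key(n) for n in names)
```

With this normalization the three grow_amortized instances above share a single key, so the copies column could be computed per bucket again while the precise names stay available for display.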

cc @dtolnay

sanmai-NL commented 1 year ago

Is there, or will there be, an official Rust name mangling library (functionality), rather than just demangling? Sometimes, one needs to mangle Rust item paths to look into binaries, e.g. like perf does. I hope there will be a reference implementation of the specification.

bjorn3 commented 1 year ago

Why would perf need to mangle names? There is no way to exactly reproduce symbol names outside of rustc itself given that they contain a crate disambiguator whose value depends on the -Cmetadata arguments passed when compiling the crate that defined the mentioned function/type (which for the standard library is unknown) as well as the exact rustc version used. Even two consecutive nightly releases will produce different symbol names.

sanmai-NL commented 1 year ago

Please re-read my sentence @bjorn3. I'm not claiming perf mangles names.

sanmai-NL commented 1 year ago

@bjorn3 Thanks for your explanation. I hope you didn't assume every commenter should know these details. I think the question is legitimate. It was asked before in the context of GCC C++. The information required I have available, but that's not important now. Even if just the algorithm were to be specified like in the GCC case, perhaps enough of the translation symbol to mangled symbol can be reconstructed to find the specific symbol in a binary for a given item path. That's my use case but I don't assume this would be the only solution or the only use case for a mangling spec or reference implementation.

bjorn3 commented 1 year ago

perhaps enough of the translation symbol to mangled symbol can be reconstructed to find the specific symbol in a binary for a given item path. That's my use case but I don't assume this would be the only solution or the only use case for a mangling spec or reference implementation.

It should be possible to have something like an api where you specify in the input a wildcard for the crate disambiguator and then the name mangling library would output a wildcard where it would otherwise print the crate disambiguator. Would this work for your use case?
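A minimal sketch of what such a wildcard API could look like, assuming the v0 grammar from RFC 2603 (`_R` prefix, `Nv`/`Nt` nested paths for value and type namespaces, `C` crate roots, length-prefixed identifiers, and an optional `s<base-62>_` disambiguator). The function names are hypothetical, and the sketch only covers plain ASCII `crate::module::function` paths: no generics, impls, closures, Punycode, or backreferences.

```python
import re


def v0_identifier(name: str) -> str:
    """Encode an ASCII identifier as <decimal length>[_]<bytes> (hypothetical helper)."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsupported identifier: {name!r}")
    # Per the RFC grammar, '_' separates the length from names starting with '_' or a digit.
    sep = "_" if name[0] == "_" or name[0].isdigit() else ""
    return f"{len(name)}{sep}{name}"


def v0_function_pattern(path: str) -> re.Pattern:
    """Build a regex for the mangled prefix of a free function, with the
    crate disambiguator left as a wildcard (hypothetical helper)."""
    crate, *modules, func = path.split("::")
    # Crate root: 'C', optional disambiguator (wildcarded), then the crate name.
    inner = r"C(?:s[0-9A-Za-z]*_)?" + re.escape(v0_identifier(crate))
    for module in modules:
        # Modules live in the type namespace: 'Nt' <parent path> <identifier>.
        inner = "Nt" + inner + re.escape(v0_identifier(module))
    # Functions live in the value namespace: 'Nv' <parent path> <identifier>.
    return re.compile("_R" + "Nv" + inner + re.escape(v0_identifier(func)))


pattern = v0_function_pattern("mycrate::tests::test_1")
# Illustrative symbol with a made-up disambiguator, not taken from a real binary.
matched = pattern.match("_RNvNtCs1234_7mycrate5tests6test_1") is not None
```

The pattern is meant to be matched against the start of each symbol in a binary, because real symbols may carry additional trailing components that this sketch does not model.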

CAD97 commented 1 year ago

I hope there will be a reference implementation of the specification.

A reference implementation does already exist, essentially, as part of rustc. That it's not a reusable library just reflects that the goal of a known mangling scheme is the ability for 3rd party non-rustc tooling to be able to turn mangled symbols back into the demangled human-meaningful form. Being able to mangle symbols is explicitly a non-goal.

Sometimes, one needs to mangle Rust item paths to look into binaries, e.g. like perf does.

For binary introspection, demangling is sufficient. Given an unmangled name, to find the corresponding mangled names[^s], you don't mangle the unmangled name to compare to the mangled symbols; instead, you demangle the symbols from the binary to compare to the unmangled symbol. Most of the time you'll want the full list of demangled symbols anyway, e.g. for display or otherwise.

[^s]: Names, plural; multiple crates with the same name will have symbols which collide when unmangled and are disambiguated with the crate disambiguator.
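The lookup described above can be sketched as an inverse index built from demangled symbols. This is an illustration under assumptions: the `(mangled, demangled)` pairs are expected to come from an existing demangler, `index_by_unmangled` is a hypothetical helper, and the sample symbols and hashes are made up.

```python
import re
from collections import defaultdict

# Crate disambiguators are shown as lowercase hex in square brackets after the crate name.
_DISAMBIGUATOR = re.compile(r"\[[0-9a-f]+\]")


def index_by_unmangled(pairs):
    """Map each unmangled name (disambiguators removed) to all mangled symbols
    that demangle to it (hypothetical helper)."""
    index = defaultdict(list)
    for mangled, demangled in pairs:
        index[_DISAMBIGUATOR.sub("", demangled)].append(mangled)
    return dict(index)


# Made-up example: two crates named `mycrate` that differ only in their disambiguator.
pairs = [
    ("_RNvNtCs1234_7mycrate3foo3bar", "mycrate[1111111111111111]::foo::bar"),
    ("_RNvNtCs5678_7mycrate3foo3bar", "mycrate[2222222222222222]::foo::bar"),
]
index = index_by_unmangled(pairs)
candidates = index["mycrate::foo::bar"]
```

Because names are plural, a lookup returns a list, and the caller has to decide how to pick among candidates that differ only in the crate disambiguator.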

If you want fully predictable names (e.g. for manually linking ABI-stable interfaces), you should be specifying them explicitly. It would be interesting to be able to request v0 mangling (without the use of disambiguators) rather than having to manually apply a mangling scheme, but that's a completely separate feature request from the use for Rust-only names tracked here.

sanmai-NL commented 1 year ago

@CAD97

When you need to step back to the same binary you demangled symbols of, and determine to what mangled symbol a demangled name refers, then you may want this functionality. Please also consider that having the binaries, and even a build pipeline including source code, does not mean you are free to modify the source code to achieve predictable symbol names or whatever.

By the way, it's a bit of a semantic discussion what demangling entails, in response to my functional requirement at least. Third party tooling like perf may only demangle in a strict sense, but could be considered to mangle a given name that exists as a mangled symbol in the binary:

perf \
  probe \
    --exec $(realpath "mycrate/target/debug/deps/binary-cfcd9bd03ac152c2") \
    --add="uprobe123=mycrate\:\:tests\:\:test_1"

perf demangles the symbols and then matches them against the unmangled name specified as the --add argument. So if perf or a similar tool were to keep a mapping between the two and report that back, that would work for my particular use case as well. This procedure may not amount to demangling in general, but it would cover some use cases without Rust people having to work on it.

sanmai-NL commented 1 year ago

It should be possible to have something like an api where you specify in the input a wildcard for the crate disambiguator and then the name mangling library would output a wildcard where it would otherwise print the crate disambiguator. Would this work for your use case?

Yes, sure. And perhaps there are other, forensic cases and such. Please note, I'm not an expert or involved in this mangling work here or elsewhere, just chiming in as a user with a practical use case that I think will be relevant to a subgroup of real-world developers (not detailing it since it's part of a paper to be published).

benpye commented 1 year ago

Now that RFC rust-lang/rfcs#3224 has been merged to resolve the suffix question, are there any implementations that still need updating in order to handle . and $ suffixes? If not, shall we revive #89917 for making v0 the default?

As far as I can tell gdb does not support suffixes using $ rather than .. It's also somewhat unfortunate that GDB strips the suffix, rather than including it in the demangled string - but that's not unbearable.

bstrie commented 11 months ago

@benpye Given that the only documented use of $ suffixes in the wild is for thread-local data on Mach-O, I don't think it's a showstopper for shipping this as the default.

I wonder if the compiler team would like to use the upcoming 2024 edition as an excuse to finally ship v0 mangling? This would let us roll it out gradually and in a way that can be easily rolled back by users, and since it's an implementation detail we could make it the default for all editions someday in the future if we really wanted to.
