Closed clytras closed 7 years ago
I was able to confirm this in a clean centos:6.8 docker image, so I doubt it's something specific to your system. This segfault also goes all the way back to Rust 1.0, so I don't think it's a regression anywhere along the way. I was unfortunately unable to get any good error messages out of it or pinpoint a cause, but TLS destructors definitely seem related somehow to what's going on here.
I am seeing this for 1.15.1 as well as for 1.16 (beta) on SL 6.8.
This is the only function I see being registered for running at exit:
(gdb) c
Continuing.
Breakpoint 3, __cxa_atexit (func=0xb7fef590 <_dl_fini>, arg=0x0, d=0x0) at cxa_atexit.c:58
58 return __internal_atexit (func, arg, d, &__exit_funcs);
(gdb) p func
$16 = (void (*)(void *)) 0xb7fef590 <_dl_fini>
(gdb) p arg
$17 = (void *) 0x0
(gdb) p d
$18 = (void *) 0x0
(gdb) p __exit_funcs
$19 = (struct exit_function_list *) 0xb7f8b160
But when I continue, the SIGSEGV seems to occur elsewhere.
(gdb) c
Continuing.
Hello, world!
Breakpoint 1, exit (status=0) at exit.c:99
99 {
(gdb) list
94 }
95
96
97 void
98 exit (int status)
99 {
100 __run_exit_handlers (status, &__exit_funcs, true);
101 }
102 libc_hidden_def (exit)
(gdb) step
100 __run_exit_handlers (status, &__exit_funcs, true);
(gdb) p __exit_funcs
$20 = (struct exit_function_list *) 0xb7f8b160
(gdb) p (struct exit_function_list) *__exit_funcs
$21 = exit_function_list = {
next = 0x0,
idx = 1,
fns = {exit_function = {flavor = 4, func = {at = 0x8bf6ba11, on = Traceback (most recent call last):
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/gdb_rust_pretty_printing.py", line 100, in rust_pretty_printer_lookup_function
type_kind = val.type.get_type_kind()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 122, in get_type_kind
self.__type_kind = self.__classify_struct()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 143, in __classify_struct
if (unqualified_type_name.startswith(("&[", "&mut [")) and
AttributeError: 'NoneType' object has no attribute 'startswith'
{fn = 0x8bf6ba11, arg = 0x0}, cxa = Traceback (most recent call last):
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/gdb_rust_pretty_printing.py", line 100, in rust_pretty_printer_lookup_function
type_kind = val.type.get_type_kind()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 122, in get_type_kind
self.__type_kind = self.__classify_struct()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 143, in __classify_struct
if (unqualified_type_name.startswith(("&[", "&mut [")) and
AttributeError: 'NoneType' object has no attribute 'startswith'
{fn = 0x8bf6ba11, arg = 0x0, dso_handle = 0x0}}}, exit_function = {flavor = 0, func = {at = 0, on = Traceback (most recent call last):
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/gdb_rust_pretty_printing.py", line 100, in rust_pretty_printer_lookup_function
type_kind = val.type.get_type_kind()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 122, in get_type_kind
self.__type_kind = self.__classify_struct()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 143, in __classify_struct
if (unqualified_type_name.startswith(("&[", "&mut [")) and
AttributeError: 'NoneType' object has no attribute 'startswith'
{fn = 0, arg = 0x0}, cxa = Traceback (most recent call last):
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/gdb_rust_pretty_printing.py", line 100, in rust_pretty_printer_lookup_function
type_kind = val.type.get_type_kind()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 122, in get_type_kind
self.__type_kind = self.__classify_struct()
File "/usr/local/vpw/rust/rust-beta/lib/rustlib/etc/debugger_pretty_printers_common.py", line 143, in __classify_struct
if (unqualified_type_name.startswith(("&[", "&mut [")) and
AttributeError: 'NoneType' object has no attribute 'startswith'
{fn = 0, arg = 0x0, dso_handle = 0x0}}} <repeats 31 times>}
}
(gdb) step
__run_exit_handlers (status=0) at exit.c:42
42 while (*listp != NULL)
<snip>
(gdb) list
73 case ef_cxa:
74 cxafct = f->func.cxa.fn;
75 #ifdef PTR_DEMANGLE
76 PTR_DEMANGLE (cxafct);
77 #endif
78 cxafct (f->func.cxa.arg, status);
79 break;
80 }
81 }
(gdb) p f->func.cxa.fn
$27 = (void (*)(void *, int)) 0x8bf6ba11
(gdb) p f->func.cxa.arg
$28 = (void *) 0x0
(gdb) step
78 cxafct (f->func.cxa.arg, status);
(gdb)
Program received signal SIGSEGV, Segmentation fault.
0x08c5fb5d in ?? ()
Looking a bit further, I get here
0xb7e25e80 <+192>: shl $0x4,%eax
0xb7e25e83 <+195>: lea (%esi,%eax,1),%eax
0xb7e25e86 <+198>: mov 0xc(%eax),%edx
0xb7e25e89 <+201>: mov %edi,0x4(%esp)
0xb7e25e8d <+205>: mov 0x10(%eax),%eax
0xb7e25e90 <+208>: ror $0x9,%edx
0xb7e25e93 <+211>: xor %gs:0x18,%edx
0xb7e25e9a <+218>: mov %eax,(%esp)
=> 0xb7e25e9d <+221>: call *%edx
0xb7e25e9f <+223>: jmp 0xb7e25de8 <exit+40>
0xb7e25ea4 <+228>: nopl 0x0(%eax)
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) info registers
eax 0x0 0
ecx 0xb7f8c168 -1208434328
edx 0x2f01c865 788645989
ebx 0xb7f8aff4 -1208438796
esp 0xbffff600 0xbffff600
ebp 0xbffff638 0xbffff638
esi 0xb7f8c160 -1208434336
edi 0x0 0
eip 0xb7e25e9d 0xb7e25e9d <exit+221>
eflags 0x206 [ PF IF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
(gdb) x 0x2f01c865
0x2f01c865: Cannot access memory at address 0x2f01c865
In none of my runs does the demangled edx value match the original value again.
The original value gets mangled as follows:
=> 0xb7e260f9 <+41>: mov 0x8(%ebp),%edx
0xb7e260fc <+44>: xor %gs:0x18,%edx
0xb7e26103 <+51>: rol $0x9,%ed
Turning
(gdb) p func
$24 = (void (*)(void *)) 0xb7fef590 <_dl_fini>
into
(gdb) p func
$25 = (void (*)(void *)) 0x36834987
At the end, we find the same value again:
(gdb) step
46 while (cur->idx > 0)
(gdb)
78 cxafct (f->func.cxa.arg, status);
(gdb)
46 while (cur->idx > 0)
(gdb)
50 switch (f->flavor)
(gdb)
49 &cur->fns[--cur->idx];
(gdb)
50 switch (f->flavor)
(gdb)
74 cxafct = f->func.cxa.fn;
(gdb) p f->func.cxa.fn
$28 = (void (*)(void *, int)) 0x36834987
Moving through the mangling:
=> 0xb7e25e80 <+192>: shl $0x4,%eax
0xb7e25e83 <+195>: lea (%esi,%eax,1),%eax
0xb7e25e86 <+198>: mov 0xc(%eax),%edx
0xb7e25e89 <+201>: mov %edi,0x4(%esp)
0xb7e25e8d <+205>: mov 0x10(%eax),%eax
0xb7e25e90 <+208>: ror $0x9,%edx
0xb7e25e93 <+211>: xor %gs:0x18,%edx
0xb7e25e9a <+218>: mov %eax,(%esp)
0xb7e25e9d <+221>: call *%edx
And then we get ...
=> 0xb7e25e9a <+218>: mov %eax,(%esp)
0xb7e25e9d <+221>: call *%edx
0xb7e25e9f <+223>: jmp 0xb7e25de8 <exit+40>
0xb7e25ea4 <+228>: nopl 0x0(%eax)
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) info registers
eax 0x0 0
ecx 0xb7f8c168 -1208434328
edx 0xc39b41a4 -1013235292
ebx 0xb7f8aff4 -1208438796
esp 0xbffff600 0xbffff600
ebp 0xbffff638 0xbffff638
esi 0xb7f8c160 -1208434336
edi 0x0 0
eip 0xb7e25e9a 0xb7e25e9a <exit+218>
eflags 0x282 [ SF IF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
So we are now about to call the function at 0xc39b41a4, which is not at all the original target from before mangling.
So did something mess with the guard value?
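The arithmetic behind that question can be checked directly. The following is a minimal sketch, assuming the mangling is exactly what the disassembly above shows (xor with the guard at %gs:0x18, then rol by 9, and the reverse on the way out). The function and constant names are made up for illustration; only the hex values come from the gdb sessions above.

```rust
// Values copied from the gdb sessions above.
const DL_FINI: u32 = 0xb7fe_f590; // func passed to __cxa_atexit
const STORED: u32 = 0x3683_4987; // func after mangling, as stored in __exit_funcs
const CALL_TARGET: u32 = 0xc39b_41a4; // edx just before `call *%edx`
const STORED_FIRST_RUN: u32 = 0x8bf6_ba11; // f->func.cxa.fn in the earlier session
const FAULT_FIRST_RUN: u32 = 0x08c5_fb5d; // SIGSEGV address in the earlier session

/// Mangle as in the registration path: xor with the guard, then rotate left by 9.
fn ptr_mangle(ptr: u32, guard: u32) -> u32 {
    (ptr ^ guard).rotate_left(9)
}

/// Demangle as in the exit path: rotate right by 9, then xor with the guard.
fn ptr_demangle(mangled: u32, guard: u32) -> u32 {
    mangled.rotate_right(9) ^ guard
}

/// Work out which guard value turns `ptr` into `mangled`.
fn recover_guard(ptr: u32, mangled: u32) -> u32 {
    mangled.rotate_right(9) ^ ptr
}

/// Work out which guard value the exit path must have read to produce `target`.
fn implied_guard_at_exit(mangled: u32, target: u32) -> u32 {
    mangled.rotate_right(9) ^ target
}
```

Fed with the numbers above, `recover_guard(DL_FINI, STORED)` gives the guard in effect at registration time, while `implied_guard_at_exit(STORED, CALL_TARGET)` comes out as zero. The earlier session fits the same pattern: rotating 0x8bf6ba11 right by 9 gives 0x08c5fb5d, the faulting address. Taken at face value, that would mean %gs:0x18 read as zero by the time exit ran; that is an inference from the pasted values, not something verified separately.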
If it's any help, a program with an empty main gives this:
# RUST_BACKTRACE=1 ./test-empty
thread '<unnamed>' panicked at 'cannot access a TLS value during or after it is destroyed', /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libcore/option.rs:715
stack backtrace:
1: 0xfa37f5 - std::sys::imp::backtrace::tracing::imp::write::h23bcdb89e70c5bbf
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
2: 0xfa54dd - std::panicking::default_hook::{{closure}}::he7b82439fd2d2bb6
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/panicking.rs:351
3: 0xfa5089 - std::panicking::default_hook::he1cd4269c1558f23
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/panicking.rs:367
4: 0xfa58d9 - std::panicking::rust_panic_with_hook::h006b37e36b7c8982
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/panicking.rs:555
5: 0xfa572c - std::panicking::begin_panic::h043cddfdd3933cc4
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/panicking.rs:517
6: 0xfa56a3 - std::panicking::begin_panic_fmt::h34e588bba6b8a2c2
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/panicking.rs:501
7: 0xfa5616 - rust_begin_unwind
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/panicking.rs:477
8: 0xfcd15f - core::panicking::panic_fmt::he52644573ecd78ff
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libcore/panicking.rs:69
9: 0xfcd204 - core::option::expect_failed::h64a81ddcb7418e4e
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libcore/option.rs:715
10: 0xfa33a0 - std::sys_common::thread_info::set::h80f3df40e0ba4361
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libcore/option.rs:293
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/thread/local.rs:251
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/sys_common/thread_info.rs:51
11: 0xfa5c5e - std::rt::lang_start::h1ef940195e3c010e
at /buildslave/rust-buildbot/slave/beta-dist-rustc-linux/build/src/libstd/rt.rs:51
12: 0xf9efa0 - main
13: 0x1d1d25 - __libc_start_main
14: 0xf9ee80 - <unknown>
fatal runtime error: failed to initiate panic, error 5
Aborted
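For context on that panic message: it is what std reports when a thread-local key is accessed while its destructor is running or after it has run. The sketch below is not a reproduction of this bug; it only shows the healthy version of that condition on a spawned thread, using `LocalKey::try_with` so the failure is returned as an error instead of a panic. `Probe`, `PROBE`, and `access_denied_during_destruction` are hypothetical names chosen for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Records what the destructor observed, so it can be read after the thread ends.
static SAW_ACCESS_ERROR: AtomicBool = AtomicBool::new(false);

struct Probe;

impl Drop for Probe {
    fn drop(&mut self) {
        // While this destructor runs, the key itself is no longer accessible:
        // `try_with` returns Err(AccessError) where `with` would panic.
        let denied = PROBE.try_with(|_| ()).is_err();
        SAW_ACCESS_ERROR.store(denied, Ordering::SeqCst);
    }
}

thread_local! {
    static PROBE: Probe = Probe;
}

/// Spawns a thread that touches the thread-local, waits for it to exit,
/// and reports whether the destructor was refused access to its own key.
fn access_denied_during_destruction() -> bool {
    thread::spawn(|| {
        // First access initializes the value and registers its destructor.
        PROBE.with(|_| ());
    })
    .join()
    .expect("worker thread panicked");
    SAW_ACCESS_ERROR.load(Ordering::SeqCst)
}
```

Calling `access_denied_during_destruction()` from `main` on a working Linux system should return `true`. In the backtrace above the same check fails inside `thread_info::set` during startup, before any user code has run, which suggests that the thread-local state looked already destroyed from the very beginning.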
I do notice the issue does not occur when creating a binary for i686-unknown-linux-gnu on an x86_64-unknown-linux-gnu platform using the --target option.
Thanks for the investigation @itkovian!
I wonder if there's a stray write causing this problem? Or maybe it's a runtime bug we're running into?
I just discovered this issue myself, ouch! I wasn't even doing hello, just fn main(){}.
I found that the binary compiled on RHEL6 won't run on Fedora either, so it's not just runtime libraries. And the binary compiled on Fedora can't run on RHEL6 because it has a versioned GLIBC_2.18 symbol. But that symbol seems relevant to what @itkovian was looking at:
Fedora 24 i686:
# echo 'fn main(){}' | rustc - -o true
# nm ./true | grep atexit
00034820 t atexit
U __cxa_atexit@@GLIBC_2.1.3
w __cxa_thread_atexit_impl@@GLIBC_2.18
00016ac0 t stats_print_atexit
# ./true
# echo $?
0
RHEL6 i686:
# echo 'fn main(){}' | rustc - -o true
# nm ./true | grep atexit
U __cxa_atexit@@GLIBC_2.1.3
w __cxa_thread_atexit_impl
00034f20 t atexit
000171c0 t stats_print_atexit
# ./true
thread '<unnamed>' panicked at 'cannot access a TLS value during or after it is destroyed', /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/option.rs:715
note: Run with `RUST_BACKTRACE=1` for a backtrace.
fatal runtime error: failed to initiate panic, error 5
Aborted
I do notice the issue does not occur when creating a binary for i686-unknown-linux-gnu on an x86_64-unknown-linux-gnu platform using the --target option.
Was your x86_64 host also EL6? I do still see the issue when cross-compiling this way. But the native x86_64 target is fine.
Ah! I forgot to check that, @cuviper. It's CentOS 7.3, so the answer would be no.
Actually, that's interesting. EL7 has glibc-2.17, so its result doesn't have that GLIBC_2.18 dependency, and in fact that i686 binary runs just fine on EL6 i686! Hmm, now I want to try EL5 too, and I'll see if I can get our other toolchain folks to help me look at this.
now I want to try EL5 too
This works fine!
I believe it is this binutils bug: https://sourceware.org/bugzilla/show_bug.cgi?id=12654
@cuviper wow, that must have been quite the investigation! Does that mean that if you re-link the binary without -pie, it works?
If that's the cause here then I'm not sure what we can do about it :(
Yeah, it was fun. Kudos to @fweimer for identifying the actual fix. Here's the RHEL6 bug: https://bugzilla.redhat.com/show_bug.cgi?id=1427285
So yeah, linking without -pie works, but I'm not sure of a clean way to get rustc to do that. I used -Csave-temps -Zprint-link-args and then removed the -pie manually. I tried -Crelocation-model=static, but that doesn't seem to override the target options here, which may be a bug in itself.
There's probably some way to get rustc to use your own fixed ld too, until the system ld can be fixed.
Ok in that case I think I'll go ahead and close. Thanks though for the investigation @cuviper!
FYI, the fix is in binutils-2.20.51.0.2-5.47.el6_9.1 (errata), and it looks like CentOS has shipped it too.
Thanks for the update @cuviper!
After installing Rust, I tried the sample "Hello World!" program. It compiles just fine, but when I run it, it segfaults after printing the "Hello World!" output.
This happens on a local server running CentOS 6.8 x86. The issue does not occur on CentOS 6.8 x86_64. I am not sure whether this has something to do with my OS installation or whether it's related to Rust.