TLS destructors are not run on Library::drop resulting in illegal instruction on OS X

benanders commented 8 years ago

Hi, thanks for making this library, it's really useful to me.

Unfortunately, when trying out a really simple use case, I get an Illegal Hardware Instruction error. The Rust code I'm using to load the dylib is:

extern crate libloading;

use libloading::{Library, Symbol};

fn main() {
    let lib = Library::new("../test/target/release/libtest.dylib").unwrap();
    let sym: Symbol<extern fn() -> ()> = unsafe {lib.get(b"testing")}.unwrap();
    sym();
}

The dylib I'm loading contains a single function:

#[no_mangle]
pub fn testing() {
    println!("YES!");
}

The dylib's Cargo.toml file contains the needed crate-type = ["dylib"] qualifier.

I wrote some equivalent C code (loading the exact same Rust library from above), which works perfectly fine (no errors):

#include <stdio.h>
#include <dlfcn.h>

int main(int argc, char *argv[]) {
    void *lib = dlopen("../test/target/release/libtest.dylib", RTLD_LAZY);
    void (*sym)(void) = dlsym(lib, "testing");
    sym();
}

Any ideas why this might be happening? The illegal hardware instruction occurs after the main Rust function exits (I can place a print at the end of the main function and it'll be run, then the error will occur). I'm on OSX 10.11.2, using the most recent stable rust (rustc 1.5.0 (3d7cd77e4 2015-12-04) ).

I narrowed it down to the Drop function on the Library struct. If I comment its contents out, then the error doesn't happen. I also replaced the Drop function with just a single call to dlclose like so:

    fn drop(&mut self) {
        println!("{}", unsafe { dlclose(self.handle) });
    }

Which prints 0 (meaning the close function didn't return an error), which is weird.

benanders commented 8 years ago

I also just tried this on the latest nightly rustc 1.7.0-nightly (110df043b 2015-12-13) and I have the same problem in both release and debug mode (with and without the --release flag for cargo).

nagisa commented 8 years ago

Interesting. I cannot reproduce the error on Linux and do not have an OS X machine to test this on, so I can’t really help debug this beyond offering general tips for this kind of problem.

An illegal instruction in Rust usually comes from the ud2 instruction, which the compiler emits in certain cases and wherever intrinsics::unreachable is used. It may also be caused by missing panic handling setup: unwinding across an FFI boundary is illegal in Rust (but since caller and callee are both Rust, I can’t imagine this being the problem).

If you’re interested in tracking down and fixing the issue, please do so! Otherwise I’ll just keep the issue open for a while so other people could find this if they hit it as well (on stable or not).

nagisa commented 8 years ago

Good places to start tracking down the issue would be a stack trace at the time of the hardware fault and the disassembly around the invalid instruction, I guess.

benanders commented 8 years ago

Honestly, I'd have no idea where to start debugging something like this. I'm relatively inexperienced with low-level stuff, but I'd like to try to get the issue resolved. Can you make any sense of this backtrace from GDB?

Program received signal SIGSEGV, Segmentation fault.
0x0000000101416630 in ?? ()
(gdb) bt
#0  0x0000000101416630 in ?? ()
#1  0x00007fff82e18155 in tlv_finalize () from /usr/lib/system/libdyld.dylib
#2  0x00007fff818fe768 in exit () from /usr/lib/system/libsystem_c.dylib
#3  0x00007fff82e185b4 in start () from /usr/lib/system/libdyld.dylib
#4  0x00007fff82e185ad in start () from /usr/lib/system/libdyld.dylib
#5  0x0000000000000000 in ?? ()

The disassembly from the function above the ?? in the stack trace (not sure if this is useful):

(gdb) up
#1  0x00007fff82e18155 in tlv_finalize () from /usr/lib/system/libdyld.dylib
(gdb) disas
Dump of assembler code for function tlv_finalize:
   0x00007fff82e18124 <+0>: push   %rbp
   0x00007fff82e18125 <+1>: mov    %rsp,%rbp
   0x00007fff82e18128 <+4>: push   %r15
   0x00007fff82e1812a <+6>: push   %r14
   0x00007fff82e1812c <+8>: push   %rbx
   0x00007fff82e1812d <+9>: push   %rax
   0x00007fff82e1812e <+10>:    mov    %rdi,%r14
   0x00007fff82e18131 <+13>:    mov    0x4(%r14),%r15d
   0x00007fff82e18135 <+17>:    test   %r15d,%r15d
   0x00007fff82e18138 <+20>:    je     0x7fff82e1815e <tlv_finalize+58>
   0x00007fff82e1813a <+22>:    lea    -0x1(%r15),%eax
   0x00007fff82e1813e <+26>:    shl    $0x4,%rax
   0x00007fff82e18142 <+30>:    lea    0x10(%rax,%r14,1),%rbx
   0x00007fff82e18147 <+35>:    mov    -0x8(%rbx),%rax
   0x00007fff82e1814b <+39>:    test   %rax,%rax
   0x00007fff82e1814e <+42>:    je     0x7fff82e18155 <tlv_finalize+49>
   0x00007fff82e18150 <+44>:    mov    (%rbx),%rdi
   0x00007fff82e18153 <+47>:    callq  *%rax
=> 0x00007fff82e18155 <+49>:    add    $0xfffffffffffffff0,%rbx
   0x00007fff82e18159 <+53>:    dec    %r15d
   0x00007fff82e1815c <+56>:    jne    0x7fff82e18147 <tlv_finalize+35>
   0x00007fff82e1815e <+58>:    mov    %r14,%rdi
   0x00007fff82e18161 <+61>:    add    $0x8,%rsp
   0x00007fff82e18165 <+65>:    pop    %rbx
   0x00007fff82e18166 <+66>:    pop    %r14
   0x00007fff82e18168 <+68>:    pop    %r15
   0x00007fff82e1816a <+70>:    pop    %rbp
   0x00007fff82e1816b <+71>:    jmpq   0x7fff82e185bc

I take it since there's no ud2 instruction that that's not the problem? GDB won't let me get the disassembly for the function that's actually triggering the fault.

nagisa commented 8 years ago

Hmm, at first sight it probably has nothing to do with the implementation of this library. Rather, Rust programs (like programs in most other languages) have some thread-local storage set up. In Rust, things like printing rely on TLS, and it might be a case of TLS getting corrupted for the whole program (e.g. something similar to a double free, where the Rust runtime gets unloaded twice?). I’m not sure.

If you don’t mind leaking the loaded library (e.g. because the library you load is used more than once, or for the duration of the whole program), I suggest forgetting the library so that it doesn’t execute these cleanups. That should at least avoid the issue.
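A minimal sketch of what "forgetting" means here, using std::mem::forget. FakeLibrary and count_unloads are hypothetical names invented for this illustration; FakeLibrary only stands in for libloading's Library, with its Drop impl playing the role of the dlclose call:

```rust
use std::cell::Cell;
use std::mem;
use std::rc::Rc;

// Hypothetical stand-in for libloading::Library. Its Drop impl plays the
// role of dlclose and simply counts how often it has run.
struct FakeLibrary {
    unloads: Rc<Cell<u32>>,
}

impl Drop for FakeLibrary {
    fn drop(&mut self) {
        self.unloads.set(self.unloads.get() + 1);
    }
}

// Creates one handle, then either drops it normally or forgets it, and
// returns how many times the "unload" step ran.
fn count_unloads(leak: bool) -> u32 {
    let unloads = Rc::new(Cell::new(0));
    let lib = FakeLibrary { unloads: Rc::clone(&unloads) };
    if leak {
        // The destructor is skipped, so the library would stay loaded.
        mem::forget(lib);
    } else {
        // The destructor runs, which is where dlclose would happen.
        drop(lib);
    }
    unloads.get()
}

fn main() {
    println!("unloads when dropped:   {}", count_unloads(false));
    println!("unloads when forgotten: {}", count_unloads(true));
}
```

With a real Library the idea would be the same: retrieve the symbols you need, then forget the handle instead of letting it go out of scope.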

benanders commented 8 years ago

Yeah that seems like the best option so far. I'm not rapidly opening and closing libraries where resource management is important, so leaking is the easiest way out. I hadn't seen mem::forget, thanks for that!

nagisa commented 8 years ago

According to @alexcrichton, it is very likely that the library registers some TLS destructors with pthreads, but they are executed only when the thread itself finishes rather than when the library is unloaded, so we end up executing code that no longer exists. Apparently this has been encountered in the past as well.

In this case, I’d say this is a bug in OS X itself (or its libdyld/pthreads), and the suggested fix is to “forget” the loaded library. Note that not using any TLS-related features (this includes anything related to stdio in Rust) would also avoid this bug.
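To illustrate the timing described above, here is a small pure-Rust sketch (no dynamic loading involved) showing that a thread-local destructor runs when its thread exits, not at some earlier point. Guard, SLOT and tls_destructor_timing are hypothetical names made up for this example, and the behaviour shown is what I would expect for spawned threads on Linux and OS X:

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Sets a shared flag when dropped, so we can observe when the
// thread-local destructor actually runs.
struct Guard(Arc<AtomicBool>);

impl Drop for Guard {
    fn drop(&mut self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

thread_local! {
    // Thread-local slot whose destructor is registered with the runtime.
    static SLOT: RefCell<Option<Guard>> = RefCell::new(None);
}

// Returns (destructor ran while the thread was still running,
//          destructor ran by the time the thread was joined).
fn tls_destructor_timing() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&dropped);
    let during = thread::spawn(move || {
        SLOT.with(|slot| *slot.borrow_mut() = Some(Guard(Arc::clone(&flag))));
        // The thread is still alive here, so the destructor has not run.
        flag.load(Ordering::SeqCst)
    })
    .join()
    .unwrap();
    // The thread has exited, and its TLS destructors ran as part of that.
    let after = dropped.load(Ordering::SeqCst);
    (during, after)
}

fn main() {
    let (during, after) = tls_destructor_timing();
    println!("ran during thread: {}, ran after join: {}", during, after);
}
```

If the code for such a destructor lived in a dynamic library that was unloaded before the thread exited, the call at thread exit would jump into unmapped memory, which matches the tlv_finalize frame in the backtrace above.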

calebmer commented 8 years ago

What's the status on this? We would like to use this library, and OS X support is required. This bug is a major blocker. A couple of specific questions:

If this is a bug in someone else's code, have the appropriate issues been filed? If so, are there any links to those issues?
If there are workarounds (as you mention) are there specific examples of code that doesn't work vs code that does?
Is there any progress on code being added to the library to workaround this bug?
Are there any libraries besides this one that serve the same function and don't have troubles with OS X?

emoon commented 8 years ago

I'm running into this issue as well, so I'm wondering the same thing: is this being tracked elsewhere?

benanders commented 8 years ago

As far as I know, no other issues have been filed. Last time I checked, there were no other libraries for Rust which serve the same purpose as this one. As for a workaround, I don't believe one is being worked on, and I unfortunately don't have the time, knowledge, or experience to try to fix this myself. I think we're out of luck at the moment :(

As far as who should be responsible for the bug, I'm not entirely sure. It might be a bug in Rust itself, because it doesn't seem to be specific to this library. But I'm not sure how willing the Rust maintainers would be to attempt to go about fixing it, since it involves the use of unsafe code and a native C library.

nagisa commented 8 years ago

I’m not aware of any issues reported in other projects, nor am I aware of a public OS X issue tracker of any sort where such an issue could be reported or searched for. That being said, I do admit I didn’t look very hard for either one.

nagisa commented 8 years ago

@calebmer sorry for the late response! Your comment completely fell through the cracks! Yours are all very good questions, so I’ll try to answer them thoroughly:

If this is a bug in someone else's code, have the appropriate issues been filed? If so, are there any links to those issues?

No. No upstream bugs have been filed, primarily because I’m not very familiar with the OS X community or the issue reporting process. Last time I checked, one had to pay 100 USD upfront even to report an issue in Apple’s own OS.

If there are workarounds (as you mention) are there specific examples of code that doesn't work vs code that does?

Two workarounds are:

Never closing the library which exposes the issue (e.g. mem::forget(library) after the necessary symbols are retrieved), as mentioned previously;
Ensuring the loaded library does not invoke thread-local functionality, though that might not always be feasible. Writing external libraries in languages which do not rely on TLS as extensively as Rust might help. Not using the Rust standard library (#![no_std]) would also make this easier.

Is there any progress on code being added to the library to workaround this bug?

I’m not sure it is possible to properly resolve this issue from within this library. An option would be to leak all the opened libraries by default on OS X, but I wouldn’t consider that a viable option.

Are there any libraries besides this one that serve the same function and don't have troubles with OS X?

You could certainly use dlopen, dlsym and dlclose directly, but you would almost certainly hit the same issue as with this library. Avoiding dlopen would involve writing a whole dynamic linker for the platform of your choice by yourself.

@GravityScore you said

since it involves the use of unsafe code and a native C library

What do you mean? Rust’s standard library on OS X is strongly tied to the standard libc and contains a large amount of unsafe code. If using some additional unsafe code in the standard library would avoid the issue, I think the fix would be gladly accepted. However, I don’t think it would solve the issue in general: one could still produce a library which uses TLS in a way that exposes this issue, regardless of what’s done in the Rust compiler or the standard library.

emoon commented 8 years ago

I have a question (it's somewhat generic to Rust, but bear with me). I keep track of the Library within a struct here: https://github.com/emoon/dynamic_reload/blob/master/src/lib.rs#L44, which is later stuffed into a Vec<Rc<Lib>>. How should I do the forget in this case? Should I implement Drop for the struct that holds this data and then call mem::forget on lib?

nagisa commented 8 years ago

@emoon I guess the least intrusive stable way currently would be to do something like this and then wrap your Library in the Leak.

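// Wrapper that leaks its contents instead of running their destructor.
struct Leak<T>(Option<T>);

impl<T> Drop for Leak<T> {
    fn drop(&mut self) {
        // Move the value out and forget it, so T's own Drop never runs.
        ::std::mem::forget(self.0.take());
    }
}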
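A rough usage sketch for the wrapper, assuming a layout like the one described in the question (a struct holding the library, stored in a Vec<Rc<Lib>>). FakeLibrary, the Lib field name and unloads_after_clearing are hypothetical stand-ins, since the real Library needs an actual dylib to load:

```rust
use std::cell::Cell;
use std::mem;
use std::rc::Rc;

// The wrapper from above: leaks its contents instead of dropping them.
struct Leak<T>(Option<T>);

impl<T> Drop for Leak<T> {
    fn drop(&mut self) {
        mem::forget(self.0.take());
    }
}

// Hypothetical stand-in for libloading::Library; Drop plays the role of
// dlclose and counts how often it has run.
struct FakeLibrary {
    unloads: Rc<Cell<u32>>,
}

impl Drop for FakeLibrary {
    fn drop(&mut self) {
        self.unloads.set(self.unloads.get() + 1);
    }
}

// Hypothetical mirror of a struct that keeps track of a loaded library.
struct Lib {
    handle: Leak<FakeLibrary>,
}

// Stores two wrapped libraries in a Vec<Rc<Lib>>, clears the Vec, and
// returns how many times the "unload" step ran.
fn unloads_after_clearing() -> u32 {
    let unloads = Rc::new(Cell::new(0));
    let mut libs: Vec<Rc<Lib>> = Vec::new();
    for _ in 0..2 {
        let raw = FakeLibrary { unloads: Rc::clone(&unloads) };
        libs.push(Rc::new(Lib { handle: Leak(Some(raw)) }));
    }
    // The library stays reachable through the wrapper while it is alive.
    assert!(libs[0].handle.0.is_some());
    // Dropping every Rc<Lib> drops the Leak wrappers, which forget the
    // inner value instead of running its destructor.
    libs.clear();
    unloads.get()
}

fn main() {
    println!("unloads after clearing: {}", unloads_after_clearing());
}
```

This way no Drop impl is needed on the outer struct; the wrapper alone keeps the destructor from running.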

emoon commented 8 years ago

@nagisa Alright. Thanks!

calebmer commented 8 years ago

@nagisa totally understand the delay, thanks for the great response! 😊

MasonRemaley commented 8 years ago

I believe this is caused by #28794; if I understand correctly, it's an issue with the way the Rust compiler generates dylibs.

(I think you'd get the same crash in C if you called dlclose, but that you wouldn't get the crash from either language if the library being loaded wasn't written in Rust.)

nagisa commented 8 years ago

if I understand correctly it's an issue with the way the Rust compiler generates dylibs.

There’s nothing specific to dylib generation here; rather, it is about how the Rust standard library implements TLS on OS X.

but that you wouldn't get the crash from either language if the library being loaded wasn't written in Rust.)

You could use/implement TLS destructors using that function in any other language and hit exactly the same issues too.

Either way, thanks for finding and cross-referencing the issue.

MasonRemaley commented 8 years ago

No problem, thanks for the explanation!

nagisa commented 8 years ago

Since 0.3 you can specify arbitrary flags when opening a library. The RTLD_NODELETE flag (thanks for the reminder, @Np2x) essentially acts as an implicit mem::forget, so you can now do something along the lines of:

let os_lib = libloading::os::unix::Library::open("fname", RTLD_NODELETE | RTLD_NOW)?;
let lib = libloading::Library::from(os_lib);
/* do your stuff */

This should still work while liberating you from having to mem::forget your libraries :)

nagisa commented 6 years ago

As per this comment, Apple has fixed this issue by making unloading a no-op for dynamic libraries that use TLS.

nagisa / rust_libloading

TLS destructors are not run on Library::drop resulting in illegal instruction on OS X #5