tensorflow / rust

Rust language bindings for TensorFlow
Apache License 2.0

Tensorflow is taking over my openssl and causing segfaults #417

Open msdrigg opened 1 year ago

msdrigg commented 1 year ago

I recently added tensorflow to a Rust project that already depended on an external openssl (through reqwest and paho-mqtt), and I immediately started seeing segfaults. The strange thing is that the segfaults come from crypto functions in the libtensorflow_framework.so.2 library being called from paho-mqtt (SSLSocket_initialize in the core dump shown below). If I remove paho-mqtt's dependency on ssl, I see similar crashes with reqwest.

Relevant Logs

This backtrace reliably occurs every time I run my program.

(gdb) bt
#0  __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x0)
    at ./nptl/pthread_rwlock_common.c:603
#1  ___pthread_rwlock_wrlock (rwlock=0x0) at ./nptl/pthread_rwlock_wrlock.c:26
#2  0x00007f8ec0e6db69 in CRYPTO_STATIC_MUTEX_lock_write ()
   from /home/myuser/workspace/target/debug/build/tensorflow-sys-b3a831e1f8b18f5e/out/libtensorflow_framework.so.2
#3  0x00007f8ec0df6263 in CRYPTO_get_ex_new_index ()
   from /home/myuser/workspace/target/debug/build/tensorflow-sys-b3a831e1f8b18f5e/out/libtensorflow_framework.so.2
#4  0x0000564ee8a50b43 in SSLSocket_initialize ()
    at /home/myuser/.cargo/registry/src/index.crates.io-6f17d22bba15001f/paho-mqtt-sys-0.8.1/paho.mqtt.c/src/SSLSocket.c:492
#5  0x0000564ee8a440ff in MQTTAsync_createWithOptions (handle=0x7f8ea4bdfe00, 
    serverURI=0x7f8df4004fc0 "tcp://localhost:1883", 
    clientId=0x7f8df4004fe0 "program", persistence_type=1, 
    persistence_context=0x0, options=0x7f8ea4bdfcc8)
    at /home/myuser/.cargo/registry/src/index.crates.io-6f17d22bba15001f/paho-mqtt-sys-0.8.1/paho.mqtt.c/src/MQTTAsync.c:372
#6  0x0000564ee8a22c37 in paho_mqtt::async_client::AsyncClient::new<paho_mqtt::create_options::CreateOptions> (opts=...) at src/async_client.rs:201
#7  0x0000564ee8a2127a in paho_mqtt::create_options::CreateOptionsBuilder::create_client (self=...)
    at src/create_options.rs:444
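
The frames that matter in a trace like this are the ones where an OpenSSL-style symbol is reported as coming from a TensorFlow library. A minimal sketch for picking those out of a saved gdb backtrace, assuming the two-line "in SYMBOL ()" / "from LIBRARY" layout shown above; tls_frames is a hypothetical helper name, and the SSL_ / CRYPTO_ prefixes are a heuristic rather than a complete list:

```rust
/// Scans gdb `bt` output and returns `(symbol, library)` pairs for frames
/// whose symbol looks like an OpenSSL/BoringSSL entry point and whose
/// following line names the shared object it was resolved from.
/// Hypothetical helper; it only understands the layout shown in this issue.
fn tls_frames(backtrace: &str) -> Vec<(&str, &str)> {
    let lines: Vec<&str> = backtrace.lines().map(str::trim).collect();
    let mut found = Vec::new();
    for pair in lines.windows(2) {
        // A frame line looks like: `#3  0x... in SYMBOL ()`
        let symbol = match pair[0].split_once(" in ") {
            Some((head, rest)) if head.starts_with('#') => {
                rest.split_whitespace().next().unwrap_or("")
            }
            _ => continue,
        };
        // Heuristic: these prefixes cover the symbols seen in the trace above.
        let is_tls = symbol.starts_with("SSL_") || symbol.starts_with("CRYPTO_");
        // The library, when gdb prints it, is on the next line as `from PATH`.
        if let (true, Some(lib)) = (is_tls, pair[1].strip_prefix("from ")) {
            found.push((symbol, lib.trim()));
        }
    }
    found
}
```

Run against the trace above, this would report the two CRYPTO_ frames together with the libtensorflow_framework.so.2 path, and skip SSLSocket_initialize, which belongs to paho-mqtt itself.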

Interestingly, here's what I see from ldd. Note that libssl.so.3 does correctly point to the real openssl, so I don't know why the crypto symbols end up resolving to libtensorflow_framework.so.2 at runtime.

$ ldd target/debug/program
        linux-vdso.so.1 (0x00007ffc46ffe000)
        libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2 (0x00007fb1b0000000)
        libtensorflow.so.2 => /usr/local/lib/libtensorflow.so.2 (0x00007fb19f000000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007fb1b767e000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fb19ea00000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb1b765e000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb19ef19000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb19e600000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb1b773c000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb19e200000)

Note: I am using the latest Rust version and the latest versions of all packages mentioned here. Here is my uname -a output:

Linux pop-os 6.4.6-76060406-generic #202307241739~1690928105~22.04~d567a38 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux

Prior Art

The only other mention of this issue I could find was here https://github.com/tensorflow/tensorflow/issues/34742, and I am currently trying to resolve my problem using the steps outlined in that issue.

Goals

A perfect fix would let me use tensorflow and openssl together in a project without any tweaks, but I would consider this issue closed if we could find a workaround (environment variables, a build script, or something similar) that lets my project run without segfaulting.

msdrigg commented 1 year ago

I tried all the solutions mentioned in https://github.com/tensorflow/tensorflow/issues/34742, and none of them worked. My final attempt was bazel build --compilation_mode=opt --jobs=25 --config=noaws --config=nogcp --config=nohdfs --config=nonccl --config=monolithic tensorflow, and it still did not solve the problem.

adamcrume commented 1 year ago

Are you pointing Rust to the TensorFlow library you built? There are instructions on how to do that at https://github.com/tensorflow/rust/blob/master/tensorflow-sys/README.md#manual-tensorflow-compilation.

msdrigg commented 1 year ago

Yes, I moved the compiled objects into /usr/local/lib and ran ldconfig on the directory.

treehaqr commented 5 months ago

+1 for me. All I do is open an HTTP connection with the reqwest crate and it crashes. The request is totally unrelated to tensorflow, but tensorflow somehow takes ownership of the openssl lib.

  * frame #0: 0x00007fffcc6969fc libc.so.6`__GI___pthread_kill at pthread_kill.c:44:76
    frame #1: 0x00007fffcc6969b0 libc.so.6`__GI___pthread_kill [inlined] __pthread_kill_internal(signo=6, threadid=140737314203328) at pthread_kill.c:78:10
    frame #2: 0x00007fffcc6969b0 libc.so.6`__GI___pthread_kill(threadid=140737314203328, signo=6) at pthread_kill.c:89:10
    frame #3: 0x00007fffcc642476 libc.so.6`__GI_raise(sig=6) at raise.c:26:13
    frame #4: 0x00007fffcc6287f3 libc.so.6`__GI_abort at abort.c:79:7
    frame #5: 0x00007fffcc689676 libc.so.6`__libc_message(action=do_abort, fmt="\U00000010") at libc_fatal.c:155:5
    frame #6: 0x00007fffcc6a0cfc libc.so.6`malloc_printerr(str=<unavailable>) at malloc.c:5664:3
    frame #7: 0x00007fffcc6a2a44 libc.so.6`_int_free(av=<unavailable>, p=<unavailable>, have_lock=0) at malloc.c:4439:5
    frame #8: 0x00007fffcc6a5453 libc.so.6`__GI___libc_free(mem=<unavailable>) at malloc.c:3391:7
    frame #9: 0x00007ffff70a1c9a libtensorflow_framework.so.2`bssl::ssl_crypto_x509_ssl_ctx_free(ssl_ctx_st*) + 58
    frame #10: 0x00007ffff7094f86 libtensorflow_framework.so.2`ssl_ctx_st::~ssl_ctx_st() + 70
    frame #11: 0x00007ffff7095456 libtensorflow_framework.so.2`SSL_CTX_free + 38
    frame #12: 0x00005555565c990e program`_$LT$openssl..ssl..SslContext$u20$as$u20$core..ops..drop..Drop$GT$::drop::he1e1bafd7778b929(self=0x00007fffcbe22000) at lib.rs:241:26
    frame #13: 0x00005555565d58da program`core::ptr::drop_in_place$LT$openssl..ssl..SslContext$GT$::h8483f3eb796b6aee((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #14: 0x000055555600318b program`core::ptr::drop_in_place$LT$openssl..ssl..connector..SslConnector$GT$::ha52c6b5831405ca0((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #15: 0x000055555600316b program`core::ptr::drop_in_place$LT$native_tls..imp..TlsConnector$GT$::h898a325e5e6a2390((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #16: 0x000055555600315b program`core::ptr::drop_in_place$LT$native_tls..TlsConnector$GT$::h05dcf2f2ec19f859((null)=0x00007fffcbe22000) at mod.rs:497:1
    frame #17: 0x0000555555f4215c program`core::ptr::drop_in_place$LT$reqwest..connect..Inner$GT$::h1ccf2f0fb635dba6((null)=0x00007fffcbe21fe8) at mod.rs:497:1
    frame #18: 0x0000555555f425cb program`core::ptr::drop_in_place$LT$reqwest..connect..Connector$GT$::hbfb676efb078b00f((null)=0x00007fffcbe21fd8) at mod.rs:497:1
    frame #19: 0x0000555555f3a42e program`core::ptr::drop_in_place$LT$hyper_util..client..legacy..client..Client$LT$reqwest..connect..Connector$C$reqwest..async_impl..body..Body$GT$$GT$::h6a79d474bb243160((null)=0x00007fffcbe21f10) at mod.rs:497:1
    frame #20: 0x0000555555f4346c program`core::ptr::drop_in_place$LT$reqwest..async_impl..client..ClientRef$GT$::h621ffac56c4ab15f((null)=0x00007fffcbe21f10) at mod.rs:497:1
    frame #21: 0x0000555555f0ee3f program`alloc::sync::Arc$LT$T$C$A$GT$::drop_slow::h6322ecbb95a2aa20(self=0x00007fffffff3540) at sync.rs:1751:18
    frame #22: 0x0000555555f147e5 program`_$LT$alloc..sync..Arc$LT$T$C$A$GT$$u20$as$u20$core..ops..drop..Drop$GT$::drop::h25fd65b8fc0fe2ed(self=0x00007fffffff3540) at sync.rs:2407:13
    frame #23: 0x0000555555f4485b program`core::ptr::drop_in_place$LT$alloc..sync..Arc$LT$reqwest..async_impl..client..ClientRef$GT$$GT$::h15126958e085d642((null)=0x00007fffffff3540) at mod.rs:497:1
    frame #24: 0x0000555555f431db program`core::ptr::drop_in_place$LT$reqwest..async_impl..client..Client$GT$::h9be9cca15fbe82e9((null)=0x00007fffffff3540) at mod.rs:497:1

$ ldd target/debug/program
        linux-vdso.so.1 (0x00007ffef968b000)
        libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2 (0x0000783696400000)
        libtensorflow.so.2 => /usr/local/lib/libtensorflow.so.2 (0x000078366de00000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x000078369895c000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x000078366d800000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000078366d400000)
        libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x000078369ba1f000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000078369b9ff000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000078366dd19000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000078366d000000)
        /lib64/ld-linux-x86-64.so.2 (0x000078369ba80000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000078369b9f8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000078369b9f3000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000078369b9ee000)

treehaqr commented 5 months ago

I worked around this by making libssl and libcrypto come before tensorflow in the library order shown above. Create a build.rs with this code:

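// build.rs
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Ask Cargo to link the system libssl and libcrypto as dynamic libraries.
    // In my case this listed them ahead of libtensorflow in the ldd output.
    println!("cargo:rustc-link-lib=dylib=ssl");
    println!("cargo:rustc-link-lib=dylib=crypto");
    Ok(())
}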

and note that libssl and libcrypto are now listed before libtensorflow, so the program no longer tries to use tensorflow's statically linked ssl:

        linux-vdso.so.1 (0x00007fffbb3a9000)
        libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x0000772aa0f5c000) <-- here
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x0000772aa0a00000) <-- here
        libtensorflow_framework.so.2 => /usr/local/lib/libtensorflow_framework.so.2 (0x0000772a9b400000)
        libtensorflow.so.2 => /usr/local/lib/libtensorflow.so.2 (0x0000772a76000000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000772a75c00000)
        libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x0000772aa401c000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000772aa3ffc000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000772aa0e75000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000772a75800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000772aa407d000)
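
The same check can be done programmatically on captured ldd output. A minimal sketch, assuming that the order of the ldd listing reflects the order in which the dynamic loader searches libraries for a symbol (which matches what the two listings in this thread suggest, but is not verified here); ldd_sonames and listed_before are hypothetical helper names:

```rust
/// Returns the sonames listed by `ldd`, in the order they appear.
/// Lines without `=>` (the vdso and the loader itself) are skipped.
fn ldd_sonames(ldd_output: &str) -> Vec<&str> {
    ldd_output
        .lines()
        .filter_map(|line| {
            let (name, _path) = line.split_once("=>")?;
            Some(name.trim())
        })
        .collect()
}

/// True if every library matching one of the `first` prefixes is listed
/// before the first library matching `later_prefix`.
/// Returns true when no library matches `later_prefix` (nothing to shadow),
/// and false when a library from `first` is missing altogether.
fn listed_before(ldd_output: &str, first: &[&str], later_prefix: &str) -> bool {
    let names = ldd_sonames(ldd_output);
    let later_pos = match names.iter().position(|n| n.starts_with(later_prefix)) {
        Some(pos) => pos,
        None => return true,
    };
    first.iter().all(|prefix| {
        names
            .iter()
            .position(|n| n.starts_with(*prefix))
            .map_or(false, |pos| pos < later_pos)
    })
}
```

For example, listed_before(output, &["libssl.so", "libcrypto.so"], "libtensorflow") would be false for the first listing in this thread and true for the one produced with the build.rs workaround.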

It's still broken for unit tests because, to my knowledge, there's no way to enforce the linking order in tests.

Ideally libtensorflow would never be statically linked to openssl, and would instead let the binary choose its own libssl.