sfackler / r2d2

A generic connection pool for Rust
Apache License 2.0

Seg fault with Postgres #137

Closed eloff closed 1 year ago

eloff commented 1 year ago

I posted this to diesel, but the issue appears to be with r2d2. All the core dumps are from r2d2 worker threads.

https://github.com/diesel-rs/diesel/issues/3441

Repro (see README for screenshot of one stacktrace): https://github.com/eloff/diesel-segfault

Versions

I can't reproduce this issue in the docker-compose I provided, just on my Ubuntu 20.04 dev machine. Any idea why that might be? Anything I can test?

sfackler commented 1 year ago

There is zero unsafe code in this library. It is extremely unlikely the crash is due to a bug in r2d2.

The screenshot of the stacktrace in that repo is of a thread blocked in a futex_wait call, which is not a location that would be triggering a segfault. You should find the faulting thread.

eloff commented 1 year ago

With no unsafe code, it does seem unlikely. It must be related to the interplay between the libpq-based driver in diesel and r2d2. All the core dumps are for r2d2 worker threads.

Note that the crash doesn't happen if I comment out either of the min_idle or max_size lines:

Pool::builder()
    .min_idle(Some(10))
    .max_size(96)
    .build(manager)
    .expect("Failed to create pool.")
sfackler commented 1 year ago

> All the core dumps are for r2d2 worker threads.

What specifically do you mean by that? Core dumps are normally a dump of process memory, including all threads in the application at the time of the crash.

eloff commented 1 year ago

I use Python in my day job, so apologies if I'm uneducated here:

$ cat /proc/sys/kernel/core_pattern
core.%e.%p
$ ls core.*
core.r2d2-worker-0.245652  core.r2d2-worker-1.121711  core.r2d2-worker-1.260255
core.r2d2-worker-0.254359  core.r2d2-worker-1.208703  core.r2d2-worker-2.208734
core.r2d2-worker-0.254489  core.r2d2-worker-1.245542
core.r2d2-worker-0.254602  core.r2d2-worker-1.245628
sfackler commented 1 year ago

I believe that indicates that the crashing thread was an r2d2 worker thread, but not the one that is included in the screenshot.

eloff commented 1 year ago

All the core dumps show futex_wait, called from run() -> condvar -> parking_lot:

See: https://github.com/eloff/diesel-segfault/blob/main/coredump.png

sfackler commented 1 year ago

There are multiple threads in each core dump. Click the drop-down that says Thread-4.

eloff commented 1 year ago

That is thread-4, the one that crashed. Here are the others:

https://github.com/eloff/diesel-segfault/blob/main/thread-1.png
https://github.com/eloff/diesel-segfault/blob/main/thread-2.png
https://github.com/eloff/diesel-segfault/blob/main/thread-3.png

sfackler commented 1 year ago

I have already said that thread-4 is not the one that crashed. I'm not sure how much more explicit I can be.

Thread 1 is trying to take a lock inside of OpenSSL while thread 2 is performing cleanup inside of an atexit handler set by OpenSSL. This is probably https://github.com/openssl/openssl/issues/6214.

eloff commented 1 year ago

> I have already said that thread-4 is not the one that crashed. I'm not sure how much more explicit I can be.

Sorry, I thought the little lightning bolt there meant this was the thread that caused the problem. I don't load coredumps often.

You're right, and the stacktrace looks similar to the ones in that OpenSSL issue:

* thread #2, name = 'r2d2-worker-0', stop reason = signal SIGSEGV: invalid address (fault address: 0x18)
  * frame #0: 0x00007ffff7f21d46 libpthread.so.0`__GI___pthread_rwlock_wrlock at pthread_rwlock_common.c:604:7
    frame #1: 0x00007ffff7f21d46 libpthread.so.0`__GI___pthread_rwlock_wrlock(rwlock=0x0000000000000000) at pthread_rwlock_wrlock.c:27
    frame #2: 0x00007ffff7867889 libcrypto.so.1.1`CRYPTO_THREAD_write_lock(lock=<unavailable>) at threads_pthread.c:78
    frame #3: 0x00007ffff7839532 libcrypto.so.1.1`RAND_get_rand_method at rand_lib.c:849
    frame #4: 0x00007ffff78399f9 libcrypto.so.1.1`RAND_status at rand_lib.c:958
    frame #5: 0x00007ffff7f7f545 libpq.so.5`___lldb_unnamed_symbol294$$libpq.so.5 + 37
    frame #6: 0x00007ffff7f5ef2e libpq.so.5`___lldb_unnamed_symbol8$$libpq.so.5 + 142
    frame #7: 0x00007ffff7f76be1 libpq.so.5`___lldb_unnamed_symbol139$$libpq.so.5 + 1873
    frame #8: 0x00007ffff7f6362a libpq.so.5`PQconnectPoll + 3578
    frame #9: 0x00007ffff7f64797 libpq.so.5`___lldb_unnamed_symbol35$$libpq.so.5 + 279
    frame #10: 0x00007ffff7f679e8 libpq.so.5`PQconnectdb + 56
    frame #11: 0x00005555555c78ff acme`diesel::pg::connection::raw::RawConnection::establish::hf66f33a54d350dd8(database_url=(data_ptr = "postgres://acme:1021f155ba05dd9cbfa2a955@/acme?sslmode=disable", length = 62)) at raw.rs:23:39
    frame #12: 0x00005555555cfbf1 acme`_$LT$diesel..pg..connection..PgConnection$u20$as$u20$diesel..connection..Connection$GT$::establish::h55fc3bc2dc70ac2f(database_url=(data_ptr = "postgres://acme:1021f155ba05dd9cbfa2a955@/acme?sslmode=disable", length = 62)) at mod.rs:170:9
    frame #13: 0x000055555557a8aa acme`_$LT$diesel..r2d2..ConnectionManager$LT$T$GT$$u20$as$u20$r2d2..ManageConnection$GT$::connect::hb76e3f16e2530650(self=0x00005555556a0888) at r2d2.rs:192:9
    frame #14: 0x0000555555578e5c acme`r2d2::add_connection::inner::_$u7b$$u7b$closure$u7d$$u7d$::he22fd9af265b9649 at lib.rs:241:24
    frame #15: 0x000055555557a608 acme`scheduled_thread_pool::thunk::Thunk$LT$$LP$$RP$$C$R$GT$::new::_$u7b$$u7b$closure$u7d$$u7d$::hde78cba3c8d5d267((null)=<unavailable>) at thunk.rs:20:35
    frame #16: 0x000055555557a64d acme`_$LT$F$u20$as$u20$scheduled_thread_pool..thunk..Invoke$LT$A$C$R$GT$$GT$::invoke::h6651609b8afd4a0d(self=0x0000555555685680, arg=<unavailable>) at thunk.rs:50:9
    frame #17: 0x00005555555de540 acme`scheduled_thread_pool::thunk::Thunk$LT$A$C$R$GT$::invoke::h0d3f0d3691c28468(self=Thunk<(), ()> @ 0x00007ffff6af7ee0, arg=<unavailable>) at thunk.rs:35:9
    frame #18: 0x00005555555e1007 acme`scheduled_thread_pool::Worker::run_job::h204e5ba542e425b0(self=0x00007ffff6af8940, job=Job @ 0x00007ffff6af83b8) at lib.rs:364:33
    frame #19: 0x00005555555e0a82 acme`scheduled_thread_pool::Worker::run::_$u7b$$u7b$closure$u7d$$u7d$::h9f5de07f20041366 at lib.rs:326:61
    frame #20: 0x00005555555dd750 acme`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h922b39ce782e6da9(self=<unavailable>, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #21: 0x00005555555e6efa acme`std::panicking::try::do_call::h353343075a44a7a2(data="\b\x88\xaf��) at panicking.rs:492:40
    frame #22: 0x00005555555e7b5b acme`__rust_try + 27
    frame #23: 0x00005555555e6ce6 acme`std::panicking::try::hbd1af4a4897eae11(f=<unavailable>) at panicking.rs:456:19
    frame #24: 0x00005555555dcba1 acme`std::panic::catch_unwind::h8487ef104e1ed0b7(f=<unavailable>) at panic.rs:137:14
    frame #25: 0x00005555555e0a0d acme`scheduled_thread_pool::Worker::run::h0b0fddce95686635(self=0x00007ffff6af8940) at lib.rs:326:21
    frame #26: 0x00005555555e08b0 acme`scheduled_thread_pool::Worker::start::_$u7b$$u7b$closure$u7d$$u7d$::h685180c0a62bc588 at lib.rs:320:30
    frame #27: 0x00005555555dcb6e acme`std::sys_common::backtrace::__rust_begin_short_backtrace::h148f92b33cf1c7fb(f={closure_env#0} @ 0x00007ffff6af8968) at backtrace.rs:122:18
    frame #28: 0x00005555555eb49b acme`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h6c24d17a05224da0 at mod.rs:514:17
    frame #29: 0x00005555555dd71f acme`_$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h329ec1e2ffc93f6a(self=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af89b0, _args=<unavailable>) at unwind_safe.rs:271:9
    frame #30: 0x00005555555e6f61 acme`std::panicking::try::do_call::h7ad9760e4b97e4ac(data="�,jUUU") at panicking.rs:492:40
    frame #31: 0x00005555555e7b5b acme`__rust_try + 27
    frame #32: 0x00005555555e6d8f acme`std::panicking::try::hf0bb4c1fa863e31d(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af8a80) at panicking.rs:456:19
    frame #33: 0x00005555555dcbcf acme`std::panic::catch_unwind::h86a8b4bcde1746d1(f=AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<scheduled_thread_pool::{impl#8}::start::{closure_env#0}, ()>> @ 0x00007ffff6af8ad0) at panic.rs:137:14
    frame #34: 0x00005555555eb26d acme`std::thread::Builder::spawn_unchecked_::_$u7b$$u7b$closure$u7d$$u7d$::h2c1a5cd6dd347e00 at mod.rs:513:30
    frame #35: 0x00005555555e482f acme`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h662c6075d9edf8bc((null)=0x00005555556a0990, (null)=<unavailable>) at function.rs:248:5
    frame #36: 0x0000555555623c63 acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h49f797984e2121bf at boxed.rs:1940:9
    frame #37: 0x0000555555623c5d acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::hfa4f3d0ee6440e0b at boxed.rs:1940
    frame #38: 0x0000555555623c56 acme`std::sys::unix::thread::Thread::new::thread_start::h62ca48b42d48a8fc at thread.rs:108
    frame #39: 0x00007ffff7f1c609 libpthread.so.0`start_thread(arg=<unavailable>) at pthread_create.c:477:8
    frame #40: 0x00007ffff7cec133 libc.so.6`__clone at clone.S:95
eloff commented 1 year ago

In that issue, a developer states:

> All threads using openssl must be done before main thread exits. I'm not so sure that's unreasonable, or out of step with other things. But it's not going to be addressed in this release. Maybe not ever, just to set expectations accurately.

Is there a way I can join() all the worker threads for the pool r2d2 is using at the end of main() to meet that requirement?

sfackler commented 1 year ago

I'd recommend disabling the atexit callback instead. One way to do this is to call openssl-sys's init function at the very beginning of your program.
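
For example, something along these lines at the very start of the program (a minimal sketch, assuming the openssl crate is added as an explicit dependency; its init() forwards to openssl-sys's initialization):

fn main() {
    // Initialize OpenSSL explicitly before any pool worker thread can
    // trigger a lazy initialization through libpq. With a recent enough
    // openssl/openssl-sys this avoids registering the atexit cleanup
    // handler that the worker threads end up racing against.
    openssl::init();

    // ... set up the diesel ConnectionManager and r2d2 Pool as usual ...
}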

eloff commented 1 year ago

> I'd recommend disabling the atexit callback instead. One way to do this is to call openssl-sys's init function at the very beginning of your program.

I'll give that a try and report back.

eloff commented 1 year ago

You called it right. Bringing in the latest openssl crate (0.10.43) for Rust as an explicit dependency and calling openssl::init() at the top of main() fixes the crash.

This recent commit to rust-openssl is the one that fixes it: https://github.com/sfackler/rust-openssl/pull/1649/files#r1040260747. I guess I have OpenSSL 1.1.1b on my machine. I checked under the debugger that it is that line being executed that disables the atexit handler.

liskin commented 4 days ago

Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works.

I don't suppose there's any chance we could ask the worker threads to exit cleanly and join them? (related: https://github.com/sfackler/r2d2/issues/99)

sfackler commented 4 days ago

> Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works.

That's not correct. OPENSSL_INIT_NO_ATEXIT exists in OpenSSL 3.x.

liskin commented 4 days ago

> > Note that this bug reappeared with the update to OpenSSL 3.x, as there's no flag like OPENSSL_INIT_NO_ATEXIT and thus the original fix for this issue in openssl-sys no longer works.
>
> That's not correct. OPENSSL_INIT_NO_ATEXIT exists in OpenSSL 3.x.

Oh, indeed, apologies. I should get more sleep before commenting next time. Nevermind then. Sorry again!