neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Statically compiled pgbench crashes with SSL error: internal error / SIGSEGV / SIGABRT #8275

Open bayandin opened 4 months ago

bayandin commented 4 months ago

Steps to reproduce

Expected result

It doesn't fail.

Actual result

It fails with SSL error: internal error / SIGSEGV / SIGABRT

Note: Unfortunately, I could not reproduce the error with a debug build of Postgres:

transaction type: test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql
scaling factor: 1
query mode: simple
number of clients: 100
number of threads: 20
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 141659
number of failed transactions: 0 (0.000%)
latency average = 628.999 ms
latency stddev = 1347.496 ms
initial connection time = 9329.005 ms
tps = 158.932540 (without initial connection time)
[Inferior 1 (process 27239) exited normally]
(gdb) 

With a release build, we get a segmentation fault or internal errors:

nonroot@03722beba63d:~/neon$ ./pg_install/v16/bin/pgbench -f test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql -c100 -j20 -T900 -P2 --verbose-errors "postgresql://neondb_owner:secret@ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build/neondb?sslmode=require"
pgbench (16.3 (b810fdfcbb59afea7ea7bbe0cf94eaccb55a2ea2))
starting vacuum...pgbench: error: ERROR:  relation "pgbench_branches" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
pgbench: error: ERROR:  relation "pgbench_tellers" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
pgbench: error: ERROR:  relation "pgbench_history" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
end.
Segmentation fault

With a release build run under gdb, we see free(): invalid pointer:

nonroot@e36fb5340ef9:~/neon$ gdb ./pg_install/v16/bin/pgbench
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./pg_install/v16/bin/pgbench...
(gdb) run -f test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql -c100 -j20 -T900 -P2 --verbose-errors "postgresql://neondb_owner:secret@ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build/neondb?sslmode=require"
Starting program: /home/nonroot/neon/pg_install/v16/bin/pgbench -f test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql -c100 -j20 -T900 -P2 --verbose-errors "postgresql://neondb_owner:secret@ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build/neondb?sslmode=require"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
pgbench (16.3 (b810fdfcbb59afea7ea7bbe0cf94eaccb55a2ea2))
starting vacuum...pgbench: error: ERROR:  relation "pgbench_branches" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
pgbench: error: ERROR:  relation "pgbench_tellers" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
pgbench: error: ERROR:  relation "pgbench_history" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
end.
[New Thread 0xfffff79d01e0 (LWP 530)]
[New Thread 0xffffef1cf1e0 (LWP 531)]
[New Thread 0xfffff71cf1e0 (LWP 532)]
[New Thread 0xfffff69ce1e0 (LWP 533)]
[New Thread 0xfffff61cd1e0 (LWP 534)]
[New Thread 0xfffff59cc1e0 (LWP 535)]
[New Thread 0xfffff51cb1e0 (LWP 536)]
[New Thread 0xfffff49ca1e0 (LWP 537)]
[New Thread 0xffffeffff1e0 (LWP 538)]
[New Thread 0xffffee9ce1e0 (LWP 539)]
[New Thread 0xffffee1cd1e0 (LWP 540)]
[New Thread 0xffffed9cc1e0 (LWP 541)]
[New Thread 0xffffed1cb1e0 (LWP 542)]
[New Thread 0xffffec9ca1e0 (LWP 543)]
[New Thread 0xffffb7fff1e0 (LWP 544)]
[New Thread 0xffffb77fe1e0 (LWP 545)]
[New Thread 0xffffb6ffd1e0 (LWP 546)]
[New Thread 0xffffb67fc1e0 (LWP 547)]
[New Thread 0xffffb5ffb1e0 (LWP 548)]
pgbench: error: connection to server at "ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build" (3.132.205.114), port 5432 failed: could not create SSL context: no cipher set
pgbench: error: could not create connection for client 25
free(): invalid pointer
free(): invalid pointer
free(): invalid pointer
pgbench: error: connection to server at "ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build" (3.132.205.114), port 5432 failed: could not create SSL context: no SSL error reported
pgbench: error: connection to server at "ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build" (3.132.205.114), port 5432 failed: could not create SSL context: no SSL error reported
pgbench: error: could not create connection for client 90
pgbench: error: connection to server at "ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build" (3.132.205.114), port 5432 failed: could not create SSL context: no SSL error reported
pgbench: error: could not create connection for client 15
pgbench: error: could not create connection for client 80
Cannot find user-level thread for LWP 527: generic error
(gdb) Cannot find user-level thread for LWP 534: generic error
(gdb) Cannot find user-level thread for LWP 543: generic error
(gdb) [Thread 0xffffb5ffb1e0 (LWP 548) exited]
[Thread 0xffffb67fc1e0 (LWP 547) exited]
[Thread 0xffffb6ffd1e0 (LWP 546) exited]
[Thread 0xffffb77fe1e0 (LWP 545) exited]
[Thread 0xffffb7fff1e0 (LWP 544) exited]
[Thread 0xffffec9ca1e0 (LWP 543) exited]
[Thread 0xffffed1cb1e0 (LWP 542) exited]
[Thread 0xffffed9cc1e0 (LWP 541) exited]
[Thread 0xffffee1cd1e0 (LWP 540) exited]
[Thread 0xffffee9ce1e0 (LWP 539) exited]
[Thread 0xffffeffff1e0 (LWP 538) exited]
[Thread 0xfffff49ca1e0 (LWP 537) exited]
[Thread 0xfffff51cb1e0 (LWP 536) exited]
[Thread 0xfffff59cc1e0 (LWP 535) exited]
[Thread 0xfffff61cd1e0 (LWP 534) exited]
[Thread 0xfffff69ce1e0 (LWP 533) exited]
[Thread 0xfffff71cf1e0 (LWP 532) exited]
[Thread 0xffffef1cf1e0 (LWP 531) exited]
[Thread 0xfffff79d01e0 (LWP 530) exited]
bt
Selected thread is running.
(gdb) quit

Under valgrind, the error also doesn't reproduce:

nonroot@e36fb5340ef9:~/neon$ valgrind --leak-check=full ./pg_install/v16/bin/pgbench -f test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql -c100 -j20 -T900 -P2 --verbose-errors "postgresql://neondb_owner:secret@ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build/neondb?sslmode=require"
==33432== Memcheck, a memory error detector
==33432== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==33432== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==33432== Command: ./pg_install/v16/bin/pgbench -f test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql -c100 -j20 -T900 -P2 --verbose-errors postgresql://neondb_owner:secret@ep-late-resonance-w2i0q5qu.us-east-2.aws.neon.build/neondb?sslmode=require
==33432== 
pgbench (16.3 (b810fdfcbb59afea7ea7bbe0cf94eaccb55a2ea2))
starting vacuum...pgbench: error: ERROR:  relation "pgbench_branches" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
pgbench: error: ERROR:  relation "pgbench_tellers" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
pgbench: error: ERROR:  relation "pgbench_history" does not exist
pgbench: detail: (ignoring this error and continuing anyway)
end.
progress: 6.3 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 8.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 10.0 s, 0.5 tps, lat 3430.123 ms stddev 0.000, 0 failed
progress: 12.0 s, 4.0 tps, lat 5338.579 ms stddev 602.942, 0 failed
progress: 14.0 s, 6.5 tps, lat 7092.010 ms stddev 302.845, 0 failed
...

Given these observations, I guess it is a race condition: maybe connection threads 2-100 are started before libpq has fully initialized the OpenSSL library.
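To illustrate the kind of race suspected here, a minimal sketch follows. It is not libpq or OpenSSL code: `lib_global_init` is a hypothetical stand-in for a library's one-time global setup, and `worker`, `run_workers`, and `NUM_THREADS` are likewise made up for illustration. It shows how `pthread_once` makes every worker wait until the setup has run exactly once.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 20              /* mirrors -j20, purely illustrative */

static pthread_once_t lib_once = PTHREAD_ONCE_INIT;
static int init_calls = 0;          /* how many times the initializer ran */
static int lib_ready = 0;           /* set once global setup is complete */

/* Hypothetical stand-in for a library's one-time global setup. */
static void lib_global_init(void)
{
    init_calls++;
    lib_ready = 1;
}

/* Each worker requests initialization before touching library state. */
static void *worker(void *arg)
{
    (void) arg;
    pthread_once(&lib_once, lib_global_init);
    assert(lib_ready);              /* pthread_once returns only after init finished */
    return NULL;
}

/* Starts the workers, waits for them, and returns how often the
 * initializer ran (or -1 if a thread could not be created). */
int run_workers(void)
{
    pthread_t threads[NUM_THREADS];
    int created = 0;
    int failed = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, worker, NULL) != 0) {
            failed = 1;
            break;
        }
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);

    return failed ? -1 : init_calls;
}
```

Calling `run_workers()` from a `main` function (built with `-pthread`) reports that the initializer ran once, no matter which thread got there first. If the guard were a plain unsynchronized flag instead, two threads could enter the setup at the same time, which is the sort of hazard this hypothesis is about.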

Running pgbench under valgrind --tool=helgrind pgbench... produced the following interesting output:

==33550== ---Thread-Announcement------------------------------------------
==33550== 
==33550== Thread #7 was created
==33550==    at 0x4D65FEF: clone (clone.S:61)
==33550==    by 0x4C691B7: create_thread (createthread.c:101)
==33550==    by 0x4C6AC9B: pthread_create@@GLIBC_2.17 (pthread_create.c:817)
==33550==    by 0x48509A3: pthread_create_WRK (hg_intercepts.c:425)
==33550==    by 0x203C8F: main (pgbench.c:7257)
==33550== 
==33550== ----------------------------------------------------------------
==33550== 
==33550== Possible data race during write of size 8 at 0x4BA17B8 by thread #10
==33550== Locks held: none
==33550==    at 0x4AD39EC: OPENSSL_init_crypto (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49D926B: OPENSSL_init_ssl (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49DCB57: SSL_CTX_new (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49B4E3B: initialize_SSL (fe-secure-openssl.c:940)
==33550==    by 0x49B413F: pgtls_open_client (fe-secure-openssl.c:134)
==33550==    by 0x49AF8BB: pqsecure_open_client (fe-secure.c:178)
==33550==    by 0x499A8AB: PQconnectPoll (fe-connect.c:3413)
==33550==    by 0x499973F: connectDBComplete (fe-connect.c:2511)
==33550==    by 0x4996ADB: PQconnectdbParams (fe-connect.c:685)
==33550==    by 0x1F6A03: doConnect (pgbench.c:1560)
==33550==    by 0x2041FB: threadRun (pgbench.c:7384)
==33550==    by 0x4850B77: mythread_wrapper (hg_intercepts.c:387)
==33550== 
==33550== This conflicts with a previous write of size 8 by thread #7
==33550== Locks held: none
==33550==    at 0x4AD39EC: OPENSSL_init_crypto (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49D926B: OPENSSL_init_ssl (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49E36AB: SSL_SESSION_new (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49E3D1B: ssl_get_new_session (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49EFDD3: tls_construct_client_hello (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49EF6AF: state_machine.part.0 (in /home/nonroot/neon/pg_install/v16/lib/libpq.so.5.16)
==33550==    by 0x49B5BE3: open_client_SSL (fe-secure-openssl.c:1480)
==33550==    by 0x49B415F: pgtls_open_client (fe-secure-openssl.c:143)
==33550==  Address 0x4ba17b8 is 0 bytes inside data symbol "conf_settings"

Environment

Staging

Logs, links

Additional considerations

I am not happy that we are using openssl 1.1.1w because

https://www.openssl.org/source/

states:

Note: The latest stable version is the 3.3 series supported until 9th April 2026. Also available is the 3.2 series supported until 23rd November 2025, the 3.1 series supported until 14th March 2025, and the 3.0 series which is a Long Term Support (LTS) version and is supported until 7th September 2026. All older versions (including 1.1.1, 1.1.0, 1.0.2, 1.0.0 and 0.9.8) are now out of support and should not be used. Users of these older versions are encouraged to upgrade to 3.2 or 3.0 as soon as possible. Extended support for 1.1.1 and 1.0.2 to gain access to security fixes for those versions is available.

I think we should use the 3.0 series, specifically version 3.0.14, which is a Long Term Support (LTS) version and is supported until 7th September 2026. @hlinnaka you may have guidance on which version to use?

Bodobolero commented 4 months ago

Project to use in the connection string: steep-flower-78097288

I provided the connection string in Slack: https://neondb.slack.com/archives/C04DGM6SMTM/p1720697873106209?thread_ts=1719586520.958489&cid=C04DGM6SMTM

ololobus commented 4 months ago

Tristan will look into this later this week

Bodobolero commented 4 months ago

Using openssl 3.0.14, I could get a SIGABRT with a backtrace in gdb:

Thread 12 "pgbench" received signal SIGABRT, Aborted.
[Switching to Thread 0xffffd67fc1e0 (LWP 40231)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000fffff7900aa0 in __GI_abort () at abort.c:79
#2  0x0000fffff794d280 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xfffff7a109d8 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x0000fffff79547dc in malloc_printerr (str=str@entry=0xfffff7a0c570 "double free or corruption (out)") at malloc.c:5347
#4  0x0000fffff7955c20 in _int_free (av=0xfffff7a4fa98 <main_arena>, p=0xffffc000a670, have_lock=<optimized out>) at malloc.c:4314
#5  0x0000fffff7e119b8 in ERR_pop_to_mark () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#6  0x0000fffff7d1d754 in ssl_evp_cipher_fetch () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#7  0x0000fffff7d11bac in ssl_load_ciphers () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#8  0x0000fffff7d1e75c in SSL_CTX_new_ex () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#9  0x0000fffff7cf9e68 in initialize_SSL (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
#10 0x0000fffff7cf9148 in pgtls_open_client (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
#11 0x0000fffff7cf48bc in pqsecure_open_client (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
#12 0x0000fffff7cdf8ac in PQconnectPoll (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
#13 0x0000fffff7cde740 in connectDBComplete (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#14 0x0000fffff7cdbadc in PQconnectdbParams (keywords=0xffffd67fb350, values=0xffffd67fb388, expand_dbname=1) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#15 0x0000aaaaaabfea04 in doConnect () at /home/nonroot/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#16 0x0000aaaaaac0c1fc in threadRun (arg=0xaaaaaaee6540) at /home/nonroot/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#17 0x0000fffff7a5b648 in start_thread (arg=0xffffd67fbae0) at pthread_create.c:477
#18 0x0000fffff79b201c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

tristan957 commented 4 months ago

All I've figured out so far is that the problem is much more likely to occur with a higher number of clients and jobs.

tristan957 commented 4 months ago

Why do we use openssl 1.1.1w? Why not the 3.2 series?

Bodobolero commented 4 months ago

Why do we use openssl 1.1.1w? Why not the 3.2 series?

I think debian bullseye and bookworm (our deployment Linux in prod for pageserver and safekeeper) still use openssl 1.1.1 and backport security patches in their distribution. However, since we are now statically linking, I agree that we should use a newer version of openssl. The current LTS train of openssl is the 3.0.x train, which has the longest support cycle, while 3.2 and 3.3 have shorter support cycles. That is why I suggested using 3.0.14. Did you have successful runs with 3.2 or 3.3? @tristan957 Asking because 3.0.14 didn't resolve the issue for me.

tristan957 commented 4 months ago

I'm gonna create a better dev env for this today. Going to install 3.2 to test out and see what happens. Also going to play around with vanilla today too.

Overall I find this very strange. Given we don't patch libpq or pgbench as far as I'm aware, I'm extremely confused.

Bodobolero commented 4 months ago

In the meantime, we have deployed temporary workaround PRs: https://github.com/neondatabase/neon/pull/8422 https://github.com/neondatabase/neon/pull/8429

Bodobolero commented 4 months ago

Given we don't patch libpq or pgbench as far as I'm aware, I'm extremely confused.

@tristan957 I think what is different though than in vanilla is that since a few weeks we try to build all binaries statically linked with openssl. I don't know if anyone else is doing that for pgbench AND running it with -c 100 and -j 20 (high probability of races). At least the "official" deb and ubuntu images link openssl as a shared library.

tristan957 commented 4 months ago

Right. I want to try statically compiling vanilla with static openssl and icu. Because this could easily be an upstream OpenSSL bug.

tristan957 commented 4 months ago

I got this from openssl 3.2.2, which is what Peter was getting in 3.0.14. Going to spend some time in a debugger trying to figure out how this error manifests.

LD_PRELOAD=/usr/lib64/libasan.so.8.0.0 ./pg_install/v16/bin/pgbench -c100 -j20 -T900 -P2 --verbose-errors '$CONNSTR'
pgbench (16.3 (b39f316137fdd29e2da15d2af2fdd1cfd18163be))
starting vacuum...end.
=================================================================
==100871==ERROR: AddressSanitizer: attempting double-free on 0x5030000561a0 in thread T8:
    #0 0x7f2fcd9e9638 in free.part.0 (/usr/lib64/libasan.so.8.0.0+0xf6638) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #1 0x7f2fcd2d7dfa in CRYPTO_free crypto/mem.c:282
    #2 0x7f2fcd2b917a in err_clear crypto/err/err_local.h:91
    #3 0x7f2fcd2b927c in ERR_pop_to_mark crypto/err/err_mark.c:39
    #4 0x7f2fcd64bfbe in ssl_evp_cipher_fetch ssl/ssl_lib.c:7176
    #5 0x7f2fcd6361c4 in ssl_load_ciphers ssl/ssl_ciph.c:333
    #6 0x7f2fcd644292 in SSL_CTX_new_ex ssl/ssl_lib.c:3906
    #7 0x7f2fcd644967 in SSL_CTX_new ssl/ssl_lib.c:4092
    #8 0x7f2fcd2a2498 in initialize_SSL /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
    #9 0x7f2fcd2a177a in pgtls_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
    #10 0x7f2fcd29ce2e in pqsecure_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
    #11 0x7f2fcd2881fa in PQconnectPoll /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
    #12 0x7f2fcd286ba4 in connectDBComplete /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
    #13 0x7f2fcd283bf6 in PQconnectdbParams /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
    #14 0x40a21a in doConnect /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
    #15 0x416af6 in threadRun /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
    #16 0x7f2fcd950f95 in asan_thread_start(void*) (/usr/lib64/libasan.so.8.0.0+0x5df95) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #17 0x7f2fccfc2506 in start_thread (/lib64/libc.so.6+0x97506) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #18 0x7f2fcd04640b in clone3 (/lib64/libc.so.6+0x11b40b) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)

0x5030000561a0 is located 0 bytes inside of 23-byte region [0x5030000561a0,0x5030000561b7)
freed by thread T12 here:
    #0 0x7f2fcd9e9638 in free.part.0 (/usr/lib64/libasan.so.8.0.0+0xf6638) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #1 0x7f2fcd2d7dfa in CRYPTO_free crypto/mem.c:282
    #2 0x7f2fcd2b8c21 in err_clear crypto/err/err_local.h:91
    #3 0x7f2fcd2b8cb8 in ERR_new crypto/err/err_blocks.c:26
    #4 0x7f2fcd2bdb0a in inner_evp_generic_fetch crypto/evp/evp_fetch.c:355
    #5 0x7f2fcd2bdc09 in evp_generic_fetch crypto/evp/evp_fetch.c:378
    #6 0x7f2fcd479d93 in EVP_CIPHER_fetch crypto/evp/evp_enc.c:1717
    #7 0x7f2fcd64bfb5 in ssl_evp_cipher_fetch ssl/ssl_lib.c:7175
    #8 0x7f2fcd6361c4 in ssl_load_ciphers ssl/ssl_ciph.c:333
    #9 0x7f2fcd644292 in SSL_CTX_new_ex ssl/ssl_lib.c:3906
    #10 0x7f2fcd644967 in SSL_CTX_new ssl/ssl_lib.c:4092
    #11 0x7f2fcd2a2498 in initialize_SSL /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
    #12 0x7f2fcd2a177a in pgtls_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
    #13 0x7f2fcd29ce2e in pqsecure_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
    #14 0x7f2fcd2881fa in PQconnectPoll /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
    #15 0x7f2fcd286ba4 in connectDBComplete /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
    #16 0x7f2fcd283bf6 in PQconnectdbParams /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
    #17 0x40a21a in doConnect /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
    #18 0x416af6 in threadRun /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
    #19 0x7f2fcd950f95 in asan_thread_start(void*) (/usr/lib64/libasan.so.8.0.0+0x5df95) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)

previously allocated by thread T13 here:
    #0 0x7f2fcd9ea997 in malloc (/usr/lib64/libasan.so.8.0.0+0xf7997) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #1 0x7f2fcd2d7b6e in CRYPTO_malloc crypto/mem.c:202
    #2 0x7f2fcd2b89d5 in err_set_debug crypto/err/err_local.h:60
    #3 0x7f2fcd2b8d07 in ERR_set_debug crypto/err/err_blocks.c:37
    #4 0x7f2fcd2bdb28 in inner_evp_generic_fetch crypto/evp/evp_fetch.c:355
    #5 0x7f2fcd2bdc09 in evp_generic_fetch crypto/evp/evp_fetch.c:378
    #6 0x7f2fcd479d93 in EVP_CIPHER_fetch crypto/evp/evp_enc.c:1717
    #7 0x7f2fcd64bfb5 in ssl_evp_cipher_fetch ssl/ssl_lib.c:7175
    #8 0x7f2fcd6361c4 in ssl_load_ciphers ssl/ssl_ciph.c:333
    #9 0x7f2fcd644292 in SSL_CTX_new_ex ssl/ssl_lib.c:3906
    #10 0x7f2fcd644967 in SSL_CTX_new ssl/ssl_lib.c:4092
    #11 0x7f2fcd2a2498 in initialize_SSL /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
    #12 0x7f2fcd2a177a in pgtls_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
    #13 0x7f2fcd29ce2e in pqsecure_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
    #14 0x7f2fcd2881fa in PQconnectPoll /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
    #15 0x7f2fcd286ba4 in connectDBComplete /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
    #16 0x7f2fcd283bf6 in PQconnectdbParams /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
    #17 0x40a21a in doConnect /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
    #18 0x416af6 in threadRun /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
    #19 0x7f2fcd950f95 in asan_thread_start(void*) (/usr/lib64/libasan.so.8.0.0+0x5df95) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)

Thread T8 created by T0 here:
    #0 0x7f2fcd9e2871 in pthread_create (/usr/lib64/libasan.so.8.0.0+0xef871) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #1 0x4165a8 in main /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7257
    #2 0x7f2fccf55087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #3 0x7f2fccf5514a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #4 0x405024 in _start (/home/tristan957/Projects/work/neon/pg_install/v16/bin/pgbench+0x405024) (BuildId: 367fdc1c3d7ec9279f4ddf0e20a659b17dca462e)

Thread T12 created by T0 here:
    #0 0x7f2fcd9e2871 in pthread_create (/usr/lib64/libasan.so.8.0.0+0xef871) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #1 0x4165a8 in main /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7257
    #2 0x7f2fccf55087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #3 0x7f2fccf5514a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #4 0x405024 in _start (/home/tristan957/Projects/work/neon/pg_install/v16/bin/pgbench+0x405024) (BuildId: 367fdc1c3d7ec9279f4ddf0e20a659b17dca462e)

Thread T13 created by T0 here:
    #0 0x7f2fcd9e2871 in pthread_create (/usr/lib64/libasan.so.8.0.0+0xef871) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
    #1 0x4165a8 in main /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7257
    #2 0x7f2fccf55087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #3 0x7f2fccf5514a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #4 0x405024 in _

SUMMARY: AddressSanitizer: double-free (/usr/lib64/libasan.so.8.0.0+0xf6638) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d) in free.part.0
==100871==ABORTING

tristan957 commented 4 months ago

The data that openssl is freeing is in manually implemented thread local storage.

static CRYPTO_ONCE err_init = CRYPTO_ONCE_STATIC_INIT;
static int set_err_thread_local;
static CRYPTO_THREAD_LOCAL err_thread_local;

DEFINE_RUN_ONCE_STATIC(err_do_init)
{
    set_err_thread_local = 1;
    return CRYPTO_THREAD_init_local(&err_thread_local, NULL);
}

static void *thread_local_storage[OPENSSL_CRYPTO_THREAD_LOCAL_KEY_MAX];

int CRYPTO_THREAD_init_local(CRYPTO_THREAD_LOCAL *key, void (*cleanup)(void *))
{
    static unsigned int thread_local_key = 0;

    if (thread_local_key >= OPENSSL_CRYPTO_THREAD_LOCAL_KEY_MAX)
        return 0;

    *key = thread_local_key++;

    thread_local_storage[*key] = NULL;

    return 1;
}

void *CRYPTO_THREAD_get_local(CRYPTO_THREAD_LOCAL *key)
{
    if (*key >= OPENSSL_CRYPTO_THREAD_LOCAL_KEY_MAX)
        return NULL;

    return thread_local_storage[*key];
}

So looking at the stack traces I posted, there are 3 different threads at play (wtf). The memory is allocated in one thread and then freed twice, in two other threads. I don't understand how openssl is guaranteeing that the thread_local_storage array is actually thread-local.
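One way to see why this matters, as a sketch and not OpenSSL's code: if a "thread-local" slot is really a single static variable, as the thread_local_storage array quoted above appears to be, every thread reads and writes the same pointer. The example contrasts such a shared slot with a real per-thread slot obtained from `pthread_key_create`. The names `shared_slot`, `real_key`, `thread_a`, `thread_b`, and `observe_slots` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* "Thread-local" slot implemented as a plain static: shared by all threads. */
static void *shared_slot = NULL;

/* Real per-thread slot managed by pthreads. */
static pthread_key_t real_key;

static int value_a = 42;            /* data that thread A considers its own */

struct seen {
    void *shared;                   /* what thread B finds in the shared slot */
    void *real;                     /* what thread B finds in the pthread slot */
};

/* Thread A stores its pointer in both kinds of slot. */
static void *thread_a(void *arg)
{
    (void) arg;
    shared_slot = &value_a;
    pthread_setspecific(real_key, &value_a);
    return NULL;
}

/* Thread B never stores anything; it only looks. */
static void *thread_b(void *arg)
{
    struct seen *s = arg;
    s->shared = shared_slot;
    s->real = pthread_getspecific(real_key);
    return NULL;
}

/* Runs A to completion, then B, and reports what B observed.
 * Returns 0 on success, -1 if a pthread call failed. */
int observe_slots(void **shared_seen, void **real_seen)
{
    pthread_t t;
    struct seen s = { NULL, NULL };

    assert(shared_seen != NULL && real_seen != NULL);
    if (pthread_key_create(&real_key, NULL) != 0)
        return -1;

    if (pthread_create(&t, NULL, thread_a, NULL) != 0) {
        pthread_key_delete(real_key);
        return -1;
    }
    pthread_join(t, NULL);          /* sequential on purpose: keeps the demo deterministic */

    if (pthread_create(&t, NULL, thread_b, &s) != 0) {
        pthread_key_delete(real_key);
        return -1;
    }
    pthread_join(t, NULL);

    pthread_key_delete(real_key);
    *shared_seen = s.shared;
    *real_seen = s.real;
    return 0;
}
```

Thread B finds thread A's pointer in the shared slot but gets NULL from the pthread-managed slot. If both threads treated the shared pointer as their own and freed it, the result would be a double free of the kind ASan reports above.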

hlinnaka commented 4 months ago

I wonder if this upstream buildfarm failure is related: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cisticola&dt=2024-07-29%2016%3A20%3A36. It looks like it started failing 7 days ago, with no apparent code changes. Perhaps an OS update happened? It's interesting that it involves the same ERR_set_debug function.

tristan957 commented 4 months ago

That error occurs if you want to compile against openssl 3.X, but your toolchain picks up openssl 1.X. Those functions were added in the 3.X series. Don't ask me how I know :smiling_face_with_tear:

tristan957 commented 4 months ago

From the configure log

checking for openssl... /usr/bin/openssl
configure: using openssl: OpenSSL 1.1.1g FIPS  21 Apr 2020

tristan957 commented 4 months ago

I can recreate the segfault in vanilla PG, which is not what Peter said. I'm thinking there was just an issue in his environment.

This makes sense since I don't think we patch libpq or pgbench at all.

tristan957 commented 4 months ago

It's always the most obvious issue. OpenSSL just doesn't support multithreading when statically compiled.

Relevant code: https://github.com/openssl/openssl/blob/07e4d7f4747005e3ce56423182ad047eb05d8e16/Configure#L1469-L1471

Related issue: https://github.com/openssl/openssl/issues/14574

This is an issue that upstream is willing to accept a contribution for: https://github.com/openssl/openssl/issues/14574#issuecomment-2257083626. @ololobus is this something I can spend some time on?
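A quick way to check how a given set of OpenSSL headers was configured, offered as a hedged sketch: OpenSSL defines the `OPENSSL_THREADS` macro via `openssl/opensslconf.h` when thread support is enabled. The function name `openssl_headers_have_threads` and the `HAVE_OPENSSL_CONF_H` macro are made up for this example, and the `__has_include` guard (a GCC/Clang feature) is there only so the snippet still compiles where the headers are missing. Note that this inspects the headers found at compile time, which may not match the library that actually gets linked.

```c
/* Pull in OpenSSL's configuration header only if it is available. */
#if defined(__has_include)
#  if __has_include(<openssl/opensslconf.h>)
#    include <openssl/opensslconf.h>
#    define HAVE_OPENSSL_CONF_H 1   /* hypothetical helper macro */
#  endif
#endif

/* Returns 1 if the OpenSSL headers advertise thread support,
 * 0 if they do not, and -1 if no OpenSSL headers were found. */
int openssl_headers_have_threads(void)
{
#if !defined(HAVE_OPENSSL_CONF_H)
    return -1;
#elif defined(OPENSSL_THREADS)
    return 1;
#else
    return 0;
#endif
}
```

A result of 0 for the headers used to build libpq would be consistent with the explanation above; a result of 1 would suggest looking elsewhere.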

Bodobolero commented 4 months ago

I can recreate the segfault in vanilla PG, which is not what Peter said.

I think I didn't try vanilla with static linking. What I intended to say is that the error doesn't show up if you use vanilla as it is built in the distributed binaries (with shared load libraries for openssl). See my comment above on this.

ololobus commented 4 months ago

This is an issue that upstream is willing to accept a contribution for: https://github.com/openssl/openssl/issues/14574#issuecomment-2257083626. @ololobus is this something I can spend some time on?

To answer this question, let me clarify how we ended up debugging this issue and the context.

So the root cause is that statically compiled OpenSSL doesn't support multi-threading, right?

If yes, then the next question is, why is this important for us? My understanding is that Postgres doesn't use multi-threading anyway, so it's not a problem there. For client libraries like pgbench, why do we want to compile them statically at all? We do not use them in prod. For Postgres itself it does make sense -- we want to package redo Postgres so that it is independent of the host system -- but for client libraries it's not critical; we can even install them from some standard system packages.

So my suggestion is to just stop doing this for client binaries

what is different though than in vanilla is that since a few weeks we try to build all binaries statically linked with openssl

and that should solve the problem? If yes, I'd consider the other work like replication observability and tests a higher priority

ololobus commented 4 months ago

@Bodobolero @bayandin based on the investigation and above comment, and that fixing this requires upstream work, I'm putting this on pause. I think we have a workaround -- just do not build pgbench / client libs statically

Bodobolero commented 4 months ago

@ololobus @bayandin Who should be the DRI (in the compute team?) to build the postgres binaries in neon artifacts with dynamic load lib? I think this is the logical follow-up. Currently, the compute image and neon artifacts contain statically linked binaries.

ololobus commented 4 months ago

Who should be the DRI (in the compute team?) to build the postgres binaries in neon artifacts with dynamic load lib

I think I still don't get the objective to answer who is the DRI for what :)

In this task I see that it's some dev container under discussion, why cannot we just install these packages:

That should give us psql, pgbench and other client libs

Bodobolero commented 3 months ago

The problem is that we build all postgres binaries with static openssl and upload them as neon artifacts (including psql and pgbench) to an S3 bucket:

https://github.com/neondatabase/neon/blob/859f01918529d5e6547ac4ff8e05a4e5775520a2/.github/workflows/_build-and-test-locally.yml#L139-L140

So far other workflows use these neon artifacts (including pgbench and psql) to run their jobs. If these binaries are broken now because we changed from shared load libraries to static openssl library we can no longer use them. This means whoever initiated the change to use static library should fix the broken workflows or talk to the owners of these.

@bayandin @ololobus

ololobus commented 3 months ago

This means whoever initiated the change to use static library should fix the broken workflows or talk to the owners of these.

Yeah, do you know where this dynamic to static build transition project is tracked? I think I have nearly zero context on it. I actually thought that it was more of a long-term plan, not something that we started doing right away

Bodobolero commented 3 months ago

I think the changes were introduced in https://github.com/neondatabase/neon/pull/8074

bayandin commented 3 months ago

So far other workflows use these neon artifacts (including pgbench and psql) to run their jobs.

I guess we can swap it with the system postgresql-client package. It'll require some changes in tests. Currently, tests rely on binaries that are in pg_install/${PG_VERSION}/bin, but afair we don't have any patches for them (the downside being that it'll use one version of psql/pgbench for different versions of Postgres).

Bodobolero commented 3 months ago

I guess we can swap it with the system postgresql-client package.

For some of the tests these changes will be quite expensive. Some tests, e.g., run on debian bullseye which only supports VERY outdated system packages (if you don't build from source); I guess some don't even support the sslmode=require connection attribute yet.

bayandin commented 3 months ago

run on debian bullseye which only supports VERY outdated system packages

We can install the latest version from Postgres' apt repo: https://www.postgresql.org/download/linux/debian/

Bodobolero commented 3 months ago

We can install the latest version from Postgres' apt repo

Yes this is our current work-around, see https://github.com/neondatabase/neon/blob/859f01918529d5e6547ac4ff8e05a4e5775520a2/.github/workflows/benchmarking.yml#L469

It is a bit complicated because we run in container without sudo privileges so we can not "install" the postgres packages from the apt repo

bayandin commented 3 months ago

It is a bit complicated because we run in container without sudo privileges so we can not "install" the postgres packages from the apt repo

We can add it to build-tools image

ololobus commented 1 week ago

@Bodobolero @bayandin is it still a problem? Or do we use some workaround?

Bodobolero commented 1 week ago

It is still a problem that

We have the following workaround for our pgvector benchmark: https://github.com/neondatabase/neon/blob/e51cf6157b2a25907dd5b7c442f838af5cdbf54a/.github/workflows/benchmarking.yml#L561 TLDR: we build pgbench ourselves from postgres sources in the benchmarking workflow instead of using pgbench from Neon artifacts

bayandin commented 1 week ago

We found a workaround (install pgbench from deb packages), but it complicated the workflow. Also, we can't stay on openssl1.1 forever, so I think we need to find a proper solution for this.

ololobus commented 1 week ago

OK, thanks for the replies. Since the issue is with the upstream library -- openssl, we currently do not have plans or the capacity to work on it. I'm still moving it to Selected instead of Backlog because it seems to be a good item to contribute, but just want to make it explicit that it's not a team priority at this moment