Open bayandin opened 4 months ago
Project to use in connection string steep-flower-78097288
Provided connection string in slack https://neondb.slack.com/archives/C04DGM6SMTM/p1720697873106209?thread_ts=1719586520.958489&cid=C04DGM6SMTM
Tristan will look into this later this week
Using OpenSSL 3.0.14 I could get a SIGABRT with a backtrace using gdb:
Thread 12 "pgbench" received signal SIGABRT, Aborted.
[Switching to Thread 0xffffd67fc1e0 (LWP 40231)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000fffff7900aa0 in __GI_abort () at abort.c:79
#2 0x0000fffff794d280 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0xfffff7a109d8 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x0000fffff79547dc in malloc_printerr (str=str@entry=0xfffff7a0c570 "double free or corruption (out)") at malloc.c:5347
#4 0x0000fffff7955c20 in _int_free (av=0xfffff7a4fa98 <main_arena>, p=0xffffc000a670, have_lock=<optimized out>) at malloc.c:4314
#5 0x0000fffff7e119b8 in ERR_pop_to_mark () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#6 0x0000fffff7d1d754 in ssl_evp_cipher_fetch () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#7 0x0000fffff7d11bac in ssl_load_ciphers () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#8 0x0000fffff7d1e75c in SSL_CTX_new_ex () from /home/nonroot/neon/pg_install/v16/lib/libpq.so.5
#9 0x0000fffff7cf9e68 in initialize_SSL (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
#10 0x0000fffff7cf9148 in pgtls_open_client (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
#11 0x0000fffff7cf48bc in pqsecure_open_client (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
#12 0x0000fffff7cdf8ac in PQconnectPoll (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
#13 0x0000fffff7cde740 in connectDBComplete (conn=0xffffc0000ba0) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#14 0x0000fffff7cdbadc in PQconnectdbParams (keywords=0xffffd67fb350, values=0xffffd67fb388, expand_dbname=1) at /home/nonroot/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#15 0x0000aaaaaabfea04 in doConnect () at /home/nonroot/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#16 0x0000aaaaaac0c1fc in threadRun (arg=0xaaaaaaee6540) at /home/nonroot/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#17 0x0000fffff7a5b648 in start_thread (arg=0xffffd67fbae0) at pthread_create.c:477
#18 0x0000fffff79b201c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
All I've figured out so far is that the problem is much more likely to occur with a higher number of clients and jobs.
Why do we use openssl 1.1.1w? Why not the 3.2 series?
Why do we use openssl 1.1.1w? Why not the 3.2 series?
I think Debian bullseye and bookworm (our production deployment Linux for pageserver and safekeeper) still use OpenSSL 1.1.1 and backport security patches in their distribution. However, since we are now statically linking, I agree that we should use a newer version of OpenSSL. The current LTS train of OpenSSL is the 3.0.x series, which has the longest support cycle, while 3.2 and 3.3 have shorter release cycles. That is why I suggested using 3.0.14. Did you have successful runs with 3.2 or 3.3, @tristan957? Asking because 3.0.14 didn't resolve the issue for me.
I'm gonna create a better dev env for this today. Going to install 3.2 to test out and see what happens. Also going to play around with vanilla today too.
Overall I find this very strange. Given we don't patch libpq or pgbench as far as I'm aware, I'm extremely confused.
In the meantime we have deployed temporary workaround PRs: https://github.com/neondatabase/neon/pull/8422 and https://github.com/neondatabase/neon/pull/8429
Given we don't patch libpq or pgbench as far as I'm aware, I'm extremely confused.
@tristan957 I think what is different from vanilla is that for a few weeks now we have been building all binaries statically linked with OpenSSL. I don't know if anyone else is doing that for pgbench AND running it with -c 100 and -j 20 (high probability of races). At least the "official" Debian and Ubuntu images use OpenSSL as shared libraries.
Right. I want to try compiling vanilla with static OpenSSL and ICU, because this could easily be an upstream OpenSSL bug.
I got this from OpenSSL 3.2.2; it's the same thing Peter was getting with 3.0.14. Going to spend some time in a debugger trying to figure out how this error manifests.
LD_PRELOAD=/usr/lib64/libasan.so.8.0.0 ./pg_install/v16/bin/pgbench -c100 -j20 -T900 -P2 --verbose-errors '$CONNSTR'
pgbench (16.3 (b39f316137fdd29e2da15d2af2fdd1cfd18163be))
starting vacuum...end.
=================================================================
==100871==ERROR: AddressSanitizer: attempting double-free on 0x5030000561a0 in thread T8:
#0 0x7f2fcd9e9638 in free.part.0 (/usr/lib64/libasan.so.8.0.0+0xf6638) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#1 0x7f2fcd2d7dfa in CRYPTO_free crypto/mem.c:282
#2 0x7f2fcd2b917a in err_clear crypto/err/err_local.h:91
#3 0x7f2fcd2b927c in ERR_pop_to_mark crypto/err/err_mark.c:39
#4 0x7f2fcd64bfbe in ssl_evp_cipher_fetch ssl/ssl_lib.c:7176
#5 0x7f2fcd6361c4 in ssl_load_ciphers ssl/ssl_ciph.c:333
#6 0x7f2fcd644292 in SSL_CTX_new_ex ssl/ssl_lib.c:3906
#7 0x7f2fcd644967 in SSL_CTX_new ssl/ssl_lib.c:4092
#8 0x7f2fcd2a2498 in initialize_SSL /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
#9 0x7f2fcd2a177a in pgtls_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
#10 0x7f2fcd29ce2e in pqsecure_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
#11 0x7f2fcd2881fa in PQconnectPoll /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
#12 0x7f2fcd286ba4 in connectDBComplete /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#13 0x7f2fcd283bf6 in PQconnectdbParams /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#14 0x40a21a in doConnect /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#15 0x416af6 in threadRun /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#16 0x7f2fcd950f95 in asan_thread_start(void*) (/usr/lib64/libasan.so.8.0.0+0x5df95) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#17 0x7f2fccfc2506 in start_thread (/lib64/libc.so.6+0x97506) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#18 0x7f2fcd04640b in clone3 (/lib64/libc.so.6+0x11b40b) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
0x5030000561a0 is located 0 bytes inside of 23-byte region [0x5030000561a0,0x5030000561b7)
freed by thread T12 here:
#0 0x7f2fcd9e9638 in free.part.0 (/usr/lib64/libasan.so.8.0.0+0xf6638) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#1 0x7f2fcd2d7dfa in CRYPTO_free crypto/mem.c:282
#2 0x7f2fcd2b8c21 in err_clear crypto/err/err_local.h:91
#3 0x7f2fcd2b8cb8 in ERR_new crypto/err/err_blocks.c:26
#4 0x7f2fcd2bdb0a in inner_evp_generic_fetch crypto/evp/evp_fetch.c:355
#5 0x7f2fcd2bdc09 in evp_generic_fetch crypto/evp/evp_fetch.c:378
#6 0x7f2fcd479d93 in EVP_CIPHER_fetch crypto/evp/evp_enc.c:1717
#7 0x7f2fcd64bfb5 in ssl_evp_cipher_fetch ssl/ssl_lib.c:7175
#8 0x7f2fcd6361c4 in ssl_load_ciphers ssl/ssl_ciph.c:333
#9 0x7f2fcd644292 in SSL_CTX_new_ex ssl/ssl_lib.c:3906
#10 0x7f2fcd644967 in SSL_CTX_new ssl/ssl_lib.c:4092
#11 0x7f2fcd2a2498 in initialize_SSL /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
#12 0x7f2fcd2a177a in pgtls_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
#13 0x7f2fcd29ce2e in pqsecure_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
#14 0x7f2fcd2881fa in PQconnectPoll /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
#15 0x7f2fcd286ba4 in connectDBComplete /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#16 0x7f2fcd283bf6 in PQconnectdbParams /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#17 0x40a21a in doConnect /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#18 0x416af6 in threadRun /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#19 0x7f2fcd950f95 in asan_thread_start(void*) (/usr/lib64/libasan.so.8.0.0+0x5df95) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
previously allocated by thread T13 here:
#0 0x7f2fcd9ea997 in malloc (/usr/lib64/libasan.so.8.0.0+0xf7997) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#1 0x7f2fcd2d7b6e in CRYPTO_malloc crypto/mem.c:202
#2 0x7f2fcd2b89d5 in err_set_debug crypto/err/err_local.h:60
#3 0x7f2fcd2b8d07 in ERR_set_debug crypto/err/err_blocks.c:37
#4 0x7f2fcd2bdb28 in inner_evp_generic_fetch crypto/evp/evp_fetch.c:355
#5 0x7f2fcd2bdc09 in evp_generic_fetch crypto/evp/evp_fetch.c:378
#6 0x7f2fcd479d93 in EVP_CIPHER_fetch crypto/evp/evp_enc.c:1717
#7 0x7f2fcd64bfb5 in ssl_evp_cipher_fetch ssl/ssl_lib.c:7175
#8 0x7f2fcd6361c4 in ssl_load_ciphers ssl/ssl_ciph.c:333
#9 0x7f2fcd644292 in SSL_CTX_new_ex ssl/ssl_lib.c:3906
#10 0x7f2fcd644967 in SSL_CTX_new ssl/ssl_lib.c:4092
#11 0x7f2fcd2a2498 in initialize_SSL /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:940
#12 0x7f2fcd2a177a in pgtls_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure-openssl.c:134
#13 0x7f2fcd29ce2e in pqsecure_open_client /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-secure.c:178
#14 0x7f2fcd2881fa in PQconnectPoll /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3413
#15 0x7f2fcd286ba4 in connectDBComplete /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#16 0x7f2fcd283bf6 in PQconnectdbParams /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#17 0x40a21a in doConnect /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#18 0x416af6 in threadRun /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#19 0x7f2fcd950f95 in asan_thread_start(void*) (/usr/lib64/libasan.so.8.0.0+0x5df95) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
Thread T8 created by T0 here:
#0 0x7f2fcd9e2871 in pthread_create (/usr/lib64/libasan.so.8.0.0+0xef871) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#1 0x4165a8 in main /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7257
#2 0x7f2fccf55087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#3 0x7f2fccf5514a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#4 0x405024 in _start (/home/tristan957/Projects/work/neon/pg_install/v16/bin/pgbench+0x405024) (BuildId: 367fdc1c3d7ec9279f4ddf0e20a659b17dca462e)
Thread T12 created by T0 here:
#0 0x7f2fcd9e2871 in pthread_create (/usr/lib64/libasan.so.8.0.0+0xef871) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#1 0x4165a8 in main /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7257
#2 0x7f2fccf55087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#3 0x7f2fccf5514a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#4 0x405024 in _start (/home/tristan957/Projects/work/neon/pg_install/v16/bin/pgbench+0x405024) (BuildId: 367fdc1c3d7ec9279f4ddf0e20a659b17dca462e)
Thread T13 created by T0 here:
#0 0x7f2fcd9e2871 in pthread_create (/usr/lib64/libasan.so.8.0.0+0xef871) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d)
#1 0x4165a8 in main /home/tristan957/Projects/work/neon//vendor/postgres-v16/src/bin/pgbench/pgbench.c:7257
#2 0x7f2fccf55087 in __libc_start_call_main (/lib64/libc.so.6+0x2a087) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#3 0x7f2fccf5514a in __libc_start_main_alias_2 (/lib64/libc.so.6+0x2a14a) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
#4 0x405024 in _
SUMMARY: AddressSanitizer: double-free (/usr/lib64/libasan.so.8.0.0+0xf6638) (BuildId: c1431025b5d8af781c22c9ceea71f065c547d32d) in free.part.0
==100871==ABORTING
The data that OpenSSL is freeing lives in manually implemented thread-local storage:
static CRYPTO_ONCE err_init = CRYPTO_ONCE_STATIC_INIT;
static int set_err_thread_local;
static CRYPTO_THREAD_LOCAL err_thread_local;

DEFINE_RUN_ONCE_STATIC(err_do_init)
{
    set_err_thread_local = 1;
    return CRYPTO_THREAD_init_local(&err_thread_local, NULL);
}

static void *thread_local_storage[OPENSSL_CRYPTO_THREAD_LOCAL_KEY_MAX];

int CRYPTO_THREAD_init_local(CRYPTO_THREAD_LOCAL *key, void (*cleanup)(void *))
{
    static unsigned int thread_local_key = 0;

    if (thread_local_key >= OPENSSL_CRYPTO_THREAD_LOCAL_KEY_MAX)
        return 0;

    *key = thread_local_key++;

    thread_local_storage[*key] = NULL;

    return 1;
}

void *CRYPTO_THREAD_get_local(CRYPTO_THREAD_LOCAL *key)
{
    if (*key >= OPENSSL_CRYPTO_THREAD_LOCAL_KEY_MAX)
        return NULL;

    return thread_local_storage[*key];
}
So looking at the stack traces I posted, there are three different threads at play (wtf): the data is allocated in one thread and then freed twice, in two different threads. I don't understand how OpenSSL guarantees that the thread_local_storage array is actually thread-local.
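To make the failure mode concrete, here is a minimal, hypothetical C sketch (made-up names, not OpenSSL code) of what happens when a "thread-local" key is backed by a single process-global array, as in the fallback implementation quoted above: every thread reads and writes the same slot, so one thread can free a pointer another thread still holds, or the same pointer can be freed twice.

/* Hypothetical sketch of the shared-slot problem. Build with
 * `cc -pthread sketch.c -o sketch`; running it under ASan or valgrind
 * should report double frees / use-after-free, because both threads
 * operate on the same slot of the global array. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define KEY_MAX 16

static void *storage[KEY_MAX];      /* shared by ALL threads, not per-thread */
static unsigned int next_key;

static unsigned int err_key;        /* plays the role of err_thread_local */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* "Replace" this thread's error record: both threads hit slot 0 */
        free(storage[err_key]);
        storage[err_key] = strdup("error detail");
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    err_key = next_key++;           /* roughly what CRYPTO_THREAD_init_local() does */
    storage[err_key] = NULL;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(storage[err_key]);
    return 0;
}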
I wonder if this upstream buildfarm failure is related: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cisticola&dt=2024-07-29%2016%3A20%3A36. It looks like it started failing 7 days ago, with no apparent code changes. Perhaps an OS update happened? It's interesting that it involves the same ERR_set_debug function.
That error occurs if you want to compile against openssl 3.X, but your toolchain picks up openssl 1.X. Those functions were added in the 3.X series. Don't ask me how I know :smiling_face_with_tear:
From the configure log
checking for openssl... /usr/bin/openssl
configure: using openssl: OpenSSL 1.1.1g FIPS 21 Apr 2020
I can recreate the segfault in vanilla PG, which is not what Peter said. I'm thinking there was just an issue in his environment.
This makes sense since I don't think we patch libpq or pgbench at all.
It's always the most obvious issue. OpenSSL just doesn't support multithreading when being statically compiled.
Relevant code: https://github.com/openssl/openssl/blob/07e4d7f4747005e3ce56423182ad047eb05d8e16/Configure#L1469-L1471 Related issue: https://github.com/openssl/openssl/issues/14574
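For reference, OpenSSL only defines the OPENSSL_THREADS macro when a build has thread support, so a tiny check program compiled against the same headers as our static build can confirm which variant we actually ship. This is just an illustrative sketch; the file name and build line are made up.

/* check_threads.c -- was this OpenSSL build configured with thread support?
 * Build against the same OpenSSL that pgbench/libpq are statically linked
 * with, e.g. `cc check_threads.c -I"$OPENSSL_DIR/include"`. */
#include <stdio.h>
#include <openssl/opensslconf.h>
#include <openssl/opensslv.h>

int main(void)
{
    printf("OpenSSL headers: %s\n", OPENSSL_VERSION_TEXT);
#if defined(OPENSSL_THREADS)
    puts("thread support: enabled");
#else
    puts("thread support: DISABLED (no-threads build; the global-array "
         "fallback quoted above will be used)");
#endif
    return 0;
}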
This is an issue that upstream is willing to accept a contribution for: https://github.com/openssl/openssl/issues/14574#issuecomment-2257083626. @ololobus is this something I can spend some time on?
I can recreate the segfault in vanilla PG, which is not what Peter said.
I think I didn't try vanilla with static linking. What I intended to say is that the error doesn't show if you use vanilla as it is built in the distributed binaries (with shared OpenSSL libraries). See my comment above on this.
This is an issue that upstream is willing to accept a contribution for: https://github.com/openssl/openssl/issues/14574#issuecomment-2257083626. @ololobus is this something I can spend some time on?
To answer this question, let me clarify how we ended up debugging this issue and the context.
So the root cause is that statically compiled OpenSSL doesn't support multi-threading, right?
If yes, then the next question is why this is important for us. My understanding is that Postgres doesn't use multi-threading anyway, so it's not a problem there. For client binaries like pgbench, why do we want to compile them statically at all? We do not use them in prod. While it does make sense for Postgres itself -- we want to package redo Postgres so that it is independent of the host system -- for client libraries it's not critical; we can even install them with some standard system packages.
So my suggestion is to just stop doing this for client binaries
what is different though than in vanilla is that since a few weeks we try to build all binaries statically linked with openssl
and that should solve the problem? If yes, I'd consider the other work like replication observability and tests a higher priority
@Bodobolero @bayandin based on the investigation and above comment, and that fixing this requires upstream work, I'm putting this on pause. I think we have a workaround -- just do not build pgbench / client libs statically
@ololobus @bayandin Who should be the DRI (in the compute team?) for building the postgres binaries in neon artifacts with dynamically linked libraries? I think this is the logical follow-up. Currently the compute image and neon artifacts contain statically linked binaries.
Who should be the DRI (in the compute team?) to build the postgres binaries in neon artifacts with dynamic load lib
I think I still don't understand the objective well enough to answer who should be the DRI for what :)
In this task I see that it's some dev container under discussion; why can't we just install these packages:
That should give us psql, pgbench, and other client libs.
The problem is that we build all postgres binaries with static OpenSSL and upload them as neon artifacts (including psql and pgbench) to an S3 bucket.
So far other workflows use these neon artifacts (including pgbench and psql) to run their jobs. If these binaries are now broken because we changed from shared libraries to a static OpenSSL library, we can no longer use them. This means whoever initiated the change to the static library should fix the broken workflows or talk to their owners.
@bayandin @ololobus
This means whoever initiated the change to use static library should fix the broken workflows or talk to the owners of these.
Yeah, do you know where this dynamic-to-static build transition project is tracked? I think I have nearly zero context on it. I actually thought it was more of a long-term plan, not something that we started doing right away.
I think the changes were introduced in https://github.com/neondatabase/neon/pull/8074
So far other workflows use these neon artifacts (including pgbench and psql) to run their jobs.
I guess we can swap it with the system postgresql-client package.
It'll require some changes in tests. Currently, tests rely on the binaries in pg_install/${PG_VERSION}/bin, but AFAIR we don't have any patches for them (apart from the fact that it'll use one version of psql/pgbench for different versions of Postgres).
I guess we can swap it with the system postgresql-client package.
For some of the tests these changes will be quite expensive. Some tests, for example, run on Debian bullseye, which only provides VERY outdated system packages (if you don't build from source); I guess some of those even don't support the sslmode=require connection attribute yet.
run on debian bullseye which only supports VERY outdated system packages
We can install the latest version from Postgres' apt repo: https://www.postgresql.org/download/linux/debian/
We can install the latest version from Postgres' apt repo
Yes, this is our current workaround; see https://github.com/neondatabase/neon/blob/859f01918529d5e6547ac4ff8e05a4e5775520a2/.github/workflows/benchmarking.yml#L469
It is a bit complicated because we run in a container without sudo privileges, so we cannot "install" the postgres packages from the apt repo.
It is a bit complicated because we run in container without sudo privileges so we can not "install" the postgres packages from the apt repo
We can add it to the build-tools image.
@Bodobolero @bayandin is it still a problem? Or do we use some workaround?
It is still a problem. We have the following workaround for our pgvector benchmark: https://github.com/neondatabase/neon/blob/e51cf6157b2a25907dd5b7c442f838af5cdbf54a/.github/workflows/benchmarking.yml#L561 TL;DR: we build pgbench ourselves from Postgres sources in the benchmarking workflow instead of using pgbench from Neon artifacts.
We found a workaround (install pgbench from deb packages), but it complicated the workflow. Also, we can't stay on OpenSSL 1.1 forever, so I think we need to find a proper solution for this.
OK, thanks for the replies. Since the issue is with the upstream library -- openssl, we currently do not have plans or the capacity to work on it. I'm still moving it to Selected instead of Backlog because it seems to be a good item to contribute, but just want to make it explicit that it's not a team priority at this moment
Steps to reproduce
1. Use the neondatabase/build-tools:pinned image (Dockerfile)
2. Run pgbench -f test_runner/performance/pgvector/pgbench_custom_script_pgvector_halfvec_queries.sql -c100 -j20 -T900 -P2 --verbose-errors <staging-connstr> from the container
Expected result
It doesn't fail.
Actual result
It fails with SSL error: internal error / SIGSEGV / SIGABRT.
Note: Unfortunately I could not reproduce the error with a debug build of Postgres.
With a release build we get segmentation faults or internal errors.
With a release build under gdb we see:
free(): invalid pointer
Using valgrind the error also doesn't reproduce.
Due to those observations I guess it is a race condition; maybe the connection threads 2-100 are started before the OpenSSL library is fully initialized by libpq.
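One way to probe that hypothesis (purely an experiment, not a proposed fix) would be to force OpenSSL's one-time initialization to happen on the main thread before pgbench spawns its connection threads, for example by calling the public OPENSSL_init_ssl() entry point early in a locally patched pgbench. If the crash still reproduces, it is probably not an initialization-order race. The helper name and call site below are hypothetical.

/* Hypothetical experiment: finish OpenSSL's one-time initialization on the
 * main thread before any connection threads exist. */
#include <stdio.h>
#include <openssl/ssl.h>

static void init_openssl_early(void)
{
    /* OPENSSL_init_ssl() is idempotent; 0/NULL requests the defaults. */
    if (OPENSSL_init_ssl(0, NULL) != 1)
        fprintf(stderr, "warning: OPENSSL_init_ssl() failed\n");
}

/* In a patched pgbench, init_openssl_early() would be called in main()
 * before the pthread_create() loop that starts the client threads
 * (the threadRun() call sites visible in the backtraces above). */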
Got the following interesting output running pgbench with:
valgrind --tool=helgrind pgbench...
Environment
Staging
Logs, links
Additional considerations
I am not happy that we are using OpenSSL 1.1.1w, because https://www.openssl.org/source/ states:
Note: The latest stable version is the 3.3 series supported until 9th April 2026. Also available is the 3.2 series supported until 23rd November 2025, the 3.1 series supported until 14th March 2025, and the 3.0 series which is a Long Term Support (LTS) version and is supported until 7th September 2026. All older versions (including 1.1.1, 1.1.0, 1.0.2, 1.0.0 and 0.9.8) are now out of support and should not be used. Users of these older versions are encouraged to upgrade to 3.2 or 3.0 as soon as possible. Extended support for 1.1.1 and 1.0.2 to gain access to security fixes for those versions is available.
I think we should use the 3.0 series, specifically version 3.0.14, which is a Long Term Support (LTS) version and is supported until 7th September 2026. @hlinnaka you may have guidance on which version to use?