Under load 10 % of pgbench runs fail with "ERROR: [NEON_SMGR] [shard 0] could not establish connection to pageserver"

Bodobolero commented 8 months ago

I run a production-like load in staging with about 10,000 projects created. All projects are deployed on the same pageserver. On average, 130 pgbench compute clients are running and 100 compute clients are idling, waiting for a connection.

In this scenario, about 10% of the pgbench compute clients are aborted.

Error message: ERROR: [NEON_SMGR] [shard 0] could not establish connection to pageserver

Steps to reproduce

I am running some pgbench transactions in a compute:

2024-01-26T15:43:56.662+0100    DEBUG    cloudbench  Will execute PGPASSWORD=<secret>
pgbench -f pgbench_custom_rw_txn.sql@5 -f pgbench_custom_readonly_txn.sql@1 --random-seed=42 -s 1 -c 1 -D neon_sleep_time=2 -D neon_num_aid_keys=500 --protocol=prepared -n 
--time 176 postgresql://peterbendel@ep-still-darkness-98742498.eu-west-1.aws.neon.build:5432/neondb?sslmode=require

Compute log

PG:2024-01-26 14:45:50.506 GMT ttid=07ff82a15f76b2a572d4d5ac55712cd0/fa11255a0f4aa832e5b3798970e919cf [261] LOG:  [NEON_SMGR] [shard 0] dropping connection to page server due to error

Then, after a while, pgbench stops with "Run was aborted":

2024-01-26T15:50:48.366+0100    ERROR    cloudbench  error suspending tenant dry-mode-18447344: output is "pgbench: setting random seed to 42
pgbench (16.1, server 15.5)
pgbench: error: client 0 script 0 aborted in command 6 query 0: ERROR:  [NEON_SMGR] [shard 0] could not establish connection to pageserver
DETAIL:  connection to server at \"pageserver-2.eu-west-1.aws.neon.build\" (10.10.77.2), port 6400 failed: Connection refused
\tIs the server running on that host and accepting TCP/IP connections?
transaction type: multiple scripts
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 176 s
number of transactions actually processed: 355
number of failed transactions: 0 (0.000%)
latency average = 311.281 ms
initial connection time = 474.570 ms
tps = 3.212527 (without initial connection time)
SQL script 1: pgbench_custom_rw_txn.sql
 - weight: 5 (targets 83.3% of total)
 - 298 transactions (83.9% of total, tps = 2.696713)
 - number of failed transactions: 0 (0.000%)
 - latency average = 332.931 ms
 - latency stddev = 50.779 ms
SQL script 2: pgbench_custom_readonly_txn.sql
 - weight: 1 (targets 16.7% of total)
 - 57 transactions (16.1% of total, tps = 0.515814)
 - number of failed transactions: 0 (0.000%)
 - latency average = 193.938 ms
 - latency stddev = 33.889 ms
pgbench: error: Run was aborted; the above results are incomplete.
", err is exit status 2 {"unit": 436}

Expected result

All pgbench runs finish successfully. If a compute is started because enough resources are available, it should also succeed. I would only expect the environment to become slower (lower transaction rates per second), not to fail.

Actual result

About 10% of pgbench runs are aborted.
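
To put a number on the failure rate, the captured pgbench outputs can be classified by the two strings that appear in the log above. This is only a minimal sketch: the function names `classify_run` and `abort_rate` are hypothetical, and it assumes the output of each run has already been collected as a string (it is not part of cloudbench).

```python
# Marker pgbench prints when a run ends early (copied from the log above).
ABORT_MARKER = "Run was aborted"
# Error text reported by the compute in this issue.
PAGESERVER_ERROR = "could not establish connection to pageserver"


def classify_run(output: str) -> str:
    """Return 'ok', 'pageserver_abort', or 'other_abort' for one pgbench output."""
    if ABORT_MARKER not in output:
        return "ok"
    if PAGESERVER_ERROR in output:
        return "pageserver_abort"
    return "other_abort"


def abort_rate(outputs: list[str]) -> float:
    """Fraction of runs that were aborted for any reason (0.0 for no runs)."""
    if not outputs:
        return 0.0
    aborted = sum(1 for out in outputs if classify_run(out) != "ok")
    return aborted / len(outputs)
```

Separating pageserver-related aborts from other aborts makes it easier to tell whether the roughly 10% figure is explained entirely by the connection error.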

Environment

Staging

Logs, links

Compute log

https://neonprod.grafana.net/d/000000039/neon-compute-metrics-by-endpoint-id?var-endpoint_id=ep-still-darkness-98742498&var-env=dev&orgId=1&from=1706280120000&to=1706281140000

Bodobolero commented 8 months ago

It seems that this problem was caused by live migration being enabled, in some form, on staging without proper support for it.

Bodobolero commented 8 months ago

I will close the issue for now; it can be reopened if the problem reappears.

neondatabase / neon