I am running a production-like load in staging with about 10,000 projects created. All projects are deployed on the same pageserver.
On average, 130 pgbench compute clients are running and 100 compute clients are idle, waiting for a connection.
In this scenario, about 10% of the pgbench compute clients are aborted.
Error message: ERROR: [NEON_SMGR] [shard 0] could not establish connection to pageserver
PG:2024-01-26 14:45:50.506 GMT ttid=07ff82a15f76b2a572d4d5ac55712cd0/fa11255a0f4aa832e5b3798970e919cf [261] LOG: [NEON_SMGR] [shard 0] dropping connection to page server due to error
After a while, pgbench stops with "Run was aborted":
2024-01-26T15:50:48.366+0100 ERROR cloudbench error suspending tenant dry-mode-18447344: output is "pgbench: setting random seed to 42
pgbench (16.1, server 15.5)
pgbench: error: client 0 script 0 aborted in command 6 query 0: ERROR: [NEON_SMGR] [shard 0] could not establish connection to pageserver
DETAIL: connection to server at \"pageserver-2.eu-west-1.aws.neon.build\" (10.10.77.2), port 6400 failed: Connection refused
\tIs the server running on that host and accepting TCP/IP connections?
transaction type: multiple scripts
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 176 s
number of transactions actually processed: 355
number of failed transactions: 0 (0.000%)
latency average = 311.281 ms
initial connection time = 474.570 ms
tps = 3.212527 (without initial connection time)
SQL script 1: pgbench_custom_rw_txn.sql
- weight: 5 (targets 83.3% of total)
- 298 transactions (83.9% of total, tps = 2.696713)
- number of failed transactions: 0 (0.000%)
- latency average = 332.931 ms
- latency stddev = 50.779 ms
SQL script 2: pgbench_custom_readonly_txn.sql
- weight: 1 (targets 16.7% of total)
- 57 transactions (16.1% of total, tps = 0.515814)
- number of failed transactions: 0 (0.000%)
- latency average = 193.938 ms
- latency stddev = 33.889 ms
pgbench: error: Run was aborted; the above results are incomplete.
", err is exit status 2 {"unit": 436}
Expected result
All pgbench runs finish successfully. If a compute is started because enough resources are available, it should also succeed.
I would only expect the environment to become slower (lower transaction rates per second), not to fail.
Steps to reproduce
I am running pgbench transactions in a compute:
Compute log
Actual result
About 10% of pgbench runs are aborted.
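For triage, the abort rate can be measured from the captured pgbench outputs. The following is a minimal sketch, not part of the benchmark tooling: the function names `classify_run` and `abort_rate` are hypothetical, and it assumes each run's output is available as one string containing the same messages as in the log above.

```python
# Hypothetical helper: classify captured pgbench outputs and compute the abort rate.
# The marker strings are copied from the log in this report.
ABORT_MARKER = "Run was aborted"
PAGESERVER_ERROR = "could not establish connection to pageserver"


def classify_run(output: str) -> str:
    """Return 'ok', 'pageserver_connection', or 'aborted_other' for one pgbench output."""
    if ABORT_MARKER not in output:
        return "ok"
    if PAGESERVER_ERROR in output:
        return "pageserver_connection"
    return "aborted_other"


def abort_rate(outputs: list[str]) -> float:
    """Fraction of runs that were aborted for any reason (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    aborted = sum(1 for out in outputs if classify_run(out) != "ok")
    return aborted / len(outputs)
```

Separating the pageserver connection failures from other aborts would show whether all of the failing runs share this one cause.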
Environment
Staging
Logs, links
Compute log
https://neonprod.grafana.net/d/000000039/neon-compute-metrics-by-endpoint-id?var-endpoint_id=ep-still-darkness-98742498&var-env=dev&orgId=1&from=1706280120000&to=1706281140000