neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Connection reset by peer error #6467

Closed Bodobolero closed 5 months ago

Bodobolero commented 8 months ago

Observation:

Important: we also see 276 cases of "could not receive data from client: Connection reset by peer" within a few hours in production.

Staging/dev: about 5 % of tenants running pgbench fail with "fatal: Run was aborted; the above results are incomplete."

Analysis of proxy and compute logs suggests that the root cause lies in the proxy<->compute connection.

Steps to reproduce

With 500 active Postgres compute instances associated with a single pageserver (default suspension timeout):

Note that this happens in staging during a test run of ./cloudbench productionlike_bench init --config productionlike_warmup.yaml --apikey <secret>, see https://github.com/neondatabase/cloud/blob/1da98ec0e262fb49c7b85127b1e447c45bd64499/bench/internal/controllers/productionlikecontroller/bench/productionlikebenchinit.go

Expected result

Every pgbench run completes, as the other 95 % do.

Actual result

Approximately 5 % of pgbench runs fail with connection reset by peer.

Environment

Staging and Prod

Logs, links

client side (cloud bench):

2024-01-24T13:58:11.184Z        DEBUG   cloudbench      Will execute PGPASSWORD=<secret> pgbench -f pgbench_custom_readonly_txn.sql@1 --random-seed=42 -s 1 -c 1 -D neon_sleep_time=2 -D neon_num_aid_keys=500 --protocol=prepared -n --time 60 postgresql://peterbendel@ep-damp-cell-91680705.eu-west-1.aws.neon.build:5432/neondb?sslmode=require in directory /home/ubuntu/cloud/bench/cmd/cloudbench/cmd/productionlikebench/configurations/default     {"unit": 3546}

2024-01-24T14:04:11.214Z        ERROR   cloudbench      execution of unit failed        {"unit": 3546, "error": "output is \"pgbench: setting random seed to 42\\npgbench (14.10 (Ubuntu 14.10-0ubuntu0.22.04.1), server 15.5)\\npgbench: error: client 0 aborted in command 5 (SQL) of script 0; perhaps the backend died while processing\\ntransaction type: pgbench_custom_readonly_txn.sql\\nscaling factor: 1\\nquery mode: prepared\\nnumber of clients: 1\\nnumber of threads: 1\\nduration: 60 s\\nnumber of transactions actually processed: 1302\\nlatency average = 276.126 ms\\ninitial connection time = 287.787 ms\\ntps = 3.621529 (without initial connection time)\\npgbench: fatal: Run was aborted; the above results are incomplete.\\n\", err is exit status 2"}
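To quantify how many runs abort, the captured pgbench output can be scanned for the abort marker shown in the log above. The following is a minimal sketch, not part of cloudbench: the function names `summarize_pgbench_output` and `failure_rate` are hypothetical, and it assumes the output has already been unescaped into plain text with real newlines.

```python
import re

# Marker that pgbench prints when a run is aborted (taken from the log above).
ABORT_MARKER = "Run was aborted; the above results are incomplete."

# Line that reports how many transactions finished before the abort.
PROCESSED_RE = re.compile(r"number of transactions actually processed: (\d+)")


def summarize_pgbench_output(output: str) -> dict:
    """Report whether a run aborted and how many transactions it processed."""
    match = PROCESSED_RE.search(output)
    return {
        "aborted": ABORT_MARKER in output,
        "transactions": int(match.group(1)) if match else None,
    }


def failure_rate(outputs: list) -> float:
    """Fraction of runs whose output contains the abort marker."""
    if not outputs:
        return 0.0
    aborted = sum(summarize_pgbench_output(o)["aborted"] for o in outputs)
    return aborted / len(outputs)
```

Applied to the outputs of all units in a benchmark, `failure_rate` would give the fraction of aborted runs, which is the figure reported here as roughly 5 %.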

https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22dab%22:%7B%22datasourc[…]968000000%22,%22to%22:%221706140799000%22%7D%7D%7D&orgId=1

https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22pnf%22:%7B%22datasourc[…]22from%22:%22now-6h%22,%22to%22:%22now%22%7D%7D%7D&orgId=1

2024-01-24T14:04:11.213550Z ERROR per-client task finished with an error: Connection reset by peer (os error 104) session_id=a52efa22-75a6-48d0-bf29-cf6baee67a83
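To correlate such log lines with failed client runs, the OS error code and session ID can be pulled out of each line. This is a rough sketch that assumes every line follows the layout of the single example above. The helper names `parse_log_line` and `count_connection_resets` are hypothetical and do not come from the Neon codebase.

```python
import re

# On Linux, errno 104 is ECONNRESET ("Connection reset by peer").
LINUX_ECONNRESET = 104

OS_ERROR_RE = re.compile(r"\(os error (\d+)\)")
SESSION_RE = re.compile(r"session_id=([0-9a-f-]+)")


def parse_log_line(line: str) -> dict:
    """Extract the OS error number and session ID from one log line, if present."""
    os_error = OS_ERROR_RE.search(line)
    session = SESSION_RE.search(line)
    return {
        "os_error": int(os_error.group(1)) if os_error else None,
        "session_id": session.group(1) if session else None,
    }


def count_connection_resets(lines: list) -> int:
    """Count the lines that report a connection reset (os error 104)."""
    return sum(
        1 for line in lines
        if parse_log_line(line)["os_error"] == LINUX_ECONNRESET
    )
```

The extracted session IDs could then be matched against compute-side logs for the same time window.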

Internal discussion

https://neondb.slack.com/archives/C039YKBRZB4/p1706104412540999?thread_ts=1704534532.501369&cid=C039YKBRZB4

https://neondb.slack.com/archives/C060N3SEF9D/p1706107025142509

stradig commented 7 months ago

We need to investigate the effect on customers. For now, it is a P1.

stradig commented 5 months ago

Anna, as discussed, please find out if this is still a problem.

conradludgate commented 5 months ago

Please re-open should this still be an issue for pgbench. Closing tentatively.