neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Bug: vacuum fails with error "could not establish connection to pageserver" while scanning block 2037896 of relation "public.pgbench_accounts" during pgbench vacuum phase #6425

Open Bodobolero opened 7 months ago

Bodobolero commented 7 months ago

During the vacuum step of the pgbench --initialize phase, I got the following error for two projects:

Project https://console.stage.neon.tech/app/projects/sparkling-salad-20950028

could not establish connection to pageserver

2024-01-21T10:10:43.424Z        ERROR   cloudbench      execution of unit failed        {"unit": 2105, "error": "output is \"generating data (server-side)...\\nvacuuming...\\npgbench: fatal: query failed: ERROR:  [NEON_SMGR] could not establish connection to pageserver\\nDETAIL:  connection to server at \\\"pageserver-2.eu-west-1.aws.neon.build\\\" (10.10.77.2), port 6400 failed: Connection refused\\n\\tIs the server running on that host and accepting TCP/IP connections?\\nCONTEXT:  while scanning block 2037896 of relation \\\"public.pgbench_accounts\\\"\\npgbench: query was: vacuum analyze pgbench_accounts\\n\", err is exit status 1"}

Project https://console.stage.neon.tech/app/projects/floral-band-78249001

could not establish connection to pageserver

2024-01-21T20:41:16.477Z        ERROR   cloudbench      execution of unit failed        {"unit": 6417, "error": "output is \"generating data (server-side)...\\nvacuuming...\\npgbench: fatal: query failed: ERROR:  [NEON_SMGR] could not establish connection to pageserver\\nDETAIL:  connection to server at \\\"pageserver-2.eu-west-1.aws.neon.build\\\" (10.10.77.2), port 6400 failed: Connection refused\\n\\tIs the server running on that host and accepting TCP/IP connections?\\npgbench: query was: vacuum analyze pgbench_branches\\n\", err is exit status 1"}

A similar "Connection refused" error from the pageserver occurred while inserting data for project wandering-bird-29329855:

2024-01-21T10:10:43.680Z        ERROR   cloudbench      execution of unit failed        {"unit": 3238, "error": "output is \"generating data (server-side)...\\npgbench: fatal: query failed: ERROR:  [NEON_SMGR] could not establish connection to pageserver\\nDETAIL:  connection to server at \\\"pageserver-2.eu-west-1.aws.neon.build\\\" (10.10.77.2), port 6400 failed: Connection refused\\n\\tIs the server running on that host and accepting TCP/IP connections?\\npgbench: query was: insert into pgbench_accounts(aid,bid,abalance,filler) select aid, (aid - 1) / 100000 + 1, 0, '' from generate_series(1, 176200000) as aid\\n\", err is exit status 1"}

Just had another one:

pgbench --initialize --init-steps=Gv -s 5755 --host ep-sweet-mouse-47852344.eu-west-1.aws.neon.build --port 5432 --username peterbendel neondb
generating data (server-side)...
vacuuming...
pgbench: fatal: query failed: ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "pageserver-2.eu-west-1.aws.neon.build" (10.10.77.2), port 6400 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
CONTEXT:  while scanning block 10495924 of relation "public.pgbench_accounts"
pgbench: query was: vacuum analyze pgbench_accounts

Steps to reproduce

For context see issue #9483

This happened while ingesting data into 9230 projects, at a rate of 5 per minute, using pgbench --initialize.

Project "sparkling-salad-20950028" is a "large" project with 427 GB of resident size. The pgbench scale factor used is 7633.

Project "floral-band-78249001" is a "small" project with a pgbench scale factor of 1.

Project "wandering-bird-29329855" is a "medium" project with a scale factor of 1762 (trying to write approx. 80 GB).
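
For readers mapping the scale factors to the numbers in the logs, here is a small sketch. It relies on pgbench creating 100,000 pgbench_accounts rows per scale unit, which matches the generate_series(1, 176200000) call logged for scale factor 1762. The block-offset helper assumes the default PostgreSQL block size of 8192 bytes, which is an assumption and was not confirmed for these projects. The function names are illustrative only.

```python
# pgbench creates 100,000 rows in pgbench_accounts per unit of scale factor.
ACCOUNTS_PER_SCALE_UNIT = 100_000

# Assumption: default PostgreSQL block size (BLCKSZ); not confirmed in this report.
ASSUMED_BLOCK_SIZE = 8192


def accounts_rows(scale: int) -> int:
    """Number of pgbench_accounts rows generated for a given scale factor."""
    return scale * ACCOUNTS_PER_SCALE_UNIT


def block_offset_bytes(block_number: int, block_size: int = ASSUMED_BLOCK_SIZE) -> int:
    """Byte offset of a block within a relation, under the assumed block size."""
    return block_number * block_size


if __name__ == "__main__":
    # Scale factors mentioned in this report.
    for scale in (1, 1762, 5755, 7633):
        print(f"scale {scale}: {accounts_rows(scale):,} rows in pgbench_accounts")

    # Blocks named in the CONTEXT lines of the failing vacuum runs.
    for block in (2037896, 10495924):
        gib = block_offset_bytes(block) / 2**30
        print(f"block {block}: about {gib:.1f} GiB into the relation")
```

Under that block-size assumption, both failures hit well into the scan of pgbench_accounts rather than at its start.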

Expected result

Vacuum phase of pgbench runs to completion.

Actual result

Vacuum phase aborted.

Environment

Staging

Logs, links

jcsp commented 5 months ago

@Bodobolero has this occurred again recently?

There were some changes to reconnect logic since the original report (e10a7ee3915c036bafd5dee5b57f7d02eed46b29), and also changes that improved restart times for pageservers (since at the time this issue occurred, it was running in a region with frequent restarts)
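
As a generic illustration of what reconnect logic of this kind does, here is a minimal sketch of retrying on "Connection refused" with exponential backoff. This is not the code from the commit above, and not how the Neon compute extension is implemented. The function name, parameters, and delay values are hypothetical and chosen only for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,      # hypothetical default, not taken from Neon
    base_delay: float = 0.5,    # seconds before the first retry (assumed)
    max_delay: float = 8.0,     # upper bound on the delay between retries (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on ConnectionRefusedError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionRefusedError:
            # Give up and surface the error once the attempt budget is spent.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)

    raise AssertionError("unreachable")
```

With a policy like this, a pageserver restart shorter than the total backoff window would be absorbed by the retries instead of aborting the running query. Whether that holds here depends on the actual restart times and retry settings.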

Bodobolero commented 5 months ago

I only created my projects once in EU-West-1. I am now creating projects again in US-East-2, so let's see if we run into the vacuum error there again.