Open alexanderlaw opened 10 months ago
thank you @alexanderlaw! cc @save-buffer
Here's the repro as a python test:
diff --git a/test_runner/regress/test_pageserver_restart.py b/test_runner/regress/test_pageserver_restart.py
index 4ce53df21..4f2882335 100644
--- a/test_runner/regress/test_pageserver_restart.py
+++ b/test_runner/regress/test_pageserver_restart.py
@@ -218,3 +218,23 @@ def test_pageserver_chaos(
# Check that all the updates are visible
num_updates = endpoint.safe_psql("SELECT sum(updates) FROM foo")[0][0]
assert num_updates == i * 100000
+
+# Aborting out of a transaction that has created new relations causes
+# a PANIC, if the pageserver cannot be reached. (And if the pageserver cannot
+# be reached, that causes the creation to abort in the first place.)
+#
+# repro for https://github.com/neondatabase/neon/issues/5734
+def test_pageserver_repro_5734(neon_env_builder: NeonEnvBuilder):
+    env = neon_env_builder.init_start()
+
+    endpoint = env.endpoints.create_start("main", config_lines=["shared_buffers='1GB'"])
+
+    with closing(endpoint.connect()) as conn:
+        with conn.cursor() as cur:
+            cur.execute("CREATE DATABASE test")
+
+    with closing(endpoint.connect(dbname="test")) as conn:
+        with conn.cursor() as cur:
+            env.pageserver.stop()
+            cur.execute("create table t(b box);")
+            cur.execute("create index t_idx on t using gist(b);")
When the pageserver cannot be reached, the CREATE INDEX fails while trying to update relpages/reltuples, after the index has otherwise already been built. The error causes the transaction to abort. During abort processing, the backend checks whether the relation exists on disk:
#0 errstart (elevel=21, domain=0x0) at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/utils/error/elog.c:364
#1 0x00007ff20b1f332c in pageserver_connect (shard_no=0, elevel=21) at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:458
#2 0x00007ff20b1f40ec in pageserver_send (shard_no=0, request=0x7fff7139ba30)
at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:746
#3 0x00007ff20b1faee2 in page_server_request (req=0x7fff7139ba30) at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:972
#4 0x00007ff20b1fca63 in neon_exists (reln=0x561c1136fab8, forkNum=MAIN_FORKNUM)
at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:1940
#5 0x0000561c0fa71ab8 in smgrexists (reln=0x561c1136fab8, forknum=MAIN_FORKNUM)
at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:262
#6 0x0000561c0fa29893 in DropRelationsAllBuffers (smgr_reln=0x561c11349ba8, nlocators=1)
at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:3821
#7 0x0000561c0fa71e53 in smgrdounlinkall (rels=0x561c11349ba8, nrels=1, isRedo=false)
at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:445
#8 0x0000561c0f6722ce in smgrDoPendingDeletes (isCommit=false)
at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/catalog/storage.c:707
#9 0x0000561c0f5ef884 in AbortTransaction () at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/access/transam/xact.c:2861
#10 0x0000561c0f5f0196 in AbortCurrentTransaction ()
at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/access/transam/xact.c:3339
#11 0x0000561c0fa7a1eb in PostgresMain (dbname=0x561c112f24c8 "test", username=0x561c112f24a8 "cloud_admin")
at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/tcop/postgres.c:4364
That also fails because the pageserver cannot be reached. The error during abort processing causes the assertion failure. In a release build with assertions disabled, it's a warning:
PG:2024-06-14 18:56:16.851 GMT [1023354] WARNING: AbortTransaction while in ABORT state
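To make the sequence above easier to follow, here is a toy Python model of the control flow only. It is not the Postgres or neon code: `ToyBackend`, `PageserverUnreachable`, and the bounded `max_attempts` retry loop are all hypothetical names and simplifications made up for illustration. It assumes that each failed abort attempt is followed by another abort attempt, which is what produces the repeated warning.

```python
class PageserverUnreachable(Exception):
    """Stands in for the ERROR raised when the pageserver cannot be reached."""


class ToyBackend:
    """Toy model of the control flow; hypothetical, not real Postgres code."""

    def __init__(self, pageserver_up: bool):
        self.pageserver_up = pageserver_up
        self.state = "INPROGRESS"
        self.pending_deletes = []  # relations created in this transaction
        self.log = []

    def _request(self, what: str) -> None:
        # Every storage call that needs the pageserver goes through here.
        if not self.pageserver_up:
            raise PageserverUnreachable(what)

    def create_index(self, name: str) -> None:
        self.pending_deletes.append(name)  # must be dropped if we abort
        self._request("nblocks")  # models updating relpages/reltuples

    def abort(self) -> None:
        if self.state == "ABORT":
            self.log.append("WARNING: AbortTransaction while in ABORT state")
        self.state = "ABORT"
        for _rel in self.pending_deletes:
            self._request("exists")  # models the smgrexists check during abort
        self.pending_deletes.clear()
        self.state = "DEFAULT"

    def run(self, name: str, max_attempts: int = 3) -> None:
        try:
            self.create_index(name)
            return
        except PageserverUnreachable:
            pass  # first error: the transaction must now abort
        # Simplification: retry the abort a bounded number of times.
        for _ in range(max_attempts):
            try:
                self.abort()
                return
            except PageserverUnreachable:
                continue  # second error: raised from inside abort processing


backend = ToyBackend(pageserver_up=False)
backend.run("t_idx")
print(backend.log)
```

The point of the sketch is that cleanup itself depends on the same unreachable service as the original operation, so the abort path cannot complete while the pageserver is down.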
This could in theory happen with vanilla Postgres too, if a disk failure caused the calls that check whether the just-created file exists to fail. That's highly unlikely with a local disk, though.
This would be nice to fix somehow, but it's pretty low priority.
Steps to reproduce
Result
Logs, links
.neon/endpoints/main/compute.log contains:
With backtrace_functions = 'pageserver_connect' I see the following call stacks:
For the first error: ... index_build -> index_update_stats -> RelationGetNumberOfBlocksInFork -> smgrnblocks -> neon_nblocks -> page_server_request -> pageserver_send -> pageserver_connect.
For the second error: ... AbortCurrentTransaction -> AbortTransaction -> smgrDoPendingDeletes -> smgrdounlinkall -> DropRelationsAllBuffers -> smgrexists -> neon_exists -> page_server_request -> pageserver_send -> pageserver_connect.