neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.28k stars 408 forks source link

Disconnection of pageserver causes assertion failure in endpoint #5734

Open alexanderlaw opened 10 months ago

alexanderlaw commented 10 months ago

Steps to reproduce

cargo neon init --force
cargo neon start
cargo neon tenant create --pg-version 16 --set-default
cargo neon endpoint create --pg-version 16 main

echo "shared_buffers = 1024MB" >> ".neon/endpoints/main/postgresql.conf"
cargo neon endpoint start main

createdb test
( sleep 1; cargo neon pageserver stop ) &
echo "
select pg_sleep(3);

create table t(b box);
create index t_idx on t using gist(b);
" | psql test

Result

CREATE TABLE
ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
WARNING:  AbortTransaction while in ABORT state
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

Logs, links

.neon/endpoints/main/compute.log contains:

...
2023-10-31 07:49:43.264 GMT [459970] LOG:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:49:43.264 GMT [459970] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:49:43.264 GMT [459970] STATEMENT:  create index t_idx on t using gist(b);
...
2023-10-31 07:50:42.377 GMT [468868] LOG:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:50:42.377 GMT [468868] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:50:43.307 GMT [459970] ERROR:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:50:43.307 GMT [459970] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:50:43.307 GMT [459970] STATEMENT:  create index t_idx on t using gist(b);
2023-10-31 07:50:43.308 GMT [459970] ERROR:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:50:43.308 GMT [459970] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:50:43.308 GMT [459970] WARNING:  AbortTransaction while in ABORT state
TRAP: failed Assert("TransactionIdIsValid(proc->xid)"), File: ".../neon//vendor/postgres-v16/src/backend/storage/ipc/procarray.c", Line: 680, PID: 459970
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(__interceptor_backtrace+0x5e)[0x55e5c9cab08e]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(ExceptionalCondition+0x1c9)[0x55e5cce73dd9]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(ProcArrayEndTransaction+0xf0)[0x55e5cc17e910]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x2208aa6)[0x55e5ca349aa6]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(AbortCurrentTransaction+0x30f)[0x55e5ca35174f]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(PostgresMain+0xe57)[0x55e5cc2d8e67]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x3c269c1)[0x55e5cbd679c1]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x3c1c840)[0x55e5cbd5d840]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x3c1619a)[0x55e5cbd5719a]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(PostmasterMain+0x3b2e)[0x55e5cbd53dde]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(main+0xd1e)[0x55e5cb3aacce]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f41a15aed90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f41a15aee40]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(_start+0x25)[0x55e5c9c65615]
...
2023-10-31 07:50:56.860 GMT [459679] LOG:  server process (PID 459970) was terminated by signal 6: Aborted
2023-10-31 07:50:56.860 GMT [459679] DETAIL:  Failed process was running: create index t_idx on t using gist(b);
...

2023-10-31T07:50:57.204194Z  INFO checking for core dumps in .neon/endpoints/main/pgdata
2023-10-31T07:50:57.204321Z  WARN core dump found: .neon/endpoints/main/pgdata/core.459970, collecting backtrace
2023-10-31T07:50:58.038844Z  WARN core dump backtrace: [New LWP 459970]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX            '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139919872374720) at ./nptl/pthread_kill.c:44
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139919872374720) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139919872374720) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139919872374720, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f41a15c7476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f41a15ad7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055e5cce73e5b in ExceptionalCondition (conditionName=0x55e5cd52cda0 <str> "TransactionIdIsValid(proc->xid)", fileName=0x55e5cd52bea7 ".../neon//vendor/postgres-v16/src/backend/storage/ipc/procarray.c", lineNumber=680) at .../neon/vendor/postgres-v16/src/backend/utils/error/assert.c:66
#6  0x000055e5cc17e910 in ProcArrayEndTransaction (proc=0x7f419c7dd1f0, latestXid=1026) at .../neon/vendor/postgres-v16/src/backend/storage/ipc/procarray.c:680
#7  0x000055e5ca349aa6 in AbortTransaction () at .../neon/vendor/postgres-v16/src/backend/access/transam/xact.c:2834
#8  0x000055e5ca35174f in AbortCurrentTransaction () at .../neon/vendor/postgres-v16/src/backend/access/transam/xact.c:3338
#9  0x000055e5cc2d8e67 in PostgresMain (dbname=0x52900001b378 "test", username=0x52900001b358 "cloud_admin") at .../neon/vendor/postgres-v16/src/backend/tcop/postgres.c:4364
#10 0x000055e5cbd679c1 in BackendRun (port=0x515000003500) at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:4464
#11 0x000055e5cbd5d840 in BackendStartup (port=0x515000003500) at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:4192
#12 0x000055e5cbd5719a in ServerLoop () at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:1782
#13 0x000055e5cbd53dde in PostmasterMain (argc=3, argv=0x503000000a60) at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:1466
#14 0x000055e5cb3aacce in main (argc=3, argv=0x503000000a60) at .../neon/vendor/postgres-v16/src/backend/main/main.c:238

With backtrace_functions = 'pageserver_connect' I see the following call stacks: For the first error: ... index_build -> index_update_stats -> RelationGetNumberOfBlocksInFork -> smgrnblocks -> neon_nblocks -> page_server_request -> pageserver_send -> pageserver_connect. For the second error: ... AbortCurrentTransaction -> AbortTransaction -> smgrDoPendingDeletes -> smgrdounlinkall -> DropRelationsAllBuffers -> smgrexists -> neon_exists -> page_server_request -> pageserver_send -> pageserver_connect.

kelvich commented 10 months ago

thank you @alexanderlaw! cc @save-buffer

hlinnaka commented 2 months ago

Here's the repro as a python test:

diff --git a/test_runner/regress/test_pageserver_restart.py b/test_runner/regress/test_pageserver_restart.py
index 4ce53df21..4f2882335 100644
--- a/test_runner/regress/test_pageserver_restart.py
+++ b/test_runner/regress/test_pageserver_restart.py
@@ -218,3 +218,23 @@ def test_pageserver_chaos(
         # Check that all the updates are visible
         num_updates = endpoint.safe_psql("SELECT sum(updates) FROM foo")[0][0]
         assert num_updates == i * 100000
+
+# Aborting out of a transaction that has created new relations causes
+# a PANIC, if the pageserver cannot be reached. (And if the pageserver cannot
+# be reached, that causes the creation to abort in the first place.)
+#
+# repro for https://github.com/neondatabase/neon/issues/5734
+def test_pageserver_repro_5734(neon_env_builder: NeonEnvBuilder):
+    env = neon_env_builder.init_start()
+
+    endpoint = env.endpoints.create_start("main", config_lines=["shared_buffers='1GB'"])
+
+    with closing(endpoint.connect()) as conn:
+        with conn.cursor() as cur:
+            cur.execute("CREATE DATABASE test")
+
+    with closing(endpoint.connect(dbname="test")) as conn:
+        with conn.cursor() as cur:
+            env.pageserver.stop()
+            cur.execute("create table t(b box);")
+            cur.execute("create index t_idx on t using gist(b);")
hlinnaka commented 2 months ago

When pageserver cannot be reached, the CREATE INDEX fails while trying to update the relpages/reltuples after the index has otherwise already been built. The error causes the transaction to abort. During abort processing, the backend checks if the relation exists on disk:

#0  errstart (elevel=21, domain=0x0) at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/utils/error/elog.c:364
#1  0x00007ff20b1f332c in pageserver_connect (shard_no=0, elevel=21) at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:458
#2  0x00007ff20b1f40ec in pageserver_send (shard_no=0, request=0x7fff7139ba30)
    at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:746
#3  0x00007ff20b1faee2 in page_server_request (req=0x7fff7139ba30) at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:972
#4  0x00007ff20b1fca63 in neon_exists (reln=0x561c1136fab8, forkNum=MAIN_FORKNUM)
    at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:1940
#5  0x0000561c0fa71ab8 in smgrexists (reln=0x561c1136fab8, forknum=MAIN_FORKNUM)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:262
#6  0x0000561c0fa29893 in DropRelationsAllBuffers (smgr_reln=0x561c11349ba8, nlocators=1)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:3821
#7  0x0000561c0fa71e53 in smgrdounlinkall (rels=0x561c11349ba8, nrels=1, isRedo=false)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:445
#8  0x0000561c0f6722ce in smgrDoPendingDeletes (isCommit=false)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/catalog/storage.c:707
#9  0x0000561c0f5ef884 in AbortTransaction () at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/access/transam/xact.c:2861
#10 0x0000561c0f5f0196 in AbortCurrentTransaction ()
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/access/transam/xact.c:3339
#11 0x0000561c0fa7a1eb in PostgresMain (dbname=0x561c112f24c8 "test", username=0x561c112f24a8 "cloud_admin")
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/tcop/postgres.c:4364

That also fails because the pagserver cannot be reached. The error during abort processing causes the assertion failure. In a release build with assertions disabled, it's a warning:

PG:2024-06-14 18:56:16.851 GMT [1023354] WARNING:  AbortTransaction while in ABORT state

This could in theory happen with vanilla Postgres too, if there was a disk failure so that the calls to check if the just-created file exists fails. That's highly unlikely with a local disk though.

This would be nice to fix somehow, but it's pretty low priority.