neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Postgres segfault if pageserver is detached while queries are running #3231

Closed hlinnaka closed 1 year ago

hlinnaka commented 1 year ago

I wrote a test case that detaches and re-attaches the pageserver while queries are running. Postgres segfaults:

2022-12-29 18:32:10.535 GMT [2289306] LOG:  standby "walproposer" is now a synchronous standby with priority 1
2022-12-29 18:32:10.606 GMT [2289312] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.514 GMT [2289357] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.516 GMT [2289362] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.518 GMT [2289361] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.519 GMT [2289366] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.520 GMT [2289365] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.521 GMT [2289358] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.526 GMT [2289359] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.532 GMT [2289364] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:11.536 GMT [2289360] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:12.896 GMT [2289363] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.551 GMT [2289357] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.551 GMT [2289363] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.553 GMT [2289361] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.553 GMT [2289358] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.553 GMT [2289365] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.554 GMT [2289360] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.554 GMT [2289359] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.554 GMT [2289366] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.554 GMT [2289362] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.554 GMT [2289364] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.788 GMT [2289357] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.788 GMT [2289363] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.788 GMT [2289357] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.788 GMT [2289363] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.789 GMT [2289359] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.789 GMT [2289359] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.789 GMT [2289361] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.789 GMT [2289361] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.789 GMT [2289362] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.789 GMT [2289362] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.789 GMT [2289358] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.789 GMT [2289358] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.789 GMT [2289366] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.789 GMT [2289366] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.789 GMT [2289364] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.789 GMT [2289364] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.798 GMT [2289365] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.798 GMT [2289365] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:14.802 GMT [2289360] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user:@localhost:15002'
2022-12-29 18:32:14.802 GMT [2289360] LOG:  [NEON_SMGR] dropping connection to page server due to error
2022-12-29 18:32:15.537 GMT [2289299] LOG:  server process (PID 2289360) was terminated by signal 11: Segmentation fault
2022-12-29 18:32:15.537 GMT [2289299] DETAIL:  Failed process was running: UPDATE t SET counter = counter + 1 WHERE id = 51423
2022-12-29 18:32:15.537 GMT [2289299] LOG:  terminating any other active server processes
2022-12-29 18:32:16.283 GMT [2289299] LOG:  received fast shutdown request
2022-12-29 18:32:17.505 GMT [2289299] LOG:  abnormal database system shutdown
2022-12-29 18:32:17.667 GMT [2289299] LOG:  database system is shut down

I got a core dump, but the backtrace is garbled:

Core was generated by `postgres: cloud_admin postgres 127.0.0.1(46394) UPDATE                        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:111
111 ../sysdeps/x86_64/__longjmp.S: No such file or directory.
(gdb) bt
#0  __longjmp () at ../sysdeps/x86_64/__longjmp.S:111
#1  0x27b4f4dc58f3e00c in ?? ()
Backtrace stopped: Cannot access memory at address 0x68ad74dc701a2964

Test case is here: https://github.com/neondatabase/neon/compare/main...add-pageserver-reattach-test

hlinnaka commented 1 year ago

I was able to capture and step through this with 'rr'. The error really happens in the 'longjmp' call:

#0  pg_re_throw () at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/error/elog.c:1803
#1  0x0000561ea166938e in errfinish (filename=0x7f43811884a9 "libpagestore.c", lineno=222, funcname=0x7f4381188de0 <__func__.4> "pageserver_send")
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/error/elog.c:593
#2  0x00007f4381178df9 in pageserver_send (request=0x7ffdea2d5d10) at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:222
#3  0x00007f438117cb1a in prefetch_do_request (slot=0x561ea2052398, force_latest=0x7ffdea2d5e0c, force_lsn=0x7ffdea2d5e10) at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:691
#4  0x00007f438117d216 in prefetch_register_buffer (tag=..., force_latest=0x7ffdea2d5e0c, force_lsn=0x7ffdea2d5e10) at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:857
#5  0x00007f438117f241 in neon_read_at_lsn (rnode=..., forkNum=MAIN_FORKNUM, blkno=2, request_lsn=23694528, request_latest=true, buffer=0x7f4380254d80 "")
    at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:1893
#6  0x00007f438117f848 in neon_read (reln=0x561ea20c2380, forkNum=MAIN_FORKNUM, blkno=2, buffer=0x7f4380254d80 "") at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:1992
#7  0x0000561ea14c7727 in smgrread (reln=0x561ea20c2380, forknum=MAIN_FORKNUM, blocknum=2, buffer=0x7f4380254d80 "")
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/storage/smgr/smgr.c:515
#8  0x0000561ea147efa3 in ReadBuffer_common (smgr=0x561ea20c2380, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=2, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffdea2d607b)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1029
#9  0x0000561ea147e6d8 in ReadBufferExtended (reln=0x7f438110b710, forkNum=MAIN_FORKNUM, blockNum=2, mode=RBM_NORMAL, strategy=0x0)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:782
#10 0x0000561ea147e5a9 in ReadBuffer (reln=0x7f438110b710, blockNum=2) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:713
#11 0x0000561ea14800d7 in ReleaseAndReadBuffer (buffer=0, relation=0x7f438110b710, blockNum=2) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/storage/buffer/bufmgr.c:1686
#12 0x0000561ea103eeac in heapam_index_fetch_tuple (scan=0x561ea20a1d80, tid=0x561ea20a19e8, snapshot=0x561ea20a1ce8, slot=0x561ea20a2720, call_again=0x561ea20a19ee, 
    all_dead=0x7ffdea2d619e) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/heap/heapam_handler.c:130
#13 0x0000561ea10598c3 in table_index_fetch_tuple (scan=0x561ea20a1d80, tid=0x561ea20a19e8, snapshot=0x561ea20a1ce8, slot=0x561ea20a2720, call_again=0x561ea20a19ee, all_dead=0x7ffdea2d619e)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/include/access/tableam.h:1224
#14 0x0000561ea105aa4c in index_fetch_heap (scan=0x561ea20a1988, slot=0x561ea20a2720) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/index/indexam.c:580
#15 0x0000561ea105ab85 in index_getnext_slot (scan=0x561ea20a1988, direction=ForwardScanDirection, slot=0x561ea20a2720)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/index/indexam.c:640
#16 0x0000561ea1059092 in systable_getnext (sysscan=0x561ea20a1930) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/index/genam.c:508
#17 0x0000561ea1645331 in SearchCatCacheMiss (cache=0x561ea2079880, nkeys=1, hashValue=3888260526, hashIndex=46, v1=2965, v2=0, v3=0, v4=0)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/catcache.c:1371
#18 0x0000561ea16451de in SearchCatCacheInternal (cache=0x561ea2079880, nkeys=1, v1=2965, v2=0, v3=0, v4=0)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/catcache.c:1302
#19 0x0000561ea1644ed9 in SearchCatCache1 (cache=0x561ea2079880, v1=2965) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/catcache.c:1170
#20 0x0000561ea16617ef in SearchSysCache1 (cacheId=32, key1=2965) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/syscache.c:1134
#21 0x0000561ea1654521 in RelationInitIndexAccessInfo (relation=0x7f437f6449f8) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/relcache.c:1433
#22 0x0000561ea1653ed6 in RelationBuildDesc (targetRelId=2965, insertIt=true) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/relcache.c:1211
#23 0x0000561ea1655f26 in RelationIdGetRelation (relationId=2965) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/cache/relcache.c:2099
#24 0x0000561ea0fd2408 in relation_open (relationId=2965, lockmode=1) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/common/relation.c:59
#25 0x0000561ea10598e2 in index_open (relationId=2965, lockmode=1) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/index/indexam.c:136
#26 0x0000561ea1058d8c in systable_beginscan (heapRelation=0x7f437f643578, indexId=2965, indexOK=true, snapshot=0x561ea20a1c50, nkeys=2, key=0x7ffdea2d66d0)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/access/index/genam.c:395
#27 0x0000561ea1121aa3 in ApplySetting (snapshot=0x561ea20a1c50, databaseid=12990, roleid=10, relsetting=0x7f437f643578, source=PGC_S_DATABASE_USER)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/catalog/pg_db_role_setting.c:238
#28 0x0000561ea167f444 in process_settings (databaseid=12990, roleid=10) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/init/postinit.c:1167
#29 0x0000561ea167f1ec in InitPostgres (in_dbname=0x561ea204df00 "postgres", dboid=0, username=0x561ea204ded8 "cloud_admin", useroid=0, out_dbname=0x0, override_allow_connections=false)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/utils/init/postinit.c:1054
#30 0x0000561ea14cef07 in PostgresMain (argc=1, argv=0x7ffdea2d6a40, dbname=0x561ea204df00 "postgres", username=0x561ea204ded8 "cloud_admin")
    at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/tcop/postgres.c:4102
#31 0x0000561ea1404d0a in BackendRun (port=0x561ea20446e0) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/postmaster/postmaster.c:4530
#32 0x0000561ea140463c in BackendStartup (port=0x561ea20446e0) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/postmaster/postmaster.c:4252
#33 0x0000561ea1400813 in ServerLoop () at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/postmaster/postmaster.c:1745
#34 0x0000561ea1400060 in PostmasterMain (argc=3, argv=0x561ea2019510) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/postmaster/postmaster.c:1417
#35 0x0000561ea12fda16 in main (argc=3, argv=0x561ea2019510) at /home/heikki/git-sandbox/neon//vendor/postgres-v14/src/backend/main/main.c:249
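
For readers unfamiliar with the mechanism at frame #0: pg_re_throw() hands control to the innermost error handler by jumping to the jump buffer that PG_exception_stack points to. The sketch below is a simplified model of that pattern, not the Postgres or Neon code, and it is not a diagnosis of this crash. The names exception_stack, re_throw, failing_send and run_with_handler are hypothetical, and plain setjmp/longjmp are used for portability where Postgres uses sigsetjmp/siglongjmp.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for PostgreSQL's PG_exception_stack:
 * points at the jump buffer of the innermost active handler. */
static jmp_buf *exception_stack = NULL;

/* Simplified model of pg_re_throw(): jump to the innermost handler.
 * Real PostgreSQL promotes the error to FATAL when no handler is
 * installed; this sketch only asserts that one exists. */
static void
re_throw(void)
{
    assert(exception_stack != NULL);
    longjmp(*exception_stack, 1);
}

/* Hypothetical stand-in for a function that reports an error,
 * in the position pageserver_send() has in the backtrace. */
static void
failing_send(int fail)
{
    if (fail)
        re_throw();
}

/* Runs failing_send() under a handler.
 * Returns 1 if an error was caught, 0 if the call completed. */
int
run_with_handler(int fail)
{
    jmp_buf         local_buf;
    jmp_buf        *saved_stack = exception_stack;
    volatile int    caught = 0;

    if (setjmp(local_buf) == 0)
    {
        /* Install our handler, then run the code that may throw. */
        exception_stack = &local_buf;
        failing_send(fail);
    }
    else
    {
        /* Arrived here via longjmp() from re_throw(). */
        caught = 1;
    }

    /* Always restore the previous handler before this frame returns.
     * local_buf lives in this stack frame, so leaving exception_stack
     * pointing at it would make any later re_throw() jump into a dead
     * frame, which is undefined behavior in C. */
    exception_stack = saved_stack;
    return caught;
}
```

Calling run_with_handler(1) takes the error path and returns 1, while run_with_handler(0) returns 0. The property the sketch is meant to highlight is that a longjmp is only valid while the frame that called setjmp is still live. Whether a stale or corrupted jump buffer is what happened in this issue is not established by the traces above.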