neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech

`test_restarts_under_load` failing #1343

Closed. arssher closed this issue 2 years ago.

arssher commented 2 years ago

From https://github.com/zenithdb/zenith/pull/1280#issuecomment-1059826743

After poking at it, I see that sometimes during the whole iteration period (10s) some backend(s) never manage to finish the transaction. The walproposer looks healthy; it seems like a client problem, as if the client just never sends the next query, e.g.

ars      2403558       1  0  6786 14904   1 Mar06 ?        00:00:00 /home/ars/zenith/zenith/tmp_install/bin/postgres -D /home/ars/zenith/zenith/test_output/test_restarts_under_load/repo/pgdatadirs/tenants/15f66b94744b4a3292d08f1b76a51037/test_wal_acceptors_restarts_under_load
ars      2403560 2403558  0  6817  4788   7 Mar06 ?        00:00:00  \_ postgres: checkpointer 
ars      2403561 2403558  0  6819  4424   6 Mar06 ?        00:00:00  \_ postgres: background writer 
ars      2403562 2403558  0  6786  2504   8 Mar06 ?        00:00:00  \_ postgres: walwriter 
ars      2403563 2403558  0  6962  7240  10 Mar06 ?        00:00:00  \_ postgres: autovacuum launcher 
ars      2403564 2403558  0  4084  2632   1 Mar06 ?        00:00:00  \_ postgres: stats collector 
ars      2403565 2403558  0  9527 16828   7 Mar06 ?        00:00:25  \_ postgres: WAL proposer streaming 0/32CB1F0
ars      2403566 2403558  0  6926  5636   2 Mar06 ?        00:00:00  \_ postgres: logical replication launcher 
ars      2403579 2403558  0  7238 14756   2 Mar06 ?        00:00:00  \_ postgres: zenith_admin postgres 127.0.0.1(35176) idle
ars      2403611 2403558  0  7089 15120   6 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35178) idle in transaction
ars      2403612 2403558  0  7091 14972   2 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35180) UPDATE waiting
ars      2403613 2403558  0  7091 15052   8 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35182) idle in transaction
ars      2403614 2403558  0  7089 15000   8 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35184) idle in transaction
ars      2403615 2403558  0  7091 15092  10 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35186) UPDATE waiting
ars      2403616 2403558  0  7091 14912   6 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35188) UPDATE waiting
ars      2403617 2403558  0  7090 15060   5 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35190) UPDATE waiting
ars      2403618 2403558  0  7091 15184   8 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35192) UPDATE waiting
ars      2403619 2403558  0  7090 13684   3 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35194) UPDATE waiting
ars      2403620 2403558  0  7090 14560   7 Mar06 ?        00:00:12  \_ postgres: zenith_admin postgres 127.0.0.1(35196) UPDATE waiting
ars      2404784 2403558  0  7628 17360   4 Mar06 ?        00:00:00  \_ postgres: zenith_admin postgres [local] idle

So almost everyone is waiting on 2403614, and that backend performed the first update and is now waiting for the client to send the second.

[local]:15014 zenith_admin@postgres:2404784=#  
select * from pg_stat_activity where pid = 2403614;
-[ RECORD 1 ]----+-----------------------------------------------------------
datid            | 12974
datname          | postgres
pid              | 2403614
leader_pid       | 
usesysid         | 10
usename          | zenith_admin
application_name | 
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 35184
backend_start    | 2022-03-06 20:07:51.061371+00
xact_start       | 2022-03-06 20:08:48.642718+00
query_start      | 2022-03-06 20:08:48.643332+00
state_change     | 2022-03-06 20:08:58.971861+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      | 100966
backend_xmin     | 
query_id         | 
query            | UPDATE bank_accs SET amount = amount + ($1) WHERE uid = $2
backend_type     | client backend

Not sure how that's possible.
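
For reference, a query along these lines (stock postgres; pg_blocking_pids() exists since 9.6, and the connection details are taken from the psql session above) could be run against the stuck compute to confirm the blocking chain:

psql -h 127.0.0.1 -p 15014 -U zenith_admin postgres -c "
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE backend_type = 'client backend'
  ORDER BY pid;"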

petuhovskiy commented 2 years ago

I tried to reproduce your observations and got similar results, but to me it looks more like a problem with locks in postgres rather than a client issue.

From my observations, an iteration usually either makes progress within about ~1 second, or makes no progress at all for 10-20 seconds (that's why the test is failing). What I tried to do was collect various useful info once an iteration stops making progress. I do this by running psql -c "SELECT * FROM pg_stat_activity", and strangely enough, this SELECT hangs for 10-20 seconds and completes only after the iteration starts to make progress again. I reproduced this several times; here's an example:

19:40:37 - new iteration has no progress (txes are stuck)

ps ax shows this:

 114225 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43178) UPDATE waiting
 114227 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43180) UPDATE waiting
 114228 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43182) UPDATE waiting
 114229 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43184) UPDATE waiting
 114230 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43186) UPDATE waiting
 114231 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43188) UPDATE
 114232 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43190) UPDATE waiting
 114233 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43192) UPDATE waiting
 114235 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43194) UPDATE waiting
 114236 ?        Ss     0:18 postgres: zenith_admin postgres 127.0.0.1(43196) UPDATE

psql querying pg_stat_activity responds only at 19:40:51, when the iteration has already finished, and reports this:

datid | datname  |  pid   | leader_pid | usesysid |   usename    | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |     wait_event      |        state        | backend_xid | backend_xmin | query_id |                           query                            |         backend_type         
-------+----------+--------+------------+----------+--------------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+---------------------+-------------+--------------+----------+------------------------------------------------------------+------------------------------
       |          | 114046 |            |          |              |                  |             |                 |             | 2022-03-08 19:38:26.821149+00 |                               |                               |                               | Activity        | AutoVacuumMain      |                     |             |              |          |                                                            | autovacuum launcher
       |          | 114049 |            |       10 | zenith_admin |                  |             |                 |             | 2022-03-08 19:38:26.821515+00 |                               |                               |                               | Activity        | LogicalLauncherMain |                     |             |              |          |                                                            | logical replication launcher
 13008 | postgres | 114145 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43154 | 2022-03-08 19:38:36.926781+00 |                               | 2022-03-08 19:40:47.030632+00 | 2022-03-08 19:40:47.030846+00 | Client          | ClientRead          | idle                |             |              |          | SELECT pg_current_wal_flush_lsn()                          | client backend
 13008 | postgres | 114225 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43178 | 2022-03-08 19:38:39.047488+00 | 2022-03-08 19:40:51.076809+00 | 2022-03-08 19:40:51.076809+00 | 2022-03-08 19:40:51.076885+00 | Client          | ClientRead          | idle in transaction |             |              |          | BEGIN;                                                     | client backend
 13008 | postgres | 114227 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43180 | 2022-03-08 19:38:39.047659+00 | 2022-03-08 19:40:51.075756+00 | 2022-03-08 19:40:51.076598+00 | 2022-03-08 19:40:51.076717+00 |                 |                     | idle in transaction |      261679 |              |          | UPDATE bank_accs SET amount = amount - ($1) WHERE uid = $2 | client backend
 13008 | postgres | 114228 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43182 | 2022-03-08 19:38:39.048009+00 | 2022-03-08 19:40:50.459575+00 | 2022-03-08 19:40:51.076975+00 | 2022-03-08 19:40:51.076976+00 | IPC             | SyncRep             | active              |      261578 |              |          | COMMIT;                                                    | client backend
 13008 | postgres | 114230 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43186 | 2022-03-08 19:38:39.048279+00 | 2022-03-08 19:40:51.076312+00 | 2022-03-08 19:40:51.076765+00 | 2022-03-08 19:40:51.076996+00 | Client          | ClientRead          | idle in transaction |      261681 |              |          | UPDATE bank_accs SET amount = amount + ($1) WHERE uid = $2 | client backend
 13008 | postgres | 114232 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43190 | 2022-03-08 19:38:39.048398+00 | 2022-03-08 19:40:50.813095+00 | 2022-03-08 19:40:50.813393+00 | 2022-03-08 19:40:50.813393+00 | Lock            | transactionid       | active              |      261661 |       261467 |          | UPDATE bank_accs SET amount = amount - ($1) WHERE uid = $2 | client backend
 13008 | postgres | 114235 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43194 | 2022-03-08 19:38:39.048663+00 | 2022-03-08 19:40:51.076451+00 | 2022-03-08 19:40:51.076968+00 | 2022-03-08 19:40:51.076969+00 | Client          | ClientRead          | active              |             |       261578 |          | UPDATE bank_accs SET amount = amount + ($1) WHERE uid = $2 | client backend
 13008 | postgres | 114233 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43192 | 2022-03-08 19:38:39.048537+00 | 2022-03-08 19:40:51.075164+00 | 2022-03-08 19:40:51.076839+00 | 2022-03-08 19:40:51.076839+00 | IPC             | SyncRep             | active              |      261677 |              |          | COMMIT;                                                    | client backend
 13008 | postgres | 114231 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43188 | 2022-03-08 19:38:39.049858+00 | 2022-03-08 19:40:51.075224+00 | 2022-03-08 19:40:51.076639+00 | 2022-03-08 19:40:51.076725+00 | IPC             | SyncRep             | idle in transaction |      261678 |              |          | INSERT INTO bank_log VALUES ($1, $2, $3)                   | client backend
 13008 | postgres | 114236 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43196 | 2022-03-08 19:38:39.048776+00 | 2022-03-08 19:40:51.075804+00 | 2022-03-08 19:40:51.076184+00 | 2022-03-08 19:40:51.076184+00 | Lock            | transactionid       | active              |      261680 |       261578 |          | UPDATE bank_accs SET amount = amount + ($1) WHERE uid = $2 | client backend
 13008 | postgres | 114229 |            |       10 | zenith_admin |                  | 127.0.0.1   |                 |       43184 | 2022-03-08 19:38:39.04795+00  | 2022-03-08 19:40:51.074831+00 | 2022-03-08 19:40:51.076655+00 | 2022-03-08 19:40:51.076656+00 | IPC             | SyncRep             | active              |      261676 |              |          | COMMIT;                                                    | client backend
 13008 | postgres | 115100 |            |       10 | zenith_admin | psql             | 127.0.0.1   |                 |       46174 | 2022-03-08 19:40:38.111575+00 | 2022-03-08 19:40:50.197456+00 | 2022-03-08 19:40:50.197456+00 | 2022-03-08 19:40:50.197457+00 |                 |                     | active              |             |       261578 |          | SELECT * FROM pg_stat_activity                             | client backend
       |          | 114044 |            |          |              |                  |             |                 |             | 2022-03-08 19:38:26.820909+00 |                               |                               |                               | Activity        | BgWriterMain        |                     |             |              |          |                                                            | background writer
       |          | 114043 |            |          |              |                  |             |                 |             | 2022-03-08 19:38:26.820495+00 |                               |                               |                               | Activity        | CheckpointerMain    |                     |             |              |          |                                                            | checkpointer
       |          | 114045 |            |          |              |                  |             |                 |             | 2022-03-08 19:38:26.820935+00 |                               |                               |                               | Activity        | WalWriterMain       |                     |             |              |          |                                                            | walwriter
(17 rows)

Not sure if pg_stat_activity tries to acquire some common lock, or if postgres is simply unresponsive to transactions and new connections.
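
One way to test the "common lock" hypothesis is to look at ungranted heavyweight locks directly; a sketch (connect to the test's compute node, adjusting host/port for your run):

# List waiters on transactionid locks together with the pid that holds the
# conflicting lock. If this query also hangs, the stall is probably below the
# heavyweight-lock level (e.g. buffer I/O), not a lock manager problem.
psql postgres -c "
  SELECT blocked.pid  AS blocked_pid,
         blocking.pid AS blocking_pid,
         blocked.locktype,
         blocked.transactionid
  FROM pg_locks blocked
  JOIN pg_locks blocking
    ON blocking.granted
   AND NOT blocked.granted
   AND blocking.transactionid = blocked.transactionid
  WHERE blocked.locktype = 'transactionid';"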

petuhovskiy commented 2 years ago

Tried to dig deeper; it seems that I know where the bug is now, but I'm not entirely sure yet.

The issue still doesn't reproduce reliably, but I managed to get a backtrace of the SELECT * FROM pg_stat_activity backend while everything is stuck:

#0  0x00007f690dd7f58a in epoll_wait (epfd=10, events=0x55fd6c453568, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055fd6ac23956 in WaitEventSetWaitBlock (set=0x55fd6c453508, cur_timeout=-1, occurred_events=0x7ffd5d97a050, nevents=1) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/ipc/latch.c:1450
#2  0x000055fd6ac237d1 in WaitEventSetWait (set=0x55fd6c453508, timeout=-1, occurred_events=0x7ffd5d97a050, nevents=1, wait_event_info=134217734) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/ipc/latch.c:1396
#3  0x000055fd6ac22b11 in WaitLatch (latch=0x7f690d6a8ff4, wakeEvents=33, timeout=-1, wait_event_info=134217734) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/ipc/latch.c:473
#4  0x000055fd6ac35ae6 in ConditionVariableTimedSleep (cv=0x7f690d2f2ca0, timeout=-1, wait_event_info=134217734) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/lmgr/condition_variable.c:163
#5  0x000055fd6ac35a14 in ConditionVariableSleep (cv=0x7f690d2f2ca0, wait_event_info=134217734) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/lmgr/condition_variable.c:100
#6  0x000055fd6ac0fffa in WaitIO (buf=0x7f690d1f1780) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/buffer/bufmgr.c:4427
#7  0x000055fd6ac10088 in StartBufferIO (buf=0x7f690d1f1780, forInput=true) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/buffer/bufmgr.c:4464
#8  0x000055fd6ac0a636 in BufferAlloc (smgr=0x55fd6c4f53f0, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, strategy=0x0, foundPtr=0x7ffd5d97a261) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/buffer/bufmgr.c:1165
#9  0x000055fd6ac09aaf in ReadBuffer_common (smgr=0x55fd6c4f53f0, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffd5d97a383) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/buffer/bufmgr.c:864
#10 0x000055fd6ac09765 in ReadBufferExtended (reln=0x7f690d10a168, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/storage/buffer/bufmgr.c:771
#11 0x000055fd6a77da6d in heapgetpage (sscan=0x55fd6c5252a8, page=0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/access/heap/heapam.c:400
#12 0x000055fd6a77f63d in heapgettup_pagemode (scan=0x55fd6c5252a8, dir=ForwardScanDirection, nkeys=0, key=0x0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/access/heap/heapam.c:899
#13 0x000055fd6a780ea3 in heap_getnextslot (sscan=0x55fd6c5252a8, direction=ForwardScanDirection, slot=0x55fd6c524c80) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/access/heap/heapam.c:1396
#14 0x000055fd6aa3f537 in table_scan_getnextslot (sscan=0x55fd6c5252a8, direction=ForwardScanDirection, slot=0x55fd6c524c80) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/include/access/tableam.h:1044
#15 0x000055fd6aa3f603 in SeqNext (node=0x55fd6c524ad0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeSeqscan.c:80
#16 0x000055fd6a9febf5 in ExecScanFetch (node=0x55fd6c524ad0, accessMtd=0x55fd6aa3f56b <SeqNext>, recheckMtd=0x55fd6aa3f614 <SeqRecheck>) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execScan.c:133
#17 0x000055fd6a9fec9a in ExecScan (node=0x55fd6c524ad0, accessMtd=0x55fd6aa3f56b <SeqNext>, recheckMtd=0x55fd6aa3f614 <SeqRecheck>) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execScan.c:199
#18 0x000055fd6aa3f66a in ExecSeqScan (pstate=0x55fd6c524ad0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeSeqscan.c:112
#19 0x000055fd6a9faa97 in ExecProcNodeFirst (node=0x55fd6c524ad0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execProcnode.c:463
#20 0x000055fd6aa1df99 in ExecProcNode (node=0x55fd6c524ad0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/include/executor/executor.h:257
#21 0x000055fd6aa1e267 in MultiExecPrivateHash (node=0x55fd6c5ae670) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeHash.c:166
#22 0x000055fd6aa1e1cb in MultiExecHash (node=0x55fd6c5ae670) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeHash.c:115
#23 0x000055fd6a9fab8a in MultiExecProcNode (node=0x55fd6c5ae670) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execProcnode.c:519
#24 0x000055fd6aa24c6f in ExecHashJoinImpl (pstate=0x55fd6c5a0e18, parallel=false) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeHashjoin.c:291
#25 0x000055fd6aa253da in ExecHashJoin (pstate=0x55fd6c5a0e18) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeHashjoin.c:591
#26 0x000055fd6a9faa97 in ExecProcNodeFirst (node=0x55fd6c5a0e18) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execProcnode.c:463
#27 0x000055fd6aa24841 in ExecProcNode (node=0x55fd6c5a0e18) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/include/executor/executor.h:257
#28 0x000055fd6aa24ba6 in ExecHashJoinImpl (pstate=0x55fd6c5a0b10, parallel=false) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeHashjoin.c:262
#29 0x000055fd6aa253da in ExecHashJoin (pstate=0x55fd6c5a0b10) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/nodeHashjoin.c:591
#30 0x000055fd6a9faa97 in ExecProcNodeFirst (node=0x55fd6c5a0b10) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execProcnode.c:463
#31 0x000055fd6a9eea73 in ExecProcNode (node=0x55fd6c5a0b10) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/include/executor/executor.h:257
#32 0x000055fd6a9f14ac in ExecutePlan (estate=0x55fd6c5a08a0, planstate=0x55fd6c5a0b10, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x55fd6c456e18, execute_once=true) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execMain.c:1551
#33 0x000055fd6a9ef14d in standard_ExecutorRun (queryDesc=0x55fd6c52c570, direction=ForwardScanDirection, count=0, execute_once=true) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execMain.c:361
#34 0x000055fd6a9eef5c in ExecutorRun (queryDesc=0x55fd6c52c570, direction=ForwardScanDirection, count=0, execute_once=true) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/executor/execMain.c:305
#35 0x000055fd6ac6124b in PortalRunSelect (portal=0x55fd6c4c0020, forward=true, count=0, dest=0x55fd6c456e18) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/tcop/pquery.c:921
#36 0x000055fd6ac60e84 in PortalRun (portal=0x55fd6c4c0020, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x55fd6c456e18, altdest=0x55fd6c456e18, qc=0x7ffd5d97ac30) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/tcop/pquery.c:765
#37 0x000055fd6ac5bbca in exec_execute_message (portal_name=0x55fd6c456a00 "", max_rows=9223372036854775807) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/tcop/postgres.c:2197
#38 0x000055fd6ac5ef06 in PostgresMain (argc=1, argv=0x7ffd5d97ae80, dbname=0x55fd6c484038 "postgres", username=0x55fd6c484010 "zenith_admin") at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/tcop/postgres.c:4579
#39 0x000055fd6ab845ff in BackendRun (port=0x55fd6c47d430) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/postmaster/postmaster.c:4536
#40 0x000055fd6ab83eb8 in BackendStartup (port=0x55fd6c47d430) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/postmaster/postmaster.c:4258
#41 0x000055fd6ab7ffa0 in ServerLoop () at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/postmaster/postmaster.c:1751
#42 0x000055fd6ab7f739 in PostmasterMain (argc=3, argv=0x55fd6c4524e0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/postmaster/postmaster.c:1423
#43 0x000055fd6aa74c68 in main (argc=3, argv=0x55fd6c4524e0) at /home/arthur/zenith/tmp_install/build/../../vendor/postgres/src/backend/main/main.c:216

It seems that this backend is blocked waiting for a shared buffer, so I increased shared_buffers to 512MB and tried to reproduce the issue again.

After that, the issue started taking longer to reproduce, but the test now fails at around iteration 25 and everything hangs for a substantial time (no progress for several minutes, at least). That's convenient for debugging, and I got backtraces of all postgres processes while everything is stuck.
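
A sketch of how those backtraces can be collected in one pass (assuming gdb is installed and the postgres processes belong to the current user):

for pid in $(pgrep -u "$USER" -x postgres); do
    echo "=== pid $pid: $(ps -o args= -p "$pid") ==="
    gdb -batch -p "$pid" -ex bt 2>/dev/null
done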

Here's ps ax and backtraces of all postgres processes.

All processes are waiting on something, which is one of the following:

After one minute in the stuck state, the pageserver errors out with:

2022-03-11T09:41:28.615078Z ERROR pagestream{timeline=e0663c93c24bf813ca8fe4ee74805b87 tenant=6cb477bf02b94fde8fba3be27f7b7d5e}: error reading relation or page version: Timed out while waiting for WAL record at LSN 0/E814A78 to arrive, last_record_lsn 0/C89C390 disk consistent LSN=0/169AFA8

Caused by:
    SeqWaitError

The most obvious explanation is that WAL is stuck on the safekeeper and recent records are not being streamed to the pageserver; there is likely a bug somewhere near send_wal.rs.
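
For scale, the gap in that error can be quantified with stock postgres functions:

# Requested LSN 0/E814A78 vs last_record_lsn 0/C89C390: about 31 MB of WAL
# that the pageserver has not ingested yet.
psql -c "SELECT pg_size_pretty(pg_wal_lsn_diff('0/E814A78', '0/C89C390'));"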

petuhovskiy commented 2 years ago

I tried bisecting the issue and it led me to fa8a6c0e94405f03345293f43a6b3ad74c734750, i.e. the issue appeared when we disabled debug logs in the safekeeper (which made the safekeeper 1.5-2x faster in local tests). With the logs disabled, the issue also reproduces on commits from January, so it's not a recent bug.

I added a metric for last_record_lsn in the pageserver and collected all prometheus metrics while the test was running. Here are the results for shared_buffers=512MB:

[graph: pageserver last_record_lsn (red) vs safekeeper flush_lsn (green/yellow/blue)]

The red line is last_record_lsn in the pageserver, and green/yellow/blue are flush_lsn from the safekeepers.
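
A crude way to watch these LSNs live while the test runs is to scrape the pageserver's prometheus endpoint; the 9898 management port and the presence of LSN metrics there are assumptions about this setup:

watch -n 1 'curl -s http://127.0.0.1:9898/metrics | grep -i lsn'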

So, it looks like the pageserver isn't able to keep up with the safekeepers, and the LSN gap between them keeps growing. After some time (probably when shared_buffers fill up) the compute starts making requests that call wait_lsn() in the pageserver, and transactions hang at that moment. On the graph this is where the blue line flattens out at the end.

Same metrics when shared_buffers=1MB:

[graph: same metrics with shared_buffers=1MB]

This time the pageserver also falls behind, but wait_lsn() is called more frequently, so the gap doesn't grow too large and is usually resolved within a minute. Sometimes the gap is big enough that the wait takes 10s or more, and that is exactly the case in which test_restarts_under_load fails.

The situation where transactions hang for 10+ seconds (because the compute needs to wait for the pageserver to catch up) doesn't look good; I wonder if backpressure is supposed to solve this? With shared_buffers=1MB the problem doesn't look too bad, because everything continues to work normally after 10 seconds. With shared_buffers=512MB the situation is worse, because requests time out after 1 minute and the error aborts the transaction.

Not sure what is the next step for this issue.

arssher commented 2 years ago

Nice investigation. Looks like we should harden back pressure settings. CC @knizhnik
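
For concreteness, a sketch of what hardening backpressure could mean on the compute side; the GUC names below are the ones documented for present-day Neon, and their availability in the tree at the time of this issue is an assumption:

# Inspect the current backpressure limits on the compute node.
psql -c "SHOW max_replication_write_lag;"
psql -c "SHOW max_replication_flush_lag;"
psql -c "SHOW max_replication_apply_lag;"
# Lowering these limits makes backends throttle earlier, so the compute /
# pageserver LSN gap stays bounded instead of stalling for tens of seconds.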

stepashka commented 2 years ago

we'll harden the settings once we see a test failure again

petuhovskiy commented 2 years ago

The test failed on main. The logs have nothing suspicious, but it seems that the pageserver is lagging behind:

pytest logs:
2022-03-28 11:03:59.839 INFO [test_wal_acceptor_async.py:122] Safekeeper at port 15302 has flush_lsn 0/38F8550, waiting for lsn 0/38F8428

pageserver logs:
2022-03-28T11:04:10.249026Z  INFO WAL receiver{timeline=de7ca380458cd3988596165d9a630b8f tenant=08767d593179452881f06dbd8720f5b7}: caught up at LSN 0/312D2A8
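
The lag implied by those two log lines can be checked quickly with stock postgres functions:

# Safekeeper flush_lsn 0/38F8550 vs the LSN the pageserver caught up to,
# 0/312D2A8: roughly 8 MB of WAL still to be ingested by the pageserver.
psql -c "SELECT pg_size_pretty(pg_wal_lsn_diff('0/38F8550', '0/312D2A8'));"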