pglogrepl_demo: how to handle postgres server disconnect

nicksanford commented 3 years ago

Extremely cool project! Thank you very much for creating it.

I'm a go beginner & I'm trying to learn by using your library to create a toy version of debezium that serializes the logical replication messages to json & writes them to stdout. The demo made it very approachable & easy to get started so thank you again! Its been a fantastic way to learn go.

I've got most of it working except the case when the connection with postgres is broken and I'm not sure how to make forward progress.

If this is the wrong forum to ask this question please let me know.

My desired behavior is that if my program looses connection to the postgres server, I'd like the program to terminate with an error (just like the demo does if it can't connect in the first place). I'm simulating this by simply shutting down the postgres server I'm connected to after the program has successfully connected to the postgres server.

The problem I'm running in to is that I can't figure out how to determine if the connection with postgres is still alive or not and I'm confused because it appears that calls to conn.ReceiveMessage are returning successfully immediately, even if the postgres server shut down after connection.

I'm observing is that if the postgess server shuts down after the demo has connected, the call to conn.ReceiveMessage doesn't return an error (which is what I, probably ignorantly, would have expected), instead it returns immediately with a PrimaryKeepaliveMessage https://github.com/jackc/pglogrepl/blob/master/example/pglogrepl_demo/main.go#L75. Logs from the demo look like:

➜  pgreplicator PGLOGREPL_DEMO_CONN_STRING="postgres://localhost/nicholassanford?replication=database" go run main.go
2021/08/22 15:51:17 create publication pglogrepl_demo
2021/08/22 15:51:17 SystemID: 6917633139609526518 Timeline: 1 XLogPos: 0/97F6FC0 DBName: nicholassanford
2021/08/22 15:51:17 Created temporary replication slot: pglogrepl_demo
2021/08/22 15:51:17 Logical replication started on slot pglogrepl_demo
2021/08/22 15:51:17 Primary Keepalive Message => ServerWALEnd: 0/97F6FF8 ServerTime: 2021-08-22 15:51:17.749065 -0400 EDT ReplyRequested: false

Then when I shut down the postgres server the following log is output over & over as the nextStandbyMessageDeadline is always in the past & conn.ReceiveMessage always returns immediately with no error (I don't understand why):

2021/08/22 15:51:23 Primary Keepalive Message => ServerWALEnd: 0/97F6FF8 ServerTime: 2021-08-22 15:51:23.415861 -0400 EDT ReplyRequested: true
2021/08/22 15:51:23 Sent Standby status message
2021/08/22 15:51:23 Primary Keepalive Message => ServerWALEnd: 0/97F6FF8 ServerTime: 2021-08-22 15:51:23.416258 -0400 EDT ReplyRequested: true
2021/08/22 15:51:23 Sent Standby status message

So my questions are:

How should I determine if the connection to the postgres server is still alive & healthy?
Why is conn.ReceiveMessage returning a result when the postgres server it is requesting to receive messages from is no longer alive to send messages?

I'm using the following versions:

go 1.16

require (
  github.com/jackc/pgconn v1.6.5-0.20200823013804-5db484908cf7 // indirect
  github.com/jackc/pglogrepl v0.0.0-20210731151948-9f1effd582c4 // indirect
  github.com/jackc/pgproto3 v1.1.0 // indirect
  github.com/jackc/pgproto3/v2 v2.0.4 // indirect
)

jackc commented 3 years ago

Can you double check that the server is really stopped? This is what I get when I stop my PostgreSQL server.

2021/08/28 09:24:51 Primary Keepalive Message => ServerWALEnd: 28/3801B850 ServerTime: 2021-08-28 09:24:51.88443 -0500 CDT ReplyRequested: false
2021/08/28 09:24:51 Sent Standby status message
2021/08/28 09:24:51 Primary Keepalive Message => ServerWALEnd: 28/3801B850 ServerTime: 2021-08-28 09:24:51.990202 -0500 CDT ReplyRequested: false
2021/08/28 09:25:01 Sent Standby status message
2021/08/28 09:25:01 Primary Keepalive Message => ServerWALEnd: 28/3801B850 ServerTime: 2021-08-28 09:25:01.995287 -0500 CDT ReplyRequested: false
2021/08/28 09:25:05 ReceiveMessage failed: receive message failed: unexpected EOF
exit status 1

nicksanford commented 3 years ago

Thank you very much for the response!

You were right. This was user error. My apologies.

For others that may encounter this issue: I was executing both the example & postgres on osx & managing postgres with brew services. Turns out that when one stops postgres with brew services stop it doesn't kill all postgres child processes if there are still replication clients connected (such as the example).

➜  pglogrepl_demo git:(master) ✗ brew services list
Name          Status  User            Plist
postgresql    stopped
➜  pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford  50247   0.0  0.0  4268408    716 s002  S+   11:15AM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
➜  pglogrepl_demo git:(master) ✗ brew services start postgres
==> Successfully started `postgresql` (label: homebrew.mxcl.postgresql)
➜  pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford  50531   0.0  0.0  4277624    736 s002  S+   11:15AM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
nicholassanford  50504   0.0  0.0  4485552   1688   ??  Ss   11:15AM   0:00.00 postgres: logical replication launcher
nicholassanford  50503   0.0  0.0  4339732   1056   ??  Ss   11:15AM   0:00.00 postgres: stats collector
nicholassanford  50502   0.0  0.0  4485552   1980   ??  Ss   11:15AM   0:00.00 postgres: autovacuum launcher
nicholassanford  50501   0.0  0.0  4485360   1128   ??  Ss   11:15AM   0:00.00 postgres: walwriter
nicholassanford  50500   0.0  0.0  4485360   1176   ??  Ss   11:15AM   0:00.00 postgres: background writer
nicholassanford  50499   0.0  0.0  4485360   1080   ??  Ss   11:15AM   0:00.00 postgres: checkpointer
nicholassanford  50486   0.0  0.1  4485652  15652   ??  S    11:15AM   0:00.05 /usr/local/opt/postgresql/bin/postgres -D /usr/local/var/postgres
# start go example replication program
# tell brew to stop postgres, expecting the go program to terminate
➜  pglogrepl_demo git:(master) ✗ brew services stop postgres
Stopping `postgresql`... (might take a while)
==> Successfully stopped `postgresql` (label: homebrew.mxcl.postgresql)
# checking the child processes shows that not all child processes terminated, as the go replication client is still connected
➜  pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford  50609  55.9  0.0  4486664   5848   ??  Rs   11:15AM   0:04.60 postgres: walsender postgres ::1(54955) idle in transaction
nicholassanford  50901   0.0  0.0  4277624    732 s002  S+   11:15AM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
nicholassanford  50499   0.0  0.0  4485360   1196   ??  Rs   11:15AM   0:00.00 postgres: checkpointer
# killing the go replication client allows the postgres child processes to terminate
➜  pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford  51255   0.0  0.0  4268408    708 s002  S+   11:16AM   0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres

Changing my testing to use postgres within a container (so that when the container terminates, all child processes are terminated) and is much more representative of a netsplit / postgres server termiantion in production, allowed the go program to termiante:

docker run -e POSTGRES_PASSWORD=changeme -p 5432:5432 postgres:13.4 -c wal_level=logical
# start go program 
# kill postgres docker container
# output from go program (same as @jackc 's comment above):
2021/08/28 11:17:49 Primary Keepalive Message => ServerWALEnd: 0/15C7118 ServerTime: 2021-08-28 11:17:50.038512 -0400 EDT ReplyRequested: true
2021/08/28 11:17:49 Sent Standby status message
2021/08/28 11:17:49 ReceiveMessage failed: receive message failed: unexpected EOF

espadolini commented 1 year ago

Just by running postgres in its official docker container and stopping the container, I get keepalive messages with ReplyRequested set for 10 seconds before the connection actually stops; pg_recvlogical doesn't exhibit the same behavior (it stops immediately, instead), in spite of the logic being pretty much the same between both. How can I diagnose this?

espadolini commented 1 year ago

And of course right as I post here I figure it out - at least on my Postgres 15 setup, you should acknowledge that you've written up to the ServerWALEnd upon receiving a keepalive message so that Postgres doesn't keep the connection open for longer than it's intended: https://github.com/postgres/postgres/blob/881cd9e581c2c40023f82edcd1cd2335691b78f2/src/bin/pg_basebackup/pg_recvlogical.c#L469C13-L469C13

Is this something that should be added to the example code?

jackc / pglogrepl

pglogrepl_demo: how to handle postgres server disconnect #20