Closed nicksanford closed 3 years ago
Can you double check that the server is really stopped? This is what I get when I stop my PostgreSQL server.
2021/08/28 09:24:51 Primary Keepalive Message => ServerWALEnd: 28/3801B850 ServerTime: 2021-08-28 09:24:51.88443 -0500 CDT ReplyRequested: false
2021/08/28 09:24:51 Sent Standby status message
2021/08/28 09:24:51 Primary Keepalive Message => ServerWALEnd: 28/3801B850 ServerTime: 2021-08-28 09:24:51.990202 -0500 CDT ReplyRequested: false
2021/08/28 09:25:01 Sent Standby status message
2021/08/28 09:25:01 Primary Keepalive Message => ServerWALEnd: 28/3801B850 ServerTime: 2021-08-28 09:25:01.995287 -0500 CDT ReplyRequested: false
2021/08/28 09:25:05 ReceiveMessage failed: receive message failed: unexpected EOF
exit status 1
Thank you very much for the response!
You were right. This was user error. My apologies.
For others that may encounter this issue:
I was executing both the example & postgres on osx & managing postgres with brew services
. Turns out that when one stops postgres with brew services stop
it doesn't kill all postgres child processes if there are still replication clients connected (such as the example).
➜ pglogrepl_demo git:(master) ✗ brew services list
Name Status User Plist
postgresql stopped
➜ pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford 50247 0.0 0.0 4268408 716 s002 S+ 11:15AM 0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
➜ pglogrepl_demo git:(master) ✗ brew services start postgres
==> Successfully started `postgresql` (label: homebrew.mxcl.postgresql)
➜ pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford 50531 0.0 0.0 4277624 736 s002 S+ 11:15AM 0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
nicholassanford 50504 0.0 0.0 4485552 1688 ?? Ss 11:15AM 0:00.00 postgres: logical replication launcher
nicholassanford 50503 0.0 0.0 4339732 1056 ?? Ss 11:15AM 0:00.00 postgres: stats collector
nicholassanford 50502 0.0 0.0 4485552 1980 ?? Ss 11:15AM 0:00.00 postgres: autovacuum launcher
nicholassanford 50501 0.0 0.0 4485360 1128 ?? Ss 11:15AM 0:00.00 postgres: walwriter
nicholassanford 50500 0.0 0.0 4485360 1176 ?? Ss 11:15AM 0:00.00 postgres: background writer
nicholassanford 50499 0.0 0.0 4485360 1080 ?? Ss 11:15AM 0:00.00 postgres: checkpointer
nicholassanford 50486 0.0 0.1 4485652 15652 ?? S 11:15AM 0:00.05 /usr/local/opt/postgresql/bin/postgres -D /usr/local/var/postgres
# start go example replication program
# tell brew to stop postgres, expecting the go program to terminate
➜ pglogrepl_demo git:(master) ✗ brew services stop postgres
Stopping `postgresql`... (might take a while)
==> Successfully stopped `postgresql` (label: homebrew.mxcl.postgresql)
# checking the child processes shows that not all child processes terminated, as the go replication client is still connected
➜ pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford 50609 55.9 0.0 4486664 5848 ?? Rs 11:15AM 0:04.60 postgres: walsender postgres ::1(54955) idle in transaction
nicholassanford 50901 0.0 0.0 4277624 732 s002 S+ 11:15AM 0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
nicholassanford 50499 0.0 0.0 4485360 1196 ?? Rs 11:15AM 0:00.00 postgres: checkpointer
# killing the go replication client allows the postgres child processes to terminate
➜ pglogrepl_demo git:(master) ✗ ps aux | grep postgres
nicholassanford 51255 0.0 0.0 4268408 708 s002 S+ 11:16AM 0:00.00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn postgres
Changing my testing to use postgres within a container (so that when the container terminates, all child processes are terminated) and is much more representative of a netsplit / postgres server termiantion in production, allowed the go program to termiante:
docker run -e POSTGRES_PASSWORD=changeme -p 5432:5432 postgres:13.4 -c wal_level=logical
# start go program
# kill postgres docker container
# output from go program (same as @jackc 's comment above):
2021/08/28 11:17:49 Primary Keepalive Message => ServerWALEnd: 0/15C7118 ServerTime: 2021-08-28 11:17:50.038512 -0400 EDT ReplyRequested: true
2021/08/28 11:17:49 Sent Standby status message
2021/08/28 11:17:49 ReceiveMessage failed: receive message failed: unexpected EOF
Just by running postgres in its official docker container and stopping the container, I get keepalive messages with ReplyRequested set for 10 seconds before the connection actually stops; pg_recvlogical doesn't exhibit the same behavior (it stops immediately, instead), in spite of the logic being pretty much the same between both. How can I diagnose this?
And of course right as I post here I figure it out - at least on my Postgres 15 setup, you should acknowledge that you've written up to the ServerWALEnd upon receiving a keepalive message so that Postgres doesn't keep the connection open for longer than it's intended: https://github.com/postgres/postgres/blob/881cd9e581c2c40023f82edcd1cd2335691b78f2/src/bin/pg_basebackup/pg_recvlogical.c#L469C13-L469C13
Is this something that should be added to the example code?
Extremely cool project! Thank you very much for creating it.
I'm a go beginner & I'm trying to learn by using your library to create a toy version of debezium that serializes the logical replication messages to json & writes them to stdout. The demo made it very approachable & easy to get started so thank you again! Its been a fantastic way to learn go.
I've got most of it working except the case when the connection with postgres is broken and I'm not sure how to make forward progress.
If this is the wrong forum to ask this question please let me know.
My desired behavior is that if my program looses connection to the postgres server, I'd like the program to terminate with an error (just like the demo does if it can't connect in the first place). I'm simulating this by simply shutting down the postgres server I'm connected to after the program has successfully connected to the postgres server.
The problem I'm running in to is that I can't figure out how to determine if the connection with postgres is still alive or not and I'm confused because it appears that calls to
conn.ReceiveMessage
are returning successfully immediately, even if the postgres server shut down after connection.I'm observing is that if the postgess server shuts down after the demo has connected, the call to
conn.ReceiveMessage
doesn't return an error (which is what I, probably ignorantly, would have expected), instead it returns immediately with aPrimaryKeepaliveMessage
https://github.com/jackc/pglogrepl/blob/master/example/pglogrepl_demo/main.go#L75. Logs from the demo look like:Then when I shut down the postgres server the following log is output over & over as the
nextStandbyMessageDeadline
is always in the past &conn.ReceiveMessage
always returns immediately with no error (I don't understand why):So my questions are:
conn.ReceiveMessage
returning a result when the postgres server it is requesting to receive messages from is no longer alive to send messages?I'm using the following versions: