diranged opened this issue 6 years ago
I've found where the bug lies, and I have a partial fix. The bug is that if handleReplicationMsg() returns an error, it immediately halts the replicationLoop() goroutine. Once that goroutine has exited, and because the replicationMessages channel is unbuffered (size 0), the main connectReplicateLoop() routine blocks forever when it tries to push the next message into the channel.
The failure process looks like this:

1. connectReplicateLoop() starts up and pushes replicationMessages <- msg1.
2. replicationLoop() was already up and waiting for a message.
3. replicationLoop() processes msg1 and sees it's too big (> 1 MB message), so it immediately returns.
4. connectReplicateLoop() has already looped back around, though, and is now stuck trying to push replicationMessages <- msg2.
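Here's a minimal, standalone Go sketch of that pattern (illustrative only, with invented names and a tiny size limit; not the actual pg_kinesis code). The consumer bails out on a "too large" message, the producer's next send has no receiver left, and the program deadlocks:

```go
package main

import (
	"errors"
	"fmt"
)

// consumer stands in for replicationLoop: it exits as soon as it hits an error.
func consumer(messages chan string, finished chan error) {
	for msg := range messages {
		if len(msg) > 5 { // stand-in for the "> 1 MB" size check
			finished <- errors.New("message too large")
			return // goroutine exits; nothing will ever read `messages` again
		}
		fmt.Println("processed", msg)
	}
}

func main() {
	messages := make(chan string)   // unbuffered, like replicationMessages
	finished := make(chan error, 1) // simplification of replicationFinished

	go consumer(messages, finished)

	messages <- "msg1-is-too-big" // handed off, then consumer errors and returns
	messages <- "msg2"            // blocks forever; the Go runtime reports
	// "all goroutines are asleep - deadlock!"
}
```

The diff below is my partial fix, plus a few extra logf() calls I used for debugging.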
diff --git a/pg_kinesis.go b/pg_kinesis.go
index ebbfdf1..f982340 100644
--- a/pg_kinesis.go
+++ b/pg_kinesis.go
@@ -458,6 +458,8 @@ func replicationLoop(replicationMessages chan *pgx.ReplicationMessage, replicati
         if err != nil {
             replicationFinished <- err // already wrapped
+            logf("error received, dumped into replicationFinished: %s", err)
+            logf("replicationFinished size: %d", len(replicationFinished))
             return
         }
     }
@@ -619,6 +621,7 @@ func fetchLag(slot *string, conn *pgx.Conn) error {
 func pkeyLagLoop(slot *string, nonReplConn *pgx.Conn, replicationFinished chan error) {
     for !done.IsSet() {
+        logf("channel length: %d", len(replicationFinished))
         err := fetchPKs(nonReplConn)
         if err != nil {
             replicationFinished <- err
@@ -694,28 +697,34 @@ func connectReplicateLoop(slot *string, sourceConfig pgx.ConnConfig, stream *str
             return errors.Wrap(err, "waiting for replication message failed")
         }
-        // check if the replicating goroutine died
-        select {
-        case replErr = <-replicationFinished:
-        default:
-        }
-
-        if replErr != nil {
-            logf("stopping replication due to replication goroutine failure")
-            return replErr // already wrapped
-        }
         if message != nil {
             if message.WalMessage != nil {
                 // this is not exactly the server time
                 // but we are taking over this field as PG does not send it down
                 message.WalMessage.ServerTime = uint64(now.UnixNano())
-                replicationMessages <- message
+
+                // check if the replicating goroutine died
+                select {
+                case replicationMessages <- message:
+                case replErr = <-replicationFinished:
+                }
+
+                logf("done")
             } else if message.ServerHeartbeat != nil {
                 keepaliveRequested = message.ServerHeartbeat.ReplyRequested == 1
             }
         }
+
+        if replErr != nil {
+            logf("stopping replication due to replication goroutine failure")
+            return replErr // already wrapped
+        }
+
         err = sendKeepalive(conn, keepaliveRequested)
         if err != nil {
             return errors.Wrap(err, "unable to send keepalive")
The problem is that even with this patch, things aren't great. As soon as connectReplicateLoop() dies off, the main() loop sleeps and then restarts (and there is a debate to be had on whether or not that should happen). The problem is that when it restarts, connectReplicateLoop() creates all-new output channels, including new ones for sending more data to Kinesis. At this point, you end up with two different pkeyLagLoop() routines running!

At this point, everything is hosed: there are multiple functions running that try to write to common global variables (pks.Store... and the stats struct). You also, for some reason I have not yet figured out, can only run through connectReplicateLoop() once before it deadlocks while trying to add a message to its new replicationMessages channel.

It seems like at this point, the app just needs to die?
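To make that shared-state hazard concrete, here is a hedged sketch (invented names and structure, not pg_kinesis's actual code) of what a restart without shutting down the old goroutines looks like: two copies of the lag loop end up writing to the same global.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Shared global state, standing in for the real `pks` map and `stats` struct.
var pks sync.Map

// pkeyLagLoopSketch stands in for pkeyLagLoop: it writes shared state forever.
func pkeyLagLoopSketch(generation int) {
	for {
		pks.Store("public.some_table", generation)
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	go pkeyLagLoopSketch(1) // started by the first connectReplicateLoop() pass

	// ... the first pipeline dies, main() sleeps, then "restarts" ...
	time.Sleep(200 * time.Millisecond)

	go pkeyLagLoopSketch(2) // second pass starts another copy; the first is still running

	time.Sleep(500 * time.Millisecond)
	if v, ok := pks.Load("public.some_table"); ok {
		fmt.Println("last writer was generation", v) // nondeterministic: 1 or 2
	}
}
```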
Just a follow-up: if you set done.SetTo(true) on failure, then at least the code exits cleanly:
diff --git a/pg_kinesis.go b/pg_kinesis.go
index ebbfdf1..4c8d32e 100644
--- a/pg_kinesis.go
+++ b/pg_kinesis.go
@@ -694,28 +694,31 @@ func connectReplicateLoop(slot *string, sourceConfig pgx.ConnConfig, stream *str
             return errors.Wrap(err, "waiting for replication message failed")
         }
-        // check if the replicating goroutine died
-        select {
-        case replErr = <-replicationFinished:
-        default:
-        }
-
-        if replErr != nil {
-            logf("stopping replication due to replication goroutine failure")
-            return replErr // already wrapped
-        }
         if message != nil {
             if message.WalMessage != nil {
                 // this is not exactly the server time
                 // but we are taking over this field as PG does not send it down
                 message.WalMessage.ServerTime = uint64(now.UnixNano())
-                replicationMessages <- message
+
+                // check if the replicating goroutine died
+                select {
+                case replicationMessages <- message:
+                case replErr = <-replicationFinished:
+                }
+
             } else if message.ServerHeartbeat != nil {
                 keepaliveRequested = message.ServerHeartbeat.ReplyRequested == 1
             }
         }
+
+        if replErr != nil {
+            logf("stopping replication due to replication goroutine failure while processing: %s", message)
+            done.SetTo(true)
+            return replErr // already wrapped
+        }
+
         err = sendKeepalive(conn, keepaliveRequested)
         if err != nil {
             return errors.Wrap(err, "unable to send keepalive")
2018-05-02T20:36:33Z reading target DB configuration from shell environment
/root/pg_kinesis/src/github.com/diranged/pg_kinesis/pg_kinesis.go : 859 - unable to create slot pg_kinesis: ERROR: replication slot "pg_kinesis" already exists (SQLSTATE 42710)
2018-05-02T20:36:33Z replication slot pg_kinesis already exists, continuing
2018-05-02T20:36:33Z replication starting from LSN 0/0
2018-05-02T20:36:34Z stopping replication due to replication goroutine failure while processing: &{Wal: 3C13/EC317180 Time: 1816-08-21 04:48:58.988617872 +0000 UTC Lag: 0 <nil>}
/root/pg_kinesis/src/github.com/diranged/pg_kinesis/pg_kinesis.go : 887 - replication messages must be less than 1MB in size
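For what it's worth, the shutdown-flag idea generalizes: once any loop flips the shared done flag, every loop that polls it exits on its next iteration. Below is a minimal sketch that uses Go's sync/atomic.Bool as a stand-in for pg_kinesis's done value (which the diffs above show exposing IsSet()/SetTo()); the worker names are just labels.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var done atomic.Bool // stand-in for pg_kinesis's `done` flag (IsSet/SetTo)

// worker models any of the polling loops (replicationLoop, pkeyLagLoop, ...).
func worker(name string, wg *sync.WaitGroup) {
	defer wg.Done()
	for !done.Load() { // analogous to `for !done.IsSet()`
		time.Sleep(100 * time.Millisecond) // stand-in for real work
	}
	fmt.Println(name, "exiting cleanly")
}

func main() {
	var wg sync.WaitGroup
	for _, name := range []string{"replicationLoop", "pkeyLagLoop"} {
		wg.Add(1)
		go worker(name, &wg)
	}

	// A fatal replication error occurs: flip the flag (analogous to done.SetTo(true)).
	time.Sleep(300 * time.Millisecond)
	done.Store(true)

	wg.Wait() // every loop observes the flag and returns, so the process can exit
}
```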
Any updates on this?
@taybin, check out https://github.com/Nextdoor/pg-bifrost. We ended up writing our own tool for this, and we're running it as our primary Postgres streaming method in production at scale.
@diranged Oh cool. I mostly wanted to modify this to add support for writing to RabbitMQ, but I'll take a look at pg-bifrost for that as well. It looks pretty well modularized, so maybe I can get a clean PR of that together.
Open an issue on our project and we can talk through it. We have designed it to be pluggable (we’re considering a direct-to-S3 transport).
First off, this tool is great and I really want to see it succeed. With that said, we're testing it out and finding that after it runs for a while, it deadlocks in our environment. This is the second time we've seen this in 24 hours.
At this point it's been locked like this for ~10 hours, so it's clearly dead. It keeps outputting the status line, but nothing else. Any suggestions on how to troubleshoot?