jfinzel opened this issue 7 years ago
@MasahikoSawada I think this problem should be addressed in the next release. I too had logical replication break after using pg_repack, and we had to drop the replication slots and resync tables more or less manually.
I can try to take a look into it, at least to try to reproduce. Help from @jfinzel or other people would be welcome.
> I think this problem should be addressed in the next release.

Agreed.
@dvarrazzo what precise issue did you have? I have actually used pg_repack successfully on lots of tables in replication with pglogical. But I found that the issue occurred specifically when the walsender hit a timeout and I got `walreceiver terminating due to replication timeout`.
I am wondering whether, if I removed the wal sender/receiver timeouts for our replication role, the issue would ever come back. I can tell you for a fact that pg_repack usually works with logical decoding in our experience. So I don't understand what the precise problem really is, but please let me know if this explanation sheds any light on the issue.
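For reference, a sketch of what I mean by removing the timeouts. This assumes the standard GUC names (`wal_sender_timeout` on the provider, `wal_receiver_timeout` on the subscriber) and superuser access; setting them to 0 disables the timeout entirely, which may hide real connectivity problems, so it's a diagnostic step rather than a recommendation:

```sql
-- On the provider: let the walsender wait indefinitely for the
-- receiver to acknowledge (0 disables the timeout).
ALTER SYSTEM SET wal_sender_timeout = 0;

-- On the subscriber: likewise disable the receiver-side timeout.
ALTER SYSTEM SET wal_receiver_timeout = 0;

-- Both settings take effect on a config reload, no restart needed.
SELECT pg_reload_conf();
```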
I had two logical replication slots streaming, and they both stopped roughly at the time someone ran pg_repack on one of the replicated tables. The slots just stopped sending data, and `pg_logical_slot_peek_changes()` on them didn't terminate; newly created replication slots worked fine, so I had to drop and recreate the existing ones and recover the missing data manually.
Unfortunately I didn't check the server log details. My client just hung on the peek_changes function (maybe that's what caused a timeout for you?): if we manage to reproduce the issue, maybe we can take a look there.
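For anyone trying to reproduce this, a minimal check with the built-in `test_decoding` plugin might look like the following (the slot name is illustrative; it assumes `wal_level = logical` and a free `max_replication_slots` entry):

```sql
-- Create a throwaway logical slot to decode changes on this server.
SELECT pg_create_logical_replication_slot('repack_test', 'test_decoding');

-- Peek at pending changes without consuming them; on the broken
-- slots described above, this call simply never returned.
SELECT * FROM pg_logical_slot_peek_changes('repack_test', NULL, NULL);

-- Clean up the test slot so it doesn't retain WAL.
SELECT pg_drop_replication_slot('repack_test');
```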
The sending server is postgres 9.4.14; the receiving one is 9.6, but that shouldn't matter.
Interesting. Again, I have repacked busy tables that are in logical replication without issue. I can't tell you why that normally works. I only had breakage when repacking a table with several hundred million rows: after 3+ hours, I got `walreceiver terminating due to replication timeout` in the server logs, at precisely the time the logical decoding broke.
@dvarrazzo Have you had time to address the issue with pg_repack breaking logical replication?
@bryanconlon not much, no
I mostly use pg_repack to rebuild indexes. Is this issue related to repacking tables only or does it also affect index rebuilds?
@dvarrazzo @bryanconlon see my new comments on https://github.com/2ndQuadrant/pglogical/issues/113 - I'm curious about your opinion on this.
My opinion is now about 2 years old: I had cases back then of replication interrupting, and the only way to restore it was to drop and recreate the replication slot. The project using replication hasn't gone further (partly because of these problems, but also partly because of different business requirements and constraints).
That might have been related to postgres bugs that have since been solved (might have been a nowadays-old 9.4 thing?). I don't have a more recent set of experiences to share, and I'm happy not to be too superstitious about it. However, the function pg_repack uses to swap relfilenodes has not kept up with all the features current postgres servers support, and I'm not comfortable at the moment that all the necessary bookkeeping is done (see #152).
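For context on what the swap involves, the on-disk file backing a table can be observed through `pg_class`; pg_repack rebuilds the table into a copy and then exchanges these identifiers, which is the bookkeeping step in question. A hypothetical check (table name is illustrative):

```sql
-- relfilenode identifies the physical storage file for a relation.
-- Running this before and after a repack shows a different value,
-- because pg_repack swaps the rebuilt copy's file in place.
SELECT relname, relfilenode
FROM pg_class
WHERE relname = 'my_table';
```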
Thanks for the reply. Do you have any idea whether the issues you had before were correlated with a single transaction generating a lot of WAL?
We are running pglogical replication from a 9.4 to a 9.6 cluster, and over the last couple of days we had an issue running pg_repack on several provider tables that are in replication. We have run pg_repack this way before without a problem; however, this time some larger tables were involved.
I posted this issue on pglogical's git repo here: https://github.com/2ndQuadrant/pglogical/issues/113
According to PJMODOS:
Thoughts?