hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Unmanaged replication slots being dropped by pg_auto_failover #866

Open rheaton opened 2 years ago

rheaton commented 2 years ago

Hello!

I am not sure if this is a bug or desired behavior, but I would like to understand it either way.

If you've manually created a replication slot on the primary (with a name not matching the pg_auto_failover slot naming convention), this slot will be dropped during a failover of the primary.
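For illustration, a minimal sketch of the scenario (the slot name is arbitrary, borrowed from the discussion below):

```sql
-- On the current primary: create a slot that pg_auto_failover
-- does not manage (any name outside its naming convention).
SELECT pg_create_physical_replication_slot('bespoke_slot');

-- Confirm it exists before the failover.
SELECT slot_name, slot_type, active FROM pg_replication_slots;

-- After pg_autoctl performs a failover, the slot is no longer
-- listed in pg_replication_slots on the old primary.
```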

If this is not intended behavior, please see the rough steps to reproduce, using this branch, at the bottom of this file.

Thank you! Rachel & @swati-nair

rheaton commented 2 years ago

@DimCitus If you have the time, can you take a look at this?

DimCitus commented 2 years ago

Hi @rheaton ; please see also https://github.com/citusdata/pg_auto_failover/issues/602.

rheaton commented 2 years ago

I'm a little confused; this seems to be the opposite issue, yeah? Or did we (pgaf) decide to delete all active replication slots during a failover?

DimCitus commented 2 years ago

Yeah, I'm as confused as you are. I suppose you're in front of an actual bug here: it's not the intended behaviour, and I could remember (and then find) an issue that shows our intentions/expectations. I think our approach in https://github.com/citusdata/pg_auto_failover/blob/93f995812dffef29695f6f0ce63c5e69f5a34a74/src/bin/pg_autoctl/pgsql.c#L1650-L1656 (and possibly other places) is a few bricks shy of a load, as they would say on -hackers. Can you think of a better way to approach it?

DimCitus commented 2 years ago

That said, our pattern is "^pgautofailoverstandby"; who would use such a pattern in their own replication slots, that they want pg_auto_failover to refrain from handling? ;-)
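For context, a sketch of the kind of pattern filter being discussed (not the exact query from pgsql.c); a name like bespoke_slot would not match it:

```sql
-- Sketch only: select slots whose names match the pg_auto_failover
-- prefix quoted above; 'bespoke_slot' would not be selected.
SELECT slot_name
  FROM pg_replication_slots
 WHERE slot_name ~ '^pgautofailoverstandby';
```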

rheaton commented 2 years ago

Yeah, that's why I am surprised by the behavior -- my slot is called bespoke_slot, so I am not sure how it's getting deleted!

DimCitus commented 2 years ago

Reviewing the code again, I just read the following, which might explain it. Could you dive into it?

https://github.com/citusdata/pg_auto_failover/blob/93f995812dffef29695f6f0ce63c5e69f5a34a74/src/bin/pg_autoctl/pgsql.c#L1527-L1529

rheaton commented 2 years ago

@DimCitus I've got a lot on my plate right now, but I will take a look in the first week of March! :)

rheaton commented 2 years ago

Hi @DimCitus. @jchampio and I spent some time digging into what is happening.

We observed that the pg_drop_replication_slot function was not being called on any slots not maintained by pg_auto_failover, so the query (though complicated) was not the culprit for the slot deletion. Additionally, the slot did not get removed every time there was a failover/switchover. The slot is only removed when pg_rewind determines that it has something to do. It ends up clearing the contents of a variety of folders, including pg_replslot/, which is how the slot is deleted.
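A small sketch of how to observe this, assuming the bespoke_slot from above (slots live on disk under the standard $PGDATA/pg_replslot/<slot_name> directory):

```sql
-- A slot exists both as a row in pg_replication_slots and as a
-- directory under $PGDATA/pg_replslot/<slot_name>. Since pg_rewind
-- clears pg_replslot/ when it runs, the row disappears with it.
SELECT count(*) FROM pg_replication_slots
 WHERE slot_name = 'bespoke_slot';  -- 1 before the rewind, 0 after
```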

We wouldn't consider this a pg_auto_failover bug, except for the statement in the other issue that the slots would be maintained. It might be a nice enhancement to the existing feature-set for pgaf to optionally maintain slots. An example of this type of behavior is in Patroni (see the slots: key). Their documentation isn't very easy to follow, but essentially they store, in their DCS, the slots to be maintained across a switchover.

DimCitus commented 2 years ago

Hi,

Thanks for the investigation! It seems like it's a bug in Postgres / pg_rewind then. I wonder whether the clearing of pg_replslot/ is necessary, and if so, whether it's done in a smart way. I suppose that if pg_rewind ends up in the past relative to some of those slots, then the slots are “corrupted” in a way, because they point to a future that won't exist anymore.
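To make that concrete: each slot pins a WAL position, visible as restart_lsn, and a rewind can leave that position dangling past the point of divergence:

```sql
-- Each slot pins a position in WAL. After pg_rewind rewinds the
-- data directory to the point of divergence, any restart_lsn past
-- that point refers to old-timeline WAL that no longer applies.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots;
```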

So how does Patroni handle the fact that the same LSN position may refer to a different history and different WAL on the new timeline after a rewind?

In the context of pg_auto_failover, can we just document the fact and call it a bug in Postgres, and then maybe get some Postgres core contributors interested in fixing it? Do you think we need to register all the slots on the monitor, update them once in a while (maybe with the existing protocol), and then implement a facility to re-install the slots after a pg_rewind? Or something else entirely, maybe... what would be your favorite way to handle the situation?
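For the last option, a minimal sketch of what re-installing a registered slot after pg_rewind could look like (the slot name is hypothetical; real code would iterate over whatever the monitor has registered):

```sql
-- Hypothetical post-rewind step: re-create a registered slot if
-- the rewind removed it. A SELECT with a bare WHERE clause only
-- calls the function when the slot is missing.
SELECT pg_create_physical_replication_slot('bespoke_slot')
 WHERE NOT EXISTS (SELECT 1 FROM pg_replication_slots
                    WHERE slot_name = 'bespoke_slot');
```

Note that a re-created slot starts empty: it does not retain the old restart_lsn, which ties back to the timeline question above.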