fix: report period keepalive message

KennyChenFight commented 5 months ago

This change addresses the following issue: when a PostgreSQL instance hosts multiple databases and one of them has low write traffic but also has a replication slot, that slot's position cannot advance. WAL then accumulates and cannot be removed, which affects the entire PostgreSQL instance.

For example, if the blue table keeps receiving writes while no write occurs on the pink table, the pink replication slot never gets a chance to advance, and all of the blue WAL events are left sitting around, taking up space.
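To make the arithmetic concrete, here is a minimal sketch of why one idle slot pins WAL for the whole instance. It assumes a simplified model in which the server keeps everything from the oldest slot position onward. The `slot` type, `retainedWAL` function, and the LSN values are hypothetical and not part of pgcapture or PostgreSQL.

```go
package main

import "fmt"

// slot is a hypothetical, simplified view of a replication slot.
type slot struct {
	name       string
	restartLSN uint64 // oldest WAL position this slot still needs
}

// retainedWAL returns how many bytes of WAL must be kept under the
// simplified assumption that retention is driven only by the slot
// that is furthest behind.
func retainedWAL(currentLSN uint64, slots []slot) uint64 {
	if len(slots) == 0 {
		return 0
	}
	oldest := slots[0].restartLSN
	for _, s := range slots[1:] {
		if s.restartLSN < oldest {
			oldest = s.restartLSN
		}
	}
	if oldest >= currentLSN {
		return 0
	}
	return currentLSN - oldest
}

func main() {
	// The busy "blue" slot is nearly caught up; the idle "pink" slot
	// has not advanced, so it decides how much WAL is retained.
	slots := []slot{
		{name: "blue", restartLSN: 9_000_000},
		{name: "pink", restartLSN: 1_000_000},
	}
	fmt.Println(retainedWAL(10_000_000, slots)) // prints 9000000
}
```

In this model, advancing the pink slot to the blue slot's position would drop the retained amount to 1000000 bytes, which is the effect the keepalive-based reporting is after.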

Common solutions include:

1. Creating a heart_beats table and writing to it regularly.
2. Using pg_logical_emit_message, which does not need an extra table.
3. Using the keepalive messages that PostgreSQL regularly sends to the client, which include the ServerWALEnd for the entire PostgreSQL instance.

I believe that solution 3 is simpler and more elegant. However, keepalive messages are very frequent, so this change adds ReportLSNThreshold: only when the replication lag reaches that threshold does PGXSource pass the corresponding keepalive message to PulsarSink, which then reports the position back to PostgreSQL.
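The threshold idea can be sketched roughly as follows. This is only an illustration, not the code in this PR: `shouldReport`, its parameters, and the 16 MiB value are hypothetical, and the lag is assumed to be the distance between the keepalive's ServerWALEnd and the last position reported back.

```go
package main

import "fmt"

// shouldReport decides whether a keepalive is worth forwarding.
// serverWALEnd is the WAL end carried by the keepalive, lastReported
// is the last position already reported back, and threshold plays
// the role of ReportLSNThreshold (all in bytes of WAL).
func shouldReport(serverWALEnd, lastReported, threshold uint64) bool {
	if serverWALEnd <= lastReported {
		return false // nothing new to report
	}
	return serverWALEnd-lastReported >= threshold
}

func main() {
	const threshold = 16 << 20 // hypothetical value: 16 MiB of lag
	last := uint64(0)

	// Simulated ServerWALEnd values from successive keepalives.
	for _, walEnd := range []uint64{1 << 20, 8 << 20, 20 << 20, 21 << 20} {
		if shouldReport(walEnd, last, threshold) {
			last = walEnd // forward the keepalive and remember the position
			fmt.Println("report", walEnd)
		}
	}
}
```

With these inputs only the third keepalive crosses the threshold, so a single position is reported and the frequent small advances are skipped.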

KennyChenFight commented 5 months ago

@benjamin99 After discussing with @rueian: the normal keepalive period is about 5 seconds, and if many changes are waiting to be consumed in the slot, there usually won't be any keepalive messages during that time. So we can try the version without a threshold and see whether the replication lag gets worse.

rueian commented 5 months ago

LGTM

replicase / pgcapture

fix: report period keepalive message #59