replicase / pgcapture

A scalable Netflix DBLog implementation for PostgreSQL
Apache License 2.0

fix: report period keepalive message #59

Closed KennyChenFight closed 5 months ago

KennyChenFight commented 5 months ago

This solution is primarily designed to address the following issue: when a PostgreSQL instance hosts multiple databases and one of them has low write traffic but also has a replication slot, that slot's position cannot advance. WAL then accumulates and cannot be cleared, which affects the entire PostgreSQL instance.

For example: [image] If the blue table keeps receiving writes while no write operations occur on the pink table, the pink replication slot never gets a chance to advance, and all of the blue WAL events are left sitting around, taking up space.

Common solutions include:

  1. Creating a heart_beats table and writing to it regularly.
  2. Using pg_logical_emit_message, which does not need an extra table.
  3. Using PostgreSQL's regular keepalive messages to the client, which include the ServerWALEnd for the entire PostgreSQL instance.
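To make option 3 concrete, here is a minimal sketch of reading ServerWALEnd out of a primary keepalive message. It assumes the layout described in the PostgreSQL streaming replication protocol documentation (a `k` byte, an 8-byte WAL end, an 8-byte server clock, and a 1-byte reply-requested flag, all big-endian). The type and function names are hypothetical and are not taken from pgcapture.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// PrimaryKeepalive is a hypothetical struct holding the fields of the
// 'k' message a walsender sends inside CopyData.
type PrimaryKeepalive struct {
	ServerWALEnd     uint64 // current end of WAL on the server (instance-wide)
	ServerTimeMicros int64  // server clock, microseconds since 2000-01-01
	ReplyRequested   bool   // server asks the client to reply promptly
}

// parsePrimaryKeepalive decodes an 18-byte keepalive payload.
// Assumed layout: 'k' | int64 walEnd | int64 clock | byte replyRequested.
func parsePrimaryKeepalive(buf []byte) (PrimaryKeepalive, error) {
	if len(buf) != 18 || buf[0] != 'k' {
		return PrimaryKeepalive{}, errors.New("not a primary keepalive message")
	}
	return PrimaryKeepalive{
		ServerWALEnd:     binary.BigEndian.Uint64(buf[1:9]),
		ServerTimeMicros: int64(binary.BigEndian.Uint64(buf[9:17])),
		ReplyRequested:   buf[17] != 0,
	}, nil
}

func main() {
	// Build a synthetic keepalive for illustration only.
	msg := make([]byte, 18)
	msg[0] = 'k'
	binary.BigEndian.PutUint64(msg[1:9], 0x16B3748)
	msg[17] = 1

	ka, err := parsePrimaryKeepalive(msg)
	fmt.Println(ka.ServerWALEnd, ka.ReplyRequested, err)
}
```

Because ServerWALEnd reflects the whole instance rather than one database, a consumer of a quiet slot can use it as a position to acknowledge even when no changes arrive for its own database.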

I believe that solution 3 is simpler and more elegant. However, because keepalive messages are very frequent, setting ReportLSNThreshold ensures that PGXSource passes the corresponding keepalive message to PulsarSink, which then reports back to PostgreSQL, only once the replication lag reaches that threshold.
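The threshold idea can be sketched roughly as follows. This is not the code in this PR: `keepaliveFilter` and `shouldForward` are hypothetical names, and the sketch assumes ReportLSNThreshold is a byte distance between the keepalive's ServerWALEnd and the last position forwarded downstream.

```go
package main

import "fmt"

// keepaliveFilter is a hypothetical helper, not pgcapture's actual type.
type keepaliveFilter struct {
	threshold    uint64 // assumed meaning of ReportLSNThreshold: minimum lag in bytes
	lastReported uint64 // last LSN that was forwarded downstream
}

// shouldForward returns true when the keepalive's ServerWALEnd is at least
// `threshold` bytes ahead of the last forwarded position, and records the
// new position in that case. Stale or small advances are dropped.
func (f *keepaliveFilter) shouldForward(serverWALEnd uint64) bool {
	if serverWALEnd <= f.lastReported {
		return false // nothing new to report
	}
	if serverWALEnd-f.lastReported < f.threshold {
		return false // lag is still below the threshold
	}
	f.lastReported = serverWALEnd
	return true
}

func main() {
	f := &keepaliveFilter{threshold: 1000}
	for _, lsn := range []uint64{500, 1200, 1500, 2300} {
		fmt.Println(lsn, f.shouldForward(lsn))
	}
}
```

With a threshold of 0 the filter forwards every keepalive that advances the position, which corresponds to running without a threshold.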

KennyChenFight commented 5 months ago

@benjamin99 After discussing with @rueian: the normal keepalive message period is about 5 seconds, and usually, if many changes are waiting to be consumed in the slot, no keepalive messages are sent during that time. So we can try the version without a threshold and see whether the replication lag gets worse.

rueian commented 5 months ago

LGTM