Closed KennyChenFight closed 5 months ago
@benjamin99 After discussing this with @rueian: the normal keepalive message period is about 5 seconds, and when many changes are waiting to be consumed in the slot, there usually won't be any keepalive messages during that time. So we can try the version without a threshold and see whether the replication lag gets worse.
LGTM
This solution is primarily designed to address the following issue: when a PostgreSQL instance hosts multiple databases and one of them has low write traffic but also has a replication slot, the position of that slot cannot advance. WAL then accumulates and cannot be cleared, which affects the entire PostgreSQL instance.
For example: if the blue table keeps receiving writes while no write occurs on the pink table, the pink replication slot never has a chance to advance, and all of the blue WAL events are left sitting around, taking up space.
Common solutions include:
I believe that solution 3 is simpler and more elegant. However, keepalive messages are very frequent, so we set ReportLSNThreshold: only when the replication lag reaches that threshold does PGXSource pass the corresponding keepalive message to PulsarSink, which then reports back to PostgreSQL.
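To make the threshold idea concrete, here is a minimal sketch of the filtering decision. It is not the code in this PR: `keepaliveFilter`, `ShouldForward`, and the field names are hypothetical, and it assumes the lag is measured as the byte distance between the WAL end position carried by the keepalive and the last position that was forwarded.

```go
package main

// LSN is a PostgreSQL WAL position expressed as a byte offset.
type LSN uint64

// keepaliveFilter is a hypothetical helper that decides which keepalive
// messages are worth forwarding to the sink.
type keepaliveFilter struct {
	threshold    LSN // assumed stand-in for ReportLSNThreshold, in bytes
	lastReported LSN // last position forwarded for reporting
}

// ShouldForward returns true when the keepalive's WAL end position is ahead
// of the last reported position by at least the threshold. On true, it
// records the new position as reported.
func (f *keepaliveFilter) ShouldForward(serverWALEnd LSN) bool {
	// Ignore keepalives that do not move the position forward.
	if serverWALEnd <= f.lastReported {
		return false
	}
	// Skip keepalives while the lag is still below the threshold.
	if serverWALEnd-f.lastReported < f.threshold {
		return false
	}
	f.lastReported = serverWALEnd
	return true
}
```

With a threshold of 0, every keepalive that advances the position is forwarded, which roughly corresponds to the "version without a threshold" discussed above.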