linkedin / venice

Venice, Derived Data Platform for Planet-Scale Workloads.
https://venicedb.org
BSD 2-Clause "Simplified" License

[admin-tool][server] Add new admin-tool command to dump heartbeat status from server; Scan and log stale HB replica #1275

Open sixpluszero opened 2 days ago

sixpluszero commented 2 days ago


The previous PR #1260 was too complex; this PR focuses only on heartbeat (HB) related usability improvements.

This PR adds two features related to heartbeat:

  1. Add a heartbeat scan thread that runs periodically and logs lagging resources (every minute by default, which should be infrequent enough not to spam the logs). The output can be picked up by other log collection systems, making it easy to detect which replica is lagging, on which host, and by how much.
  2. Add a command to dump heartbeat status from a host. It has three optional filters: a topic filter, a partition filter, and a lag filter. You can restrict the output to a specific topic or topic-partition, or to only the resources that are lagging. This serves as a manual fallback when (1) misses something.
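
To illustrate the two features, here is a minimal sketch of the filtering and periodic scan logic. It is not the code in this PR: the class `HeartbeatLagSketch`, the `ReplicaHeartbeat` type, the method names, and the log format are all hypothetical, and the lag is assumed to be the time elapsed since the last observed heartbeat timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class HeartbeatLagSketch {
  /** Hypothetical snapshot of one replica's last observed heartbeat. */
  public static final class ReplicaHeartbeat {
    final String topic;
    final int partition;
    final long lastHeartbeatMs;

    public ReplicaHeartbeat(String topic, int partition, long lastHeartbeatMs) {
      this.topic = topic;
      this.partition = partition;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }

    /** Lag is assumed to be "now minus last heartbeat", clamped at zero. */
    long lagMs(long nowMs) {
      return Math.max(0, nowMs - lastHeartbeatMs);
    }
  }

  /** Applies the three optional filters; a null filter means "do not filter". */
  public static List<ReplicaHeartbeat> filter(
      List<ReplicaHeartbeat> all, String topic, Integer partition, Long minLagMs, long nowMs) {
    List<ReplicaHeartbeat> out = new ArrayList<>();
    for (ReplicaHeartbeat hb : all) {
      if (topic != null && !topic.equals(hb.topic)) continue;
      if (partition != null && partition != hb.partition) continue;
      if (minLagMs != null && hb.lagMs(nowMs) < minLagMs) continue;
      out.add(hb);
    }
    return out;
  }

  /** Starts a single-threaded scan that logs replicas at or above the lag threshold. */
  public static ScheduledExecutorService startScan(
      Supplier<List<ReplicaHeartbeat>> source, long lagThresholdMs, long periodSeconds) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      long now = System.currentTimeMillis();
      for (ReplicaHeartbeat hb : filter(source.get(), null, null, lagThresholdMs, now)) {
        // Hypothetical log line; a real server would use its logging framework.
        System.out.printf("Stale heartbeat: topic=%s partition=%d lagMs=%d%n",
            hb.topic, hb.partition, hb.lagMs(now));
      }
    }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    return scheduler; // The caller is responsible for calling shutdown().
  }
}
```

Sharing one `filter` method between the scan and the dump command keeps the two views consistent: the scan is simply the dump with only the lag filter set. Calling `startScan(source, thresholdMs, 60)` would match the one-minute default described above.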

How was this PR tested?

Added a new integration test.

Does this PR introduce any user-facing changes?