linkedin / ambry

Distributed object store
https://github.com/linkedin/ambry/wiki
Apache License 2.0
1.74k stars 275 forks source link

Filter out offline peers when checking delete tombstones #2776

Closed justinlin-linkedin closed 4 months ago

justinlin-linkedin commented 4 months ago

Summary

When a peer went offline for a long time (X days), we should not consider its replication token when checking if a delete tombstone is valid or not.

Reason

The way we remove delete tombstone is fast, it keeps all the remote replication tokens from its peers, and compare offsets from tokens with the offset of the delete tombstone. This approach requires only local information so it's very fast, but it has some issues. When a peer went offline for a long time, it's remote replication token would not move at all. After a couple compactions, this peer's token offset would be less than most of delete tombstones' offsets. Thus all the delete tombstones can't compacted any more.

Change

Filter those offline replicas out. Making sure that we enable the remote replication token persistor so we would persist remote tokens every few hours(or minutes). This would give us a good picture on when was the last time remote peer was active.

If this active timestamp is longer than X days(configurable), we just ignore this peer.

Test

Added unit test

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 86.20690% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 67.34%. Comparing base (52ba813) to head (d9fe138). Report is 3 commits behind head on master.

Files Patch % Lines
...ava/com/github/ambry/store/BlobStoreCompactor.java 76.47% 3 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #2776 +/- ## ============================================ + Coverage 64.24% 67.34% +3.10% - Complexity 10398 11182 +784 ============================================ Files 840 840 Lines 71755 71807 +52 Branches 8611 8620 +9 ============================================ + Hits 46099 48360 +2261 + Misses 23004 20863 -2141 + Partials 2652 2584 -68 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

justinlin-linkedin commented 4 months ago

When the ignoring peer token happens, we need some visibility, either log or metrics are good for me. I'm approving this PR. Please add the metrics either in this PR or next PR.

Added logs in the new commit