linkedin / ambry

Distributed object store
https://github.com/linkedin/ambry/wiki
Apache License 2.0

[vcr-2.0] Fix VCR eager-delete policy #2860

Closed snalli closed 1 month ago

snalli commented 1 month ago

Encountered several BLOB_NOT_FOUND errors after enabling the eager-delete policy. There is no data loss, but the errors flood the logs until there is no space left on disk. There are two possible causes:

  1. Each replica thread has a thread-local cache where it stores the metadata of blobs it has requested from Azure. After it deletes a blob, it moves on without clearing the cache. When it encounters the same blob in another replica, the thread consults the cache, concludes that the blob is still present in Azure, and tries to delete it again. At this point, it receives a BLOB_NOT_FOUND from Azure and floods the logs with giant stack traces.
  2. The other possibility is a race with the background compaction thread, but this is unlikely given that the errors occur consistently across all hosts and threads.

The fix is to clear the cache when a thread updates the metadata of a blob while replicating a replica. If the thread encounters the same blob in another replica of the partition, it starts clean. We may take a small performance hit from the extra cache misses, but clearing any saved state in the cache guarantees some isolation between replicas.
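
For reviewers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of the stale-cache behavior and the invalidation described above. It is not the code in this PR: `FakeCloudStore`, `ReplicaThreadSketch`, and the `clearCacheOnUpdate` flag are hypothetical names made up for illustration, the cloud store is an in-memory set rather than Azure, and the sketch drops only the affected cache entry, which is an assumption about the granularity of the real change.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class EagerDeleteCacheSketch {

  /** Hypothetical in-memory stand-in for the cloud store; not Ambry's or Azure's API. */
  static final class FakeCloudStore {
    private final Set<String> blobs = new HashSet<>();
    int notFoundErrors = 0;

    void put(String blobId) {
      blobs.add(blobId);
    }

    boolean exists(String blobId) {
      return blobs.contains(blobId);
    }

    void delete(String blobId) {
      // Deleting a blob that is already gone models a BLOB_NOT_FOUND response.
      if (!blobs.remove(blobId)) {
        notFoundErrors++;
      }
    }
  }

  /** Hypothetical replica thread that caches "is this blob in the cloud?" per thread. */
  static final class ReplicaThreadSketch {
    private final ThreadLocal<Map<String, Boolean>> cache = ThreadLocal.withInitial(HashMap::new);
    private final FakeCloudStore store;
    private final boolean clearCacheOnUpdate;

    ReplicaThreadSketch(FakeCloudStore store, boolean clearCacheOnUpdate) {
      this.store = store;
      this.clearCacheOnUpdate = clearCacheOnUpdate;
    }

    /** Applies an eager delete for one blob as seen from one replica. */
    void replicateDelete(String blobId) {
      // Consult the thread-local cache first; query the store only on a miss.
      boolean present = cache.get().computeIfAbsent(blobId, store::exists);
      if (!present) {
        return; // nothing to delete
      }
      store.delete(blobId);
      if (clearCacheOnUpdate) {
        // Assumed shape of the fix: drop the saved state so that the next
        // replica that contains this blob starts from a fresh lookup.
        cache.get().remove(blobId);
      }
    }
  }

  /** Replays the same delete from two replicas and returns the number of not-found errors. */
  static int errorsAfterTwoReplicas(boolean clearCacheOnUpdate) {
    FakeCloudStore store = new FakeCloudStore();
    store.put("blob-1");
    ReplicaThreadSketch thread = new ReplicaThreadSketch(store, clearCacheOnUpdate);
    thread.replicateDelete("blob-1"); // first replica of the partition
    thread.replicateDelete("blob-1"); // same blob seen again in another replica
    return store.notFoundErrors;
  }

  public static void main(String[] args) {
    System.out.println("errors without invalidation: " + errorsAfterTwoReplicas(false));
    System.out.println("errors with invalidation:    " + errorsAfterTwoReplicas(true));
  }
}
```

In this sketch the second delete hits a stale "present" entry unless the entry is dropped after the update. With invalidation, the second pass pays one extra lookup against the store, which corresponds to the cache-miss cost mentioned above.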

Tested overnight in corp-cluster. No errors observed.