grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Compactor may fail to delete chunks in SSD mode, marker files are only stored locally #11119

Open — slim-bean opened this issue 1 year ago

slim-bean commented 1 year ago

Describe the bug

The retention (and delete request) process is two-phased: the compactor scans the index for chunks that should be deleted because their retention has expired, as well as chunks covered by delete requests.

It records these chunks in local files in a markers/ directory; the second phase reads these files and deletes the chunks from object storage.
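To make the mechanics concrete, here is a minimal sketch of that second phase in Go. It assumes, purely for illustration, that a marker file is a plain list of chunk keys (one per line); Loki's real on-disk marker format is different, but the key point is the same: the markers are purely local state on whichever node is currently running the compactor.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// processMarkers is a toy model of the compactor's second phase: walk the
// local markers/ directory and delete every chunk key recorded in it.
// The "one chunk key per line" file format is an assumption for illustration,
// not Loki's actual on-disk format.
func processMarkers(markersDir string, deleteChunk func(key string) error) error {
	return filepath.Walk(markersDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		s := bufio.NewScanner(f)
		for s.Scan() {
			if key := s.Text(); key != "" {
				if err := deleteChunk(key); err != nil {
					return err
				}
			}
		}
		if err := s.Err(); err != nil {
			return err
		}
		// The marker file is local-only: if this node stops being the elected
		// compactor before reaching this point, no other node ever sees it.
		return os.Remove(path)
	})
}

func main() {
	_ = processMarkers("markers", func(key string) error {
		fmt.Println("delete chunk from object storage:", key)
		return nil
	})
}
```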

Because SSD mode does leader election to choose a single Read or Backend node to run the compactor, it's possible for the compactor node to change while there are unprocessed chunk deletes in the marker files. Most likely those chunks will then never be deleted from object storage (unless that node later becomes the elected compactor again).

The second-phase process that deletes the chunks listed in marker files runs every minute; however, the setting `retention_delete_delay` determines when a marker file becomes eligible for processing.

The default for `retention_delete_delay` is 2h, which creates a two-hour window between when a marker file is created and when its contents are processed for deletion.
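For reference, a minimal compactor configuration sketch showing the settings involved (the values and the object-store name are illustrative, not recommendations):

```yaml
compactor:
  # Local directory under which the markers/ subdirectory is kept. In SSD mode
  # this lives on whichever Read/Backend node is currently the elected compactor.
  working_directory: /loki/compactor
  retention_enabled: true
  # Marker files are only processed for deletes once they are older than this.
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  # Store used for delete requests (illustrative value).
  delete_request_store: s3
```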

If the active compactor node in SSD mode changes permanently to a new node within this window, the chunks in those marker files will never be deleted.

Leader election is done using the hash ring Loki uses for most of its distributed-system operations: a ring is created for the compactors and shared via memberlist. Each compactor creates only one token and inserts it into the ring, so with one node it owns 100% of the ring, with two nodes 50% each, and so on. Whichever node owns key 0 in the ring is elected leader.

Therefore leader changes are probabilistic: a new leader is elected only if a newly joining node happens to generate a token that makes it the owner of key 0 once inserted into the ring.
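A simplified sketch of that election, assuming each node contributes exactly one random token and the owner of key 0 is simply the node holding the smallest token (the names and structure here are illustrative, not Loki's actual ring code):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// instance pairs a compactor node with the single ring token it registered.
type instance struct {
	addr  string
	token uint32
}

// ownerOfKeyZero returns the instance whose token is the successor of key 0
// on the ring, i.e. the instance with the smallest token; that node is the
// one allowed to run compaction.
func ownerOfKeyZero(instances []instance) instance {
	sort.Slice(instances, func(i, j int) bool { return instances[i].token < instances[j].token })
	return instances[0]
}

func main() {
	// Each Read/Backend node registers exactly one random token.
	nodes := []instance{
		{addr: "backend-0", token: rand.Uint32()},
		{addr: "backend-1", token: rand.Uint32()},
		{addr: "backend-2", token: rand.Uint32()},
	}
	fmt.Printf("compactor leader: %s\n", ownerOfKeyZero(nodes).addr)
}
```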

Note: if you run a single compactor StatefulSet with persistent storage, as in microservices mode, or just a single-binary Loki, you are not affected by this.

Workarounds:

slim-bean commented 1 year ago

Solutions:

Less ideal solutions:

duj4 commented 3 months ago

hi @slim-bean, is there any update on this issue? I am running Loki 3.1.0 in SSD mode (3 replicas). When I brought up the cluster, I noticed compactor-related logs as below:

[screenshots of compactor logs from backend-0, backend-1, and backend-2]

It seems backend-1 has been chosen as the pod that runs the compactor, but the compactor service on backend-2 also starts after the "stop"; may I know if this is normal?