Open · slim-bean opened this issue 1 year ago
hi @slim-bean, is there any update on this issue? I am running Loki 3.1.0 in SSD mode (replication factor 3). When I brought up the cluster, I noticed compactor-related logs like the below:
backend-0:
backend-1:
backend-2:
It seems backend-1 has been chosen as the pod that runs the compactor, but the compactor service on backend-2 also starts after the "stop". May I know if this is normal?
Describe the bug
The retention (and delete) process is two-phased: first, the compactor scans the index for chunks that should be deleted, either because their retention has expired or because of delete requests. It records these chunks in local files under a markers/ directory for the second phase, which reads those files and deletes the chunks from object storage.

Because SSD mode does leader election to choose a single Read or Backend node to run the compactor, it's possible for the compactor node to change while there are unprocessed chunk deletes in the marker files. Most likely those chunks will then never be deleted from object storage (unless that node becomes the leader-elected compactor again).
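To make the failure mode concrete, here is a minimal Go sketch of phase one. The names (writeMarkerFile, markerDir) and the file naming are hypothetical, not Loki's actual code or on-disk layout. The key point is that the marker file lives on the elected node's local disk, so a different node winning a later election never sees it:

```go
package marker

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// writeMarkerFile is a phase-one sketch: the compactor records chunks
// that are due for deletion in a marker file on its own local disk.
// The directory layout and file naming here are illustrative only.
func writeMarkerFile(markerDir string, chunkIDs []string) (string, error) {
	if err := os.MkdirAll(markerDir, 0o755); err != nil {
		return "", err
	}
	// One file per compaction pass, named by creation time.
	path := filepath.Join(markerDir, fmt.Sprintf("%d", time.Now().UnixNano()))
	f, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	for _, id := range chunkIDs {
		if _, err := fmt.Fprintln(f, id); err != nil {
			return "", err
		}
	}
	return path, nil
}
```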
The second-phase process that deletes the chunks listed in marker files runs every minute; however, the setting retention_delete_delay determines when a marker file becomes eligible for processing. The default for retention_delete_delay is 2h, which creates a two-hour window between a marker file being created and its contents being processed for deletion. If the active compactor node in SSD mode permanently changes to a new node within this window, the chunks in those marker files will never be deleted.
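And a matching sketch of the phase-two sweep, again with hypothetical names (sweepMarkers, deleteChunk) rather than Loki's real implementation. The age check is what opens the two-hour window described above:

```go
package sweeper

import (
	"os"
	"path/filepath"
	"strings"
	"time"
)

// Marker files become eligible for the delete sweep only once they are
// older than the configured delay (Loki's default is two hours).
const retentionDeleteDelay = 2 * time.Hour

// sweepMarkers is a phase-two sketch, run periodically on the elected
// compactor. deleteChunk stands in for the object-storage delete call.
func sweepMarkers(markerDir string, deleteChunk func(id string) error) error {
	entries, err := os.ReadDir(markerDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			return err
		}
		// Files younger than the delay are skipped; this is the window
		// in which a permanent leadership change orphans them.
		if time.Since(info.ModTime()) < retentionDeleteDelay {
			continue
		}
		path := filepath.Join(markerDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		for _, id := range strings.Fields(string(data)) {
			if err := deleteChunk(id); err != nil {
				return err
			}
		}
		if err := os.Remove(path); err != nil {
			return err
		}
	}
	return nil
}
```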
Leader election is done using the hash ring Loki uses for most of its distributed-system operations: a ring is created for the compactors and shared via memberlist. Each compactor creates exactly one token and inserts it into the ring; with one node it owns 100% of the ring, with two nodes 50% each, and so on. Whichever node owns key 0 in the ring is elected leader.
Leader changes are therefore probabilistic: a new leader is elected only when a newly joined node happens to generate a token that makes it the owner of key 0 once inserted. With n existing single-token compactors, a fresh uniformly random token has roughly a 1-in-(n+1) chance of landing below all the others and taking over key 0.
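Here is a minimal Go sketch of that election scheme, under the assumption (following the description above) that "owning key 0" means holding the first token clockwise from 0, i.e. the smallest token in the ring. Names and structure are illustrative, not Loki's actual ring code:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// member is one compactor's registration in the ring: its address and
// the single random token it inserted.
type member struct {
	addr  string
	token uint32
}

// leader returns the owner of key 0: walking clockwise from 0, the
// first token encountered is the smallest one in the ring.
func leader(ring []member) member {
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })
	return ring[0]
}

func main() {
	ring := []member{
		{"backend-0", rand.Uint32()},
		{"backend-1", rand.Uint32()},
		{"backend-2", rand.Uint32()},
	}
	fmt.Printf("compactor leader: %s\n", leader(ring).addr)
}
```

Each run draws fresh random tokens, so the elected leader varies from run to run; this mirrors why a node joining (or rejoining) with a newly generated token can randomly take over leadership.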
Note: if you run a single compactor StatefulSet with persistent storage, as in microservices mode, or just a single-binary Loki, you are not affected by this.
Workarounds: