grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Compactor may fail to delete chunks in SSD mode, marker files are only stored locally #11119

Open — slim-bean opened this issue 1 year ago

slim-bean commented 1 year ago

Describe the bug

The retention (and delete request) process is two-phased: the compactor scans the index for chunks that should be deleted because their retention has expired, as well as chunks covered by delete requests.

It records these chunks in local files in a markers/ directory; the second phase reads these files and deletes the chunks from object storage.
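To make the mechanics concrete, here is a minimal sketch of that second phase in Go. It assumes, purely for illustration, that a marker file is a plain list of chunk keys (one per line); Loki's real on-disk marker format is different, but the key point is the same: the markers are purely local state on whichever node is currently running the compactor.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// processMarkers is a toy model of the compactor's second phase: walk the
// local markers/ directory and delete every chunk key recorded in it.
// The "one chunk key per line" file format is an assumption for illustration,
// not Loki's actual on-disk format.
func processMarkers(markersDir string, deleteChunk func(key string) error) error {
	return filepath.Walk(markersDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		s := bufio.NewScanner(f)
		for s.Scan() {
			if key := s.Text(); key != "" {
				if err := deleteChunk(key); err != nil {
					return err
				}
			}
		}
		if err := s.Err(); err != nil {
			return err
		}
		// The marker file is local-only: if this node stops being the elected
		// compactor before reaching this point, no other node ever sees it.
		return os.Remove(path)
	})
}

func main() {
	_ = processMarkers("markers", func(key string) error {
		fmt.Println("delete chunk from object storage:", key)
		return nil
	})
}
```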

Because SSD mode does leader election to choose a single Read or Backend node to run the compactor, it's possible for the compactor node to change while there are unprocessed chunk deletes in the marker files. Most likely those chunks will then never be deleted from object storage (unless that node later becomes the elected compactor again).

The second-phase process that deletes the chunks listed in marker files runs every minute; however, the setting `retention_delete_delay` determines when a marker file becomes eligible for processing.

The default for `retention_delete_delay` is 2h, which creates a two-hour window between when a marker file is created and when its contents are processed for deletion.
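For reference, a minimal compactor configuration sketch showing the settings involved (the values and the object-store name are illustrative, not recommendations):

```yaml
compactor:
  # Local directory under which the markers/ subdirectory is kept. In SSD mode
  # this lives on whichever Read/Backend node is currently the elected compactor.
  working_directory: /loki/compactor
  retention_enabled: true
  # Marker files are only processed for deletes once they are older than this.
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  # Store used for delete requests (illustrative value).
  delete_request_store: s3
```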

If the active compactor node in SSD mode changes permanently to a new node within this window, the chunks in those marker files will never be deleted.

Leader election is done using the hash ring Loki uses for most of its distributed-system operations: a ring is created for the compactors and shared via memberlist. Each compactor creates only one token and inserts it into the ring, so with one node it owns 100% of the ring, with two nodes 50% each, and so on. Whichever node owns key 0 in the ring is elected leader.

Therefore leader changes are probabilistic: a new leader is elected only if a newly joining node happens to generate a token that makes it the owner of key 0 once inserted into the ring.
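A simplified sketch of that election, assuming each node contributes exactly one random token and the owner of key 0 is simply the node holding the smallest token (the names and structure here are illustrative, not Loki's actual ring code):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// instance pairs a compactor node with the single ring token it registered.
type instance struct {
	addr  string
	token uint32
}

// ownerOfKeyZero returns the instance whose token is the successor of key 0
// on the ring, i.e. the instance with the smallest token; that node is the
// one allowed to run compaction.
func ownerOfKeyZero(instances []instance) instance {
	sort.Slice(instances, func(i, j int) bool { return instances[i].token < instances[j].token })
	return instances[0]
}

func main() {
	// Each Read/Backend node registers exactly one random token.
	nodes := []instance{
		{addr: "backend-0", token: rand.Uint32()},
		{addr: "backend-1", token: rand.Uint32()},
		{addr: "backend-2", token: rand.Uint32()},
	}
	fmt.Printf("compactor leader: %s\n", ownerOfKeyZero(nodes).addr)
}
```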

Note: if you run a single compactor StatefulSet with persistent storage, as in microservices mode, or just a single-binary Loki, you are not affected by this.

Workarounds:

slim-bean commented 1 year ago

Solutions:

Less ideal solutions:

duj4 commented 3 months ago

hi @slim-bean, is there any update on this issue? I am running Loki 3.1.0 in SSD mode (3 replicas). When I brought up the cluster, I noticed compactor-related logs as below:

[screenshots of compactor logs from backend-0, backend-1, and backend-2]

It seems backend-1 has been chosen as the pod that runs the compactor, but the compactor service on backend-2 also starts after the "stop"; may I know if this is normal?