grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Log deletion does not work #11926

Open mac133k opened 6 months ago

mac133k commented 6 months ago

Describe the bug

Two deletion requests were submitted to the Loki delete API. The requests were received, and GET requests return their state as 'received', but that state has not changed for 5 days and the logs in question remain searchable. There has been no mention of the deletion requests in the Loki compactor logs since the submission was confirmed.

To Reproduce

1. Submit a deletion request to the Loki API using curl.
2. Check the Loki logs and the API for the request status.
3. Wait...
4. Search for the logs.
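
For reference, a minimal sketch of those first two steps against the compactor's delete API, assuming the compactor is reachable on localhost:3100 and using a placeholder tenant ID (adjust both for your deployment):

# Submit a delete request for a stream selector and time range (epoch seconds)
curl -g -X POST -H 'X-Scope-OrgID: tenant1' \
  'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080'

# List the tenant's delete requests and check their status ("received", "processed")
curl -s -H 'X-Scope-OrgID: tenant1' 'http://localhost:3100/loki/api/v1/delete'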

Expected behavior

The logs were expected to at least be removed from search results, and ideally from S3 storage as well.

Environment:

Please let me know if you need more information.

mhulscher commented 6 months ago

Using TSDB, we were able to delete logs by performing a curl to localhost, directly from the compactor pod:

curl -g -X PUT 'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080' -H 'X-Scope-OrgID:global'
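
For anyone running Loki on Kubernetes, a minimal sketch of issuing the same request without opening a shell in the pod; the loki namespace and the loki-compactor-0 pod name are placeholders, and this assumes curl is available in the compactor image:

# Run the delete request from inside the compactor pod via kubectl
kubectl exec -n loki loki-compactor-0 -- \
  curl -g -X PUT -H 'X-Scope-OrgID: global' \
  'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080'
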
mac133k commented 6 months ago

We also submit delete requests with curl, and they show up as 'received', but none have ever really worked in our PROD cluster. I tried deleting logs in the DEV cluster, which is much smaller in terms of the number of hosts and the volume of logs ingested, and the deletion only partially worked: when I requested deletion of logs over a 24h period and reran the query a few minutes later, I could see 8 gaps, each 1-2h wide. There were no further changes to the delete request state or the target data over the following days.

If anyone has ideas on how to investigate this, please suggest them.
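
One way to map out which sub-ranges were and were not deleted is to re-run the query window by window and see where data still comes back, for example with logcli; the address, tenant ID, and selector below are placeholders:

# Point logcli at the query frontend or gateway (placeholder address)
export LOKI_ADDR=http://localhost:3100

# Check a single 1h window of the 24h range that was supposed to be deleted;
# shift --from/--to and repeat to locate the remaining gaps
logcli query --org-id=tenant1 --limit=10 \
  --from="2024-01-29T00:00:00Z" --to="2024-01-29T01:00:00Z" \
  '{namespace="foo"}'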

mac133k commented 6 months ago

One thing that stands out is that a delete request can stay in the 'received' state for days or weeks without any further action. If there was a problem with the delete query, or no logs could be found for deletion, there should be an update or a change of state.
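
One thing worth ruling out when requests sit in 'received' indefinitely is whether the compactor actually has deletion enabled. The effective configuration can be dumped from the compactor's HTTP endpoint; localhost:3100 is a placeholder, and the relevant keys (retention_enabled and delete_request_store under compactor, deletion_mode under limits_config) vary somewhat between Loki versions:

# Dump the running config from the compactor and check the deletion-related settings
curl -s http://localhost:3100/config | grep -E 'retention_enabled|delete_request_store|deletion_mode'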

Starefossen commented 4 months ago

Seeing a similar issue in our cluster as well: requests are received, but no logs are being removed.

mac133k commented 3 months ago

@mhulscher The logs you successfully deleted: were they already saved in the chunk store, or still in the ingesters' RAM? Also, in your Loki cluster, was the chunk store set up on a local filesystem or on an external S3 service?

hieunguyen847 commented 3 months ago

I have the same issue with an S3 service.

iamjvn commented 2 months ago

We also need to be able to delete log entries. Please consider prioritizing a fix for this bug.

billmoling commented 1 month ago

Hi @sandeepsukhani and @MichelHollands,

I looked into the compactor/deletion code and found that you two are the major contributors. I'm tagging both of you in the hope that you could have a quick look at this issue and provide some insights if possible.

I also noticed that most of the pull requests related to the boltdb-shipper store were made about three years ago. Has the deletion feature been updated to work with TSDB since then?

Thank you!

jakubsikorski commented 2 weeks ago

We are also observing the same behavior: the log deletion request stays pending and is never completed, and no logs are deleted. Any update on this?

mac133k commented 1 week ago

Here is an interesting case: I recently discovered in one of our Loki clusters that a request to delete logs dated Jan 29 through Feb 6, submitted to the compactor API on Mar 28, was suddenly processed on Jun 22. Looking at the logs, there is no clear indication why the delete request was triggered on that particular day, 86 days after submission and 145 days from the start of the deletion time range.

That day (Jun 22) the compactors were processing tables for days dating back to the end of January and the beginning of February; however, the first reference to those index tables appeared in the logs about an hour after the first "Started processing delete request for user" message.

I am confused by my findings so far, but I can dig into it more if someone gives me hints.
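
For anyone digging into a similar timeline, one starting point is to pull the compactor logs from around the day the request finally ran and search for the message quoted above; the loki namespace and the label selector are placeholders for however the compactor is deployed:

# Fetch compactor logs from around the day the delete request was processed
kubectl logs -n loki -l app.kubernetes.io/component=compactor --since-time="2024-06-22T00:00:00Z" \
  | grep -i 'delete request'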