grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Log deletion does not work #11926

Open mac133k opened 6 months ago

mac133k commented 6 months ago

Describe the bug

Two deletion requests were submitted to the Loki delete API. The requests were received, and GET requests return their state as 'received', but that state has not changed for 5 days and the logs in question remain searchable. There has been no mention of the deletion requests in the Loki compactor logs since the submission was confirmed.

To Reproduce

1. Submit a deletion request to the Loki API using curl.
2. Check the Loki logs and the API for the request status.
3. Wait...
4. Search for the logs.
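
For reference, a minimal sketch of those first two steps against the compactor's delete API, assuming the compactor is reachable on localhost:3100 and using a placeholder tenant ID (adjust both for your deployment):

# Submit a delete request for a stream selector and time range (epoch seconds)
curl -g -X POST -H 'X-Scope-OrgID: tenant1' \
  'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080'

# List the tenant's delete requests and check their status ("received", "processed")
curl -s -H 'X-Scope-OrgID: tenant1' 'http://localhost:3100/loki/api/v1/delete'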

Expected behavior

The logs were expected to at least be removed from search results, and ideally from S3 storage as well.

Environment:

Please let me know if you need more information.

mhulscher commented 6 months ago

Using TSDB, we were able to delete logs by performing a curl to localhost, directly from the compactor pod:

curl -g -X PUT 'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080' -H 'X-Scope-OrgID:global'
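
For anyone running Loki on Kubernetes, a minimal sketch of issuing the same request without opening a shell in the pod; the loki namespace and the loki-compactor-0 pod name are placeholders, and this assumes curl is available in the compactor image:

# Run the delete request from inside the compactor pod via kubectl
kubectl exec -n loki loki-compactor-0 -- \
  curl -g -X PUT -H 'X-Scope-OrgID: global' \
  'http://localhost:3100/loki/api/v1/delete?query={namespace="foo"}&start=1706140800&end=1707916080'
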
mac133k commented 6 months ago

We also submit delete requests with curl, and they show up as 'received', but none have ever really worked in our PROD cluster. I tried deleting logs in the DEV cluster, which is much smaller in terms of the number of hosts and the volume of logs ingested, and the deletion only partially worked: when I requested deletion of logs over a 24h period and reran the query a few minutes later, I could see 8 gaps, each 1-2h wide. There were no further changes to the delete request state or the target data over the following days.

If anyone has ideas on how to investigate this, please suggest them.
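
One way to map out which sub-ranges were and were not deleted is to re-run the query window by window and see where data still comes back, for example with logcli; the address, tenant ID, and selector below are placeholders:

# Point logcli at the query frontend or gateway (placeholder address)
export LOKI_ADDR=http://localhost:3100

# Check a single 1h window of the 24h range that was supposed to be deleted;
# shift --from/--to and repeat to locate the remaining gaps
logcli query --org-id=tenant1 --limit=10 \
  --from="2024-01-29T00:00:00Z" --to="2024-01-29T01:00:00Z" \
  '{namespace="foo"}'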

mac133k commented 6 months ago

One thing that stands out is that a delete request can stay in the 'received' state for days or weeks without any further action. If there was a problem with the delete query, or no logs could be found for deletion, there should be an update or a change of state.
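
One thing worth ruling out when requests sit in 'received' indefinitely is whether the compactor actually has deletion enabled. The effective configuration can be dumped from the compactor's HTTP endpoint; localhost:3100 is a placeholder, and the relevant keys (retention_enabled and delete_request_store under compactor, deletion_mode under limits_config) vary somewhat between Loki versions:

# Dump the running config from the compactor and check the deletion-related settings
curl -s http://localhost:3100/config | grep -E 'retention_enabled|delete_request_store|deletion_mode'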

Starefossen commented 4 months ago

Seeing a similar issue in our cluster as well: requests are received, but no logs are being removed.

mac133k commented 3 months ago

@mhulscher The logs you successfully deleted: were they already saved in the chunk store, or still in the ingesters' RAM? Also, in your Loki cluster, was the chunk store set up on a local filesystem or on an external S3 service?

hieunguyen847 commented 3 months ago

I have the same issue with an S3 service.

iamjvn commented 2 months ago

We also need to be able to delete log entries. Please consider prioritizing a fix for this bug.

billmoling commented 1 month ago

Hi @sandeepsukhani and @MichelHollands,

I looked into the compactor/deletion code and found that you two are the major contributors. I'm tagging both of you in the hope that you could have a quick look at this issue and provide some insights if possible.

I also noticed that most of the pull requests related to the boltdb-shipper store were made about three years ago. Has the deletion feature been updated to work with TSDB since then?

Thank you!

jakubsikorski commented 2 weeks ago

We are also observing the same behavior: the log deletion request stays pending and is never completed, and no logs are deleted. Any update on this?

mac133k commented 1 week ago

Here is an interesting case: I recently discovered in one of our Loki clusters that a request to delete logs dated Jan 29 through Feb 6, submitted to the compactor API on Mar 28, was suddenly processed on Jun 22. Looking at the logs, there is no clear indication why the delete request was triggered on that particular day, 86 days after submission and 145 days from the start of the deletion time range.

That day (Jun 22) the compactors were processing tables for days dating back to the end of January and the beginning of February; however, the first reference to those index tables appeared in the logs about an hour after the first "Started processing delete request for user" message.

I am confused by my findings so far, but I can dig into it more if someone gives me hints.
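
For anyone digging into a similar timeline, one starting point is to pull the compactor logs from around the day the request finally ran and search for the message quoted above; the loki namespace and the label selector are placeholders for however the compactor is deployed:

# Fetch compactor logs from around the day the delete request was processed
kubectl logs -n loki -l app.kubernetes.io/component=compactor --since-time="2024-06-22T00:00:00Z" \
  | grep -i 'delete request'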