Abuelodelanada opened 2 years ago
This would save us a lot of trouble. I had opened an issue about a corrupt file in an index_19** folder whose size was too small, and realised a few days later that the /data mount was full. If we had this configurable option as a Loki user-facing parameter, it would save several users the trouble of creating cronjobs and scripts just for this!
This issue was prompted by this thread on the community slack, where we agreed with @09jvilla that we'd create a solution proposal for this new functionality before putting any time into providing a PR for it.
Thanks @Abuelodelanada for the proposal. I have gone through it and it looks like a great start :)
I have a few questions and clarifications
What is the rationale behind having it as a percentage rather than a plain integer size (like Prometheus does)? One pro I can think of is that with a percentage, if Loki hits any limits, we can resize the underlying file-system or any block-based storage without even updating (or restarting) Loki. But on the other side, calculating the percentage also comes with the cost of knowing and keeping track of changes in the underlying block storage (not sure how complex that is, tbh). It would be good to have the pros and cons of each approach, I think.
Can we make this config available just for the storage type filesystem, rather than global for all types of storage (-store.retention)?
I didn't fully understand what you meant by
This argument should be configurable to “fail” (and log errors or another mechanism) if it is in conflict with the retention_period
You mean the config (either size_percentage or the flag retention.size-percentage) can have the value "fail"? And what kind of behavior do we expect in that case? It's not clear to me what problem it is trying to solve. Can you help me understand?
I would also loop in people who are more familiar with the code that touches retention, compaction and deletion (@cyriltovena, @owen-d, @sandeepsukhani, @MasslessParticle and @MichelHollands) to get their opinions.
cc @slim-bean
I have a few questions and clarifications
- What is the rationale behind having it as a percentage rather than a plain integer size (like Prometheus does)? One pro I can think of is that with a percentage, if Loki hits any limits, we can resize the underlying file-system or any block-based storage without even updating (or restarting) Loki. But on the other side, calculating the percentage also comes with the cost of knowing and keeping track of changes in the underlying block storage (not sure how complex that is, tbh). It would be good to have the pros and cons of each approach, I think.
Actually, it's a little bit the opposite of this. From the essential perspective that Loki is "cloud native", we don't want to make any assumptions about the underlying block storage mechanism. On k8s, not all CSIs support resizing PVs, and until the PVC is changed, the PV may not be resizable without manual intervention anyway. This can be particularly bad in StatefulSets, where administrators are forced to do a weird "dance" of making a PVC by hand, claiming it so it's allocated, then getting Loki to re-claim it in the relatively narrow window the kube scheduler grants during a pod restart.
In addition, calculating the used percentage of any given block filesystem is a single statfs call, whereas calculating the "real" size implicitly carries the assumption that the calculation will need to be aware of how Loki keeps track of chunks on disk. Considering there's a tracking effort around using tsdb as a boltdb replacement, starting with the less complex approach at least resolves some longstanding RFEs/bugs for Loki users on block storage (as opposed to object storage) without tying maintainers into also keeping a feature which never quite made it into the plan up to date with any backend changes (or multiple, if tsdb and boltdb exist in the codebase simultaneously depending on the config options chosen).
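To make the statfs point concrete, a sketch of a used-percentage check (the mount path is just an assumption for illustration):

```go
// Sketch only: percent of used space on the filesystem backing a path,
// via a single statfs(2) call. The mount path is an assumption.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func usedPercent(path string) (float64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return (total - avail) / total * 100, nil
}

func main() {
	pct, err := usedPercent("/loki/chunks")
	if err != nil {
		panic(err)
	}
	fmt.Printf("used: %.1f%%\n", pct)
}
```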
Mostly, it's an approach chosen for simplicity and ease of maintenance, while also providing a solution which is more coherent than cronjobs that may delete blocks out from under Loki and cause queries to fail. Open to discussing the tradeoffs, though.
- Can we make this config available just for the storage type filesystem, rather than global for all types of storage (-store.retention)?
Sure, that's easy enough to suss out of the Loki config.
- I didn't fully understand what you meant by
This argument should be configurable to “fail” (and log errors or another mechanism) if it is in conflict with the retention_period
You mean the config (either size_percentage or the flag retention.size-percentage) can have the value "fail"? And what kind of behavior do we expect in that case? It's not clear to me what problem it is trying to solve. Can you help me understand?
The intention of that statement was more or less "Loki retention can be complicated, depending on chunk_store_config.max_look_back_period, table_manager.retention_period, and how many active tables there are". The "size-based retention" should be a well-behaved citizen. From the docs on retention, the following example priorities exist:
- All tenants except 29 and 30 in the dev namespace will have a retention period of 24h.
- All tenants except 29 and 30 that are not in the dev namespace will have a retention period of 744h.
- For tenant 29:
  - All streams except those in the container loki or in the namespace prod will have a retention period of 168h (1 week).
  - All streams in the prod namespace will have a retention period of 336h (2 weeks), even if the container label is loki, since the priority of the prod rule is higher.
  - Streams that have the container label loki but are not in the namespace prod will have a 72h retention period.
- For tenant 30:
  - All streams except those having the container label nginx will have the global retention period of 744h, since there is no override specified.
  - Streams that have the label nginx will have a retention period of 24h.
The design should do its best to follow these priorities, and log messages during operations. Having a single logger go haywire and start hammering the queue should not cause the size-based retention to "squeeze out" earlier logs, logs from streams with "higher" priorities, etc. In some ways, it's similar to ratios in Ceph OSDs. The proposal is a little vague on this, but we can probably agree that the worst-case scenario is some overly chatty application being deployed with debug logging and filling the entire chunk store with its own logs while "important" services get purged.
I doubt anyone knows these use cases and priorities (if any) better than the Loki maintainers, so it's kind of a "we did think about this possibility and think there should be some sort of safety valve, but the design parameters are open for discussion".
Thanks for the clarification @Abuelodelanada.
The "size-based retention" should be a well-behaved citizen.
Are we considering multi-tenancy in the first iteration? In that case, what happens if Loki just runs with a global config, say -retention_percent 50, and the disk hits 50%? What chunks will be deleted, and from which tenants? Or is it purely based on just how old the chunks are, independent of tenants?
And how does that behavior change if we support per-tenant config?
Basically, can you give more details about how exactly we delete chunks when we hit the threshold percentage (both in single-tenant and multi-tenant environments)?
I'm personally inclined to support (as a first iteration) just a global flag, ignore multi-tenancy, and delete chunks from the file-system and index purely based on how old the chunks are. This also avoids scanning all the chunks across all the tenants, I think. wdyt?
I'm personally inclined to support (as a first iteration) just a global flag, ignore multi-tenancy, and delete chunks from the file-system and index purely based on how old the chunks are. This also avoids scanning all the chunks across all the tenants, I think. wdyt?
Completely with you on skipping multi-tenancy for now. It should (ish) work if we, as you suggested, go by the age of the chunks. Each tenant would then do this cleanup on their own and only touch their own chunks. This might lead to data inconsistency of course, but that is inherently the case for filesystem regardless.
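To make that concrete, a minimal sketch of the purely age-based pruning (not Loki code): walk the chunk directory, sort files oldest-first by mtime, and delete until used space is back under the threshold. The usedPercent hook would be a statfs-style check like the one sketched earlier in the thread, and the directory layout is treated as opaque files, ignoring tenants and the index.

```go
// Minimal sketch (not Loki code) of an age-based first iteration.
package sizeretention

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

type chunkFile struct {
	path string
	mod  time.Time
}

func pruneByAge(chunkDir string, thresholdPct float64, usedPercent func(string) (float64, error)) error {
	var chunks []chunkFile
	err := filepath.Walk(chunkDir, func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		chunks = append(chunks, chunkFile{path: p, mod: info.ModTime()})
		return nil
	})
	if err != nil {
		return err
	}

	// Oldest chunks first, regardless of which tenant they belong to.
	sort.Slice(chunks, func(i, j int) bool { return chunks[i].mod.Before(chunks[j].mod) })

	for _, c := range chunks {
		pct, err := usedPercent(chunkDir)
		if err != nil {
			return err
		}
		if pct <= thresholdPct {
			return nil // back under the threshold, stop deleting
		}
		if err := os.Remove(c.path); err != nil {
			return err
		}
	}
	return nil
}
```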
On k8s, not all CSIs support resizing PVs, and until the PVC is changed, the PV may not be resizable without manual intervention anyway.
From my point of view, it's even more the opposite of this. What if someone shrinks a PV or PVC to the point where the Loki retention threshold suddenly exceeds the PV(C) capacity? With a relative measure like a percentage, this situation can't arise, which is good news. 👍🏼
delete chunks from file-system and index
Should this be interpreted as a suggestion not to use the deletion API in the first iteration? The idea was that it would make the delete procedure coherent with how the retention periods work already.
I like this idea a lot @Abuelodelanada and the writeup is excellent. Were you thinking of implementing this yourself?
@owen-d thanks for reviewing :)
Were you thinking of implementing this yourself?
@owen-d, yes, @Abuelodelanada and their colleagues had volunteered to implement it. I had just asked them to share their implementation plan before they got started so that the Loki maintainers could sign off on the approach. (You'll see the original context in this slack thread.)
Regarding what I wrote above, @owen-d, do you have anything to add? We'd much rather reuse as much as possible of already existing routines for clean-up and just add a separate trigger, hoping it will save you some maintenance down the road. Specifically around:
Should this be interpreted as a suggestion not to use the deletion API in the first iteration? The idea was that it would make the delete procedure coherent with how the retention periods work already.
Should this be interpreted as a suggestion not to use the deletion API in the first iteration? The idea was that it would make the delete procedure coherent with how the retention periods work already.
I think the crucial detail here (also related to @simskij's question just above) is: does using the deletion API (or just directly invoking the functions it calls) set a tombstone, with a later timer performing the "real" deletion?
The overlapping/sliding time resolution windows, especially around active/inactive tables, don't necessarily make it obvious whether a chunk will just be marked for deletion or really deleted (and the appropriate inodes/blocks freed). I suppose it's a small step to invoke the "real" deletion afterwards if that's a step which needs to be taken, but that would be the principal reason we'd want to avoid the API.
There are enough comments in some of the older issues that the API is, from a possibly naive design perspective, preferable, since it will at least adhere more closely to Loki's contract around data coherency than deleting chunks behind the scenes, which causes queries to complain about an object missing from storage, or leaves labels behind even though all of the entries they map to are actually gone from disk.
There are some caveats with the delete API and implementation:
Deletion happens as part of the retention process in the compaction lifecycle. That means a Loki that implements this feature will need to have a retention interval short enough to keep the disk from filling up. We've also seen issues where many deletes, or deletes that span a lot of data, can prevent compaction from completing.
We recently made several changes to the API so an operator can tune deletions to ensure compaction completion. The relevant PRs:
Even with these changes, if the size of your data is growing quickly enough, Loki may not be able to process deletes fast enough to keep your disk from filling up.
All that said, I think this should work. Retention-style deletes only require stream matchers and times. Those deletes run substantially faster than deletes with filters because we just drop the chunk.
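For reference, a retention-style delete of this kind is just stream matchers plus a time range against the delete endpoint. A rough sketch follows; the endpoint and params follow the deletion docs, while the host, tenant and matcher values are made up for illustration:

```go
// Rough sketch: issuing a retention-style delete (matchers + time range only,
// no line filter) against Loki's delete API. Host, tenant and matchers are
// placeholder values.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

func main() {
	params := url.Values{}
	params.Set("query", `{namespace="dev"}`) // stream matchers only
	params.Set("start", strconv.FormatInt(time.Now().Add(-30*24*time.Hour).Unix(), 10))
	params.Set("end", strconv.FormatInt(time.Now().Add(-7*24*time.Hour).Unix(), 10))

	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:3100/loki/api/v1/delete?"+params.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Scope-OrgID", "fake") // tenant ID; placeholder value

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("delete request status:", resp.Status)
}
```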
Even with these changes, if the size of your data is growing quickly enough, Loki may not be able to process deletes fast enough to keep your disk from filling up.
I agree that this might become an issue, although I would say that the root causes and their resolution are likely a bit disconnected from the feature proposal.
I'm sure I've missed a scenario, but:
In both cases the outcome would be either loss of visibility due to recent logs being pushed out or, if Loki can't keep up, running out of disk space and eventually crashing.
In both cases, my gut feeling tells me that it should be solved with alert rules and reconfiguration of the storage rather than by Loki itself.
One thing to consider, however (which probably should be a separate ticket), is whether Loki should continue to try to write to disk (and crash) if the disk is full, or whether it should deny incoming requests but stay alive, meaning it would still be able to run compaction as usual.
Even with these changes, if the size of your data is growing quickly enough, Loki may not be able to process deletes fast enough to keep your disk from filling up.
I'm less worried about keeping up, b/c we'd be pruning the index + dropping chunks from the file system, so there would be no chunk rewriting, which is the real compaction bottleneck.
cc @sandeepsukhani Can you take a look here? What do you think about using the deletion API vs another way to handle this?
cc @sandeepsukhani Can you take a look here? What do you think about using the deletion API vs another way to handle this?
I think it would be better to do it without using the deletion API. If I were to implement it, I think I would leverage the same code used for retention and deletion, which would basically mean satisfying this interface here and hooking it into the compactor. The problem here, though, is that the actual chunks are not deleted immediately; they are deleted after 2 hours by default, controlled by this config. We can solve this separately if you think this approach makes sense.
The other option would be to have a separate code path which uses OpenCompactedIndex, which would give you access to a ChunkIterator for iterating through chunks so that you can just delete them and remove or update the index file at the end.
This would reuse most of the code, keep it simple, and delete the chunks instantly.
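Very roughly, the flow could look like the sketch below. The ChunkEntry/CompactedIndex shapes are simplified placeholders for illustration only, not the actual types behind OpenCompactedIndex and ChunkIterator, and the iterator is assumed to yield entries oldest-first:

```go
// Very rough sketch of the "iterate, delete, update index" flow only.
package sizeretention

type ChunkEntry struct {
	ChunkID string
	Size    int64 // bytes
}

type CompactedIndex interface {
	// ForEachChunk calls fn for every chunk entry; returning deleted=true
	// tells the index to drop that entry when it is rewritten.
	ForEachChunk(fn func(ChunkEntry) (deleted bool, err error)) error
	// Flush persists the updated index after chunks were removed.
	Flush() error
}

// deleteOldestUntil drops chunks (assumed to be yielded oldest-first) and
// removes them from the index until roughly bytesToFree bytes are reclaimed,
// deleting chunk objects immediately rather than waiting for a delayed sweep.
func deleteOldestUntil(idx CompactedIndex, deleteChunk func(id string) error, bytesToFree int64) error {
	var freed int64
	err := idx.ForEachChunk(func(e ChunkEntry) (bool, error) {
		if freed >= bytesToFree {
			return false, nil
		}
		if err := deleteChunk(e.ChunkID); err != nil {
			return false, err
		}
		freed += e.Size
		return true, nil
	})
	if err != nil {
		return err
	}
	return idx.Flush()
}
```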
The interface definitely looks like it provides a lot of functionality. From a quick look, it seems as if reusing what's already present with ExpirationChecker (as in the interface fulfilled through a NewExpirationChecker) in combination with a ChunkClient or chunk client.Client is a good place to start, but please let me know if that's a blind alley.
The approach I mentioned in my second comment looks more appealing to me, since it gives more control over the flow and lets you delete the chunks instantly instead of having to wait for some time. But please feel free to share your thoughts. You can also try playing around with the code to get a feel for it.
Hi @sandeepsukhani @owen-d
I opened a WIP PR for this: https://github.com/grafana/loki/pull/7927
I would love to have your comments, since I have some doubts :-)
Is your feature request related to a problem? Please describe.
Currently, Loki users who are not using object-based storage may accidentally fill all available space on a disk or PVC with logs on block-based storage. This is especially common with Loki running in small ad hoc k8s clusters like microk8s.
Loki provides endpoints for deleting logs based on dates or queries, but no “automatic” method for ensuring that block storage remains within a defined threshold.
If block storage fills, Loki will crash (#2314) and require manual intervention. Several issues have been opened requesting a feature to manage this, as existing workarounds (cronjobs, scheduled runners, etc.) require rebuilding indexes. (See this and this.)
Describe the solution you'd like
New config option: size_based_retention_percentage

Similar to --storage.tsdb.retention.size in Prometheus, it should be possible to programmatically configure a size-based retention policy for Loki. Based on the complete local example, a new option size_based_retention_percentage can be added.

New argument: -store.size_based_retention_percentage
The Loki binary already has a time-based retention argument. Following this example, the size_based_retention_percentage option in the config file can be translated into -store.size_based_retention_percentage.
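For illustration, the wiring could follow the RegisterFlags pattern used throughout the Loki codebase; the option and flag names below come from this proposal, everything else is an assumption:

```go
// Hypothetical wiring for the new option, following the RegisterFlags pattern.
// The struct and field names are taken from this proposal, not existing code.
package sizeretention

import "flag"

type Config struct {
	SizeBasedRetentionPercentage float64 `yaml:"size_based_retention_percentage"`
}

func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.Float64Var(&cfg.SizeBasedRetentionPercentage,
		"store.size_based_retention_percentage", 0,
		"Prune the oldest chunks once the chunk filesystem exceeds this percentage of used space. 0 disables size-based retention.")
}
```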
Possible Implementation
Loki’s chunk-tool already contains much of the logic necessary to determine the “real” storage usage of given chunks, and the datetimes for each. This could be extended and integrated into an experimental argument to Loki which runs a Ticker. The Ticker can check the disk usage for block volumes, if present in the configuration, and if they exceed the threshold, calculate the date/time ranges to delete via the endpoint to bring usage back under the threshold.

This Ticker could potentially expose a new metric on how fast disk usage is growing as well (loki_chunk_store_bytes_minutes), and emit logs from Loki itself if the log growth rate is sufficiently high to repeatedly trigger pruning. Optionally, behavior similar to Ceph (to determine how aggressive pruning should be) may be possible.
This argument should be configurable to “fail” (and log errors or another mechanism) if it is in conflict with the retention_period.

Tentative deletion process
- Estimated growth rate: (actual_disk_usage - last_disk_usage) / minutes_between_reads
- Amount of data to delete: (actual_disk_usage_percentage - size_based_retention_percentage) * disk_size
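A rough sketch of how the Ticker, the growth-rate metric, and the two formulas above could fit together; all hooks (usage sampling, pruning, metric export) are placeholders, and percentages are treated on a 0-100 scale, hence the division by 100:

```go
// Sketch tying the proposal together: a Ticker samples disk usage, derives the
// growth rate from the first formula, and when the threshold is exceeded
// computes the bytes to reclaim from the second formula. Hooks are placeholders.
package sizeretention

import "time"

// growthBytesPerMinute implements
// (actual_disk_usage - last_disk_usage) / minutes_between_reads.
func growthBytesPerMinute(actual, last float64, interval time.Duration) float64 {
	return (actual - last) / interval.Minutes()
}

// bytesToReclaim implements
// (actual_disk_usage_percentage - size_based_retention_percentage) * disk_size.
func bytesToReclaim(usedPct, thresholdPct, diskSizeBytes float64) float64 {
	return (usedPct - thresholdPct) / 100 * diskSizeBytes
}

// runLoop periodically re-evaluates usage and triggers pruning when needed.
func runLoop(stop <-chan struct{}, interval time.Duration, thresholdPct float64,
	sample func() (usedBytes, totalBytes float64, err error),
	prune func(bytesToFree float64) error,
	reportGrowth func(bytesPerMinute float64)) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var lastUsed float64
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			used, total, err := sample()
			if err != nil || total == 0 {
				continue // a real implementation would log here
			}
			if lastUsed > 0 {
				reportGrowth(growthBytesPerMinute(used, lastUsed, interval))
			}
			lastUsed = used

			if usedPct := used / total * 100; usedPct > thresholdPct {
				_ = prune(bytesToReclaim(usedPct, thresholdPct, total))
			}
		}
	}
}
```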
Describe alternatives you've considered
We thought about implementing an external process to delete old logs based on the size of the volume, like this and this, but it seems too hacky for a professional use case.
Additional context
This is a WIP feature proposal, please feel free to share your thoughts!