Abuelodelanada opened 2 years ago
This would save us a lot of trouble. I had opened an issue about a corrupt file in an index_19** folder whose size was too small, and realised a few days later that the /data mount was full. If we had this configurable option as a Loki user-facing parameter, it would save several users the trouble of creating cronjobs and scripts just for this!
This issue was prompted by this thread on the community slack, where we agreed with @09jvilla that we'd create a solution proposal for this new functionality before putting any time into providing a PR for it.
Thanks @Abuelodelanada for the proposal. I have gone through it and it looks like a great start :)
I have a few questions and clarifications
What is the rationale behind having it as a percentage rather than a plain integer size (like Prometheus does)? One pro I can think of is that with a percentage, if Loki hits any limits, we can resize the underlying file-system or any block-based storage without even updating (or restarting) Loki. But on the other side, calculating the percentage also comes with the cost of knowing and keeping track of changes in the underlying block storage (not sure how complex that is, tbh). It would be good to have the pros and cons of each approach, I think.
Can we make this config available just for the storage type filesystem, rather than global for all types of storage (-store.retention)?
I didn't fully understand what you meant by
This argument should be configurable to “fail” (and log errors or another mechanism) if it is in conflict with the retention_period
You mean the config (either size_percentage or the flag retention.size-percentage) can have the value "fail"? And what kind of behavior do we expect in that case? It's not clear to me what problem it is trying to solve. Can you help me understand?
I would also loop in people who are more familiar with the code that touches retention, compaction and deletion (@cyriltovena, @owen-d, @sandeepsukhani, @MasslessParticle and @MichelHollands) to get their opinions.
cc @slim-bean
I have a few questions and clarifications
- What is the rationale behind having it as a percentage rather than a plain integer size (like Prometheus does)? One pro I can think of is that with a percentage, if Loki hits any limits, we can resize the underlying file-system or any block-based storage without even updating (or restarting) Loki. But on the other side, calculating the percentage also comes with the cost of knowing and keeping track of changes in the underlying block storage (not sure how complex that is, tbh). It would be good to have the pros and cons of each approach, I think.
Actually, it's a little bit the opposite of this. From the essential perspective that Loki is "cloud native", we don't want to make any assumptions about the underlying block storage mechanism. On k8s, not all CSIs support resizing PVs, and until the PVC is changed, the PV may not be resizable without manual intervention anyway. This can be particularly bad in StatefulSets, where administrators are forced to do a weird "dance" of making a PVC by hand, claiming it so it's allocated, then getting Loki to re-claim it in the relatively narrow window the kube scheduler grants during a pod restart.
In addition, calculating the used percentage of any given block filesystem is a single statfs call, whereas calculating the "real" size implicitly carries the assumption that the calculation will need to be aware of how Loki keeps track of chunks on disk. Considering there's a tracking effort around using tsdb as a boltdb replacement, starting with the less complex approach at least resolves some longstanding RFEs/bugs for Loki users on block storage (as opposed to object storage) without tying maintainers into also keeping a feature which never quite made it into the plan up to date with any backend changes (or multiple, if tsdb and boltdb exist in the codebase simultaneously depending on the config options chosen).
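To make the statfs point concrete, a sketch of a used-percentage check (the mount path is just an assumption for illustration):

```go
// Sketch only: percent of used space on the filesystem backing a path,
// via a single statfs(2) call. The mount path is an assumption.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func usedPercent(path string) (float64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return (total - avail) / total * 100, nil
}

func main() {
	pct, err := usedPercent("/loki/chunks")
	if err != nil {
		panic(err)
	}
	fmt.Printf("used: %.1f%%\n", pct)
}
```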
Mostly, it's an approach chosen for simplicity and ease of maintenance, while also providing a solution which is more coherent than cronjobs that may delete blocks out from under Loki and cause queries to fail. Open to discussing the tradeoffs, though.
- Can we make this config available just for the storage type filesystem, rather than global for all types of storage (-store.retention)?
Sure, that's easy enough to suss out of the Loki config.
- I didn't fully understand what you meant by
This argument should be configurable to “fail” (and log errors or another mechanism) if it is in conflict with the retention_period
You mean the config (either size_percentage or the flag retention.size-percentage) can have the value "fail"? And what kind of behavior do we expect in that case? It's not clear to me what problem it is trying to solve. Can you help me understand?
The intention of that statement was more or less "Loki retention can be complicated, depending on chunk_store_config.max_look_back_period, table_manager.retention_period, and how many active tables there are". The "size-based retention" should be a well-behaved citizen. From the docs on retention, the following example priorities exist:
- All tenants except 29 and 30 in the dev namespace will have a retention period of 24h.
- All tenants except 29 and 30 that are not in the dev namespace will have a retention period of 744h.
- For tenant 29:
  - All streams except those in the container loki or in the namespace prod will have a retention period of 168h (1 week).
  - All streams in the prod namespace will have a retention period of 336h (2 weeks), even if the container label is loki, since the priority of the prod rule is higher.
  - Streams that have the container label loki but are not in the namespace prod will have a 72h retention period.
- For tenant 30:
  - All streams except those having the container label nginx will have the global retention period of 744h, since there is no override specified.
  - Streams that have the label nginx will have a retention period of 24h.
The design should do its best to follow these priorities, and log messages during operations. Having a single logger go haywire and start hammering the queue should not cause the size-based retention to "squeeze out" earlier logs, logs from streams with "higher" priorities, etc. In some ways, it's similar to ratios in Ceph OSDs. The proposal is a little vague on this, but we can probably agree that the worst-case scenario is some overly chatty application being deployed with debug logging and filling the entire chunk store with its own logs while "important" services get purged.
I doubt anyone knows these use cases and priorities (if any) better than the Loki maintainers, so it's kind of a "we did think about this possibility and think there should be some sort of safety valve, but the design parameters are open for discussion".
Thanks for the clarification @Abuelodelanada.
The "size-based retention" should be a well-behaved citizen.
Are we considering multi-tenancy in the first iteration? In that case, what happens if Loki just runs with a global config, say -retention_percent 50, and the disk hits 50%? What chunks will be deleted, and from which tenants? Or is it purely based on just how old the chunks are, independent of tenants?
And how does that behavior change if we support per-tenant config?
Basically, can you give more details about how exactly we delete chunks when we hit the threshold percentage (both in single-tenant and multi-tenant environments)?
I'm personally inclined to support (as a first iteration) just a global flag, ignore multi-tenancy, and delete chunks from the file-system and index purely based on how old the chunks are. This also avoids scanning all the chunks across all the tenants, I think. wdyt?
I'm personally inclined to support (as a first iteration) just a global flag, ignore multi-tenancy, and delete chunks from the file-system and index purely based on how old the chunks are. This also avoids scanning all the chunks across all the tenants, I think. wdyt?
Completely with you on skipping multi-tenancy for now. It should (ish) work if we, as you suggested, go by the age of the chunks. Each tenant would then do this cleanup on their own and only touch their own chunks. This might lead to data inconsistency of course, but that is inherently the case for filesystem regardless.
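To make that concrete, a minimal sketch of the purely age-based pruning (not Loki code): walk the chunk directory, sort files oldest-first by mtime, and delete until used space is back under the threshold. The usedPercent hook would be a statfs-style check like the one sketched earlier in the thread, and the directory layout is treated as opaque files, ignoring tenants and the index.

```go
// Minimal sketch (not Loki code) of an age-based first iteration.
package sizeretention

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

type chunkFile struct {
	path string
	mod  time.Time
}

func pruneByAge(chunkDir string, thresholdPct float64, usedPercent func(string) (float64, error)) error {
	var chunks []chunkFile
	err := filepath.Walk(chunkDir, func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		chunks = append(chunks, chunkFile{path: p, mod: info.ModTime()})
		return nil
	})
	if err != nil {
		return err
	}

	// Oldest chunks first, regardless of which tenant they belong to.
	sort.Slice(chunks, func(i, j int) bool { return chunks[i].mod.Before(chunks[j].mod) })

	for _, c := range chunks {
		pct, err := usedPercent(chunkDir)
		if err != nil {
			return err
		}
		if pct <= thresholdPct {
			return nil // back under the threshold, stop deleting
		}
		if err := os.Remove(c.path); err != nil {
			return err
		}
	}
	return nil
}
```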
On k8s, not all CSIs support resizing PVs, and until the PVC is changed, the PV may not be resizable without manual intervention anyway.
From my point of view, it's even more the opposite of this. What if someone shrinks a PV or PVC to the point where the Loki retention threshold suddenly exceeds the PV(C) capacity? With a relative measure like a percentage, this situation can't arise, which is good news. 👍🏼
delete chunks from file-system and index
Should this be interpreted as a suggestion not to use the deletion API in the first iteration? The idea was that it would make the delete procedure coherent with how the retention periods work already.
I like this idea a lot @Abuelodelanada and the writeup is excellent. Were you thinking of implementing this yourself?
@owen-d thanks for reviewing :)
Were you thinking of implementing this yourself?
@owen-d, yes, @Abuelodelanada and their colleagues had volunteered to implement it. I had just asked them to share their implementation plan before they got started so that the Loki maintainers could sign off on the approach. (You'll see the original context in this slack thread.)
Regarding what I wrote above, @owen-d, do you have anything to add? We'd much rather reuse as much as possible of already existing routines for clean-up and just add a separate trigger, hoping it will save you some maintenance down the road. Specifically around:
Should this be interpreted as a suggestion not to use the deletion API in the first iteration? The idea was that it would make the delete procedure coherent with how the retention periods work already.
Should this be interpreted as a suggestion not to use the deletion API in the first iteration? The idea was that it would make the delete procedure coherent with how the retention periods work already.
I think the crucial detail here (also related to @simskij's question just above) is: does using the deletion API (or just directly invoking the functions it calls) set a tombstone, with a later timer performing the "real" deletion?
The overlapping/sliding time resolution windows, especially around active/inactive tables, don't necessarily make it obvious whether a chunk will just be marked for deletion or really deleted (and the appropriate inodes/blocks freed). I suppose it's a small step to invoke the "real" deletion afterwards if that's a step which needs to be taken, but that would be the principal reason we'd want to avoid the API.
There are enough comments in some of the older issues that the API is, from a possibly naive design perspective, preferable, since it will at least adhere more closely to Loki's contract around data coherency than deleting chunks behind the scenes, which causes queries to complain about an object missing from storage, or leaves labels behind even though all of the entries they map to are actually gone from disk.
There are some caveats with the delete API and implementation:
Deletion happens as part of the retention process in the compaction lifecycle. That means a Loki that implements this feature will need to have a retention interval short enough to keep the disk from filling up. We've also seen issues where many deletes, or deletes that span a lot of data, can prevent compaction from completing.
We recently made several changes to the API so an operator can tune deletions to ensure compaction completion. The relevant PRs:
Even with these changes, if the size of your data is growing quickly enough, Loki may not be able to process deletes fast enough to keep your disk from filling up.
All that said, I think this should work. Retention-style deletes only require stream matchers and times. Those deletes run substantially faster than deletes with filters because we just drop the chunk.
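For reference, a retention-style delete of this kind is just stream matchers plus a time range against the delete endpoint. A rough sketch follows; the endpoint and params follow the deletion docs, while the host, tenant and matcher values are made up for illustration:

```go
// Rough sketch: issuing a retention-style delete (matchers + time range only,
// no line filter) against Loki's delete API. Host, tenant and matchers are
// placeholder values.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

func main() {
	params := url.Values{}
	params.Set("query", `{namespace="dev"}`) // stream matchers only
	params.Set("start", strconv.FormatInt(time.Now().Add(-30*24*time.Hour).Unix(), 10))
	params.Set("end", strconv.FormatInt(time.Now().Add(-7*24*time.Hour).Unix(), 10))

	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:3100/loki/api/v1/delete?"+params.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Scope-OrgID", "fake") // tenant ID; placeholder value

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("delete request status:", resp.Status)
}
```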
Even with these changes, if the size of your data is growing quickly enough, Loki may not be able to process deletes fast enough to keep your disk from filling up.
I agree that this might become an issue, although I would say that the root causes and their resolution are likely a bit disconnected from the feature proposal.
I'm sure I've missed a scenario, but:
In both cases the outcome would be either loss of visibility due to recent logs being pushed out or, if Loki can't keep up, running out of disk space and eventually crashing.
In both cases, my gut feeling tells me that it should be solved with alert rules and reconfiguration of the storage rather than by Loki itself.
One thing to consider, however (which probably should be a separate ticket), is whether Loki should continue to try to write to disk (and crash) if the disk is full, or whether it should deny incoming requests but stay alive, meaning it would still be able to run compaction as usual.
Even with these changes, if the size of your data is growing quickly enough, Loki may not be able to process deletes fast enough to keep your disk from filling up.
I'm less worried about keeping up, b/c we'd be pruning the index + dropping chunks from the file system, so there would be no chunk rewriting, which is the real compaction bottleneck.
cc @sandeepsukhani Can you take a look here? What do you think about using the deletion API vs another way to handle this?
cc @sandeepsukhani Can you take a look here? What do you think about using the deletion API vs another way to handle this?
I think it would be better to do it without using the deletion API. If I were to implement it, I think I would leverage the same code used for retention and deletion, which would basically mean satisfying this interface here and hooking it into the compactor. The problem here, though, is that the actual chunks are not deleted immediately; they are deleted after 2 hours by default, controlled by this config. We can solve this separately if you think this approach makes sense.
The other option would be to have a separate code path which uses OpenCompactedIndex, which would give you access to a ChunkIterator for iterating through chunks so that you can just delete them and remove or update the index file at the end.
This would reuse most of the code, keep it simple, and delete the chunks instantly.
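Very roughly, the flow could look like the sketch below. The ChunkEntry/CompactedIndex shapes are simplified placeholders for illustration only, not the actual types behind OpenCompactedIndex and ChunkIterator, and the iterator is assumed to yield entries oldest-first:

```go
// Very rough sketch of the "iterate, delete, update index" flow only.
package sizeretention

type ChunkEntry struct {
	ChunkID string
	Size    int64 // bytes
}

type CompactedIndex interface {
	// ForEachChunk calls fn for every chunk entry; returning deleted=true
	// tells the index to drop that entry when it is rewritten.
	ForEachChunk(fn func(ChunkEntry) (deleted bool, err error)) error
	// Flush persists the updated index after chunks were removed.
	Flush() error
}

// deleteOldestUntil drops chunks (assumed to be yielded oldest-first) and
// removes them from the index until roughly bytesToFree bytes are reclaimed,
// deleting chunk objects immediately rather than waiting for a delayed sweep.
func deleteOldestUntil(idx CompactedIndex, deleteChunk func(id string) error, bytesToFree int64) error {
	var freed int64
	err := idx.ForEachChunk(func(e ChunkEntry) (bool, error) {
		if freed >= bytesToFree {
			return false, nil
		}
		if err := deleteChunk(e.ChunkID); err != nil {
			return false, err
		}
		freed += e.Size
		return true, nil
	})
	if err != nil {
		return err
	}
	return idx.Flush()
}
```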
The interface definitely looks like it provides a lot of functionality. From a quick look, it seems as if reusing what's already present with ExpirationChecker (as in the interface fulfilled through a NewExpirationChecker) in combination with a ChunkClient or chunk client.Client is a good place to start, but please let me know if that's a blind alley.
The approach I mentioned in my second comment looks more appealing to me, since it gives more control over the flow and lets you delete the chunks instantly instead of having to wait for some time. But please feel free to share your thoughts. You can also try playing around with the code to get a feel for it.
Hi @sandeepsukhani @owen-d
I opened a WIP PR for this: https://github.com/grafana/loki/pull/7927
I would love to have your comments, since I have some doubts :-)
Is your feature request related to a problem? Please describe.
Currently, Loki users who are not using object-based storage may accidentally fill all available space on a disk or PVC with logs on block-based storage. This is especially common with Loki running in small ad hoc k8s clusters like microk8s.
Loki provides endpoints for deleting logs based on dates or queries, but no “automatic” method for ensuring that block storage remains within a defined threshold.
If block storage fills, Loki will crash (#2314) and require manual intervention. Several issues have been opened requesting a feature to manage this, as existing workarounds (cronjobs, scheduled runners, etc.) require rebuilding indexes. (See this and this.)
Describe the solution you'd like
New config option: size_based_retention_percentage

Similar to --storage.tsdb.retention.size in Prometheus, it should be possible to programmatically configure a size-based retention policy for Loki. Based on the complete local example, a new option size_based_retention_percentage can be added.

New argument: -store.size_based_retention_percentage
The Loki binary already has a time-based retention argument. Following this example, the size_based_retention_percentage option in the config file can be translated into -store.size_based_retention_percentage.
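For illustration, the wiring could follow the RegisterFlags pattern used throughout the Loki codebase; the option and flag names below come from this proposal, everything else is an assumption:

```go
// Hypothetical wiring for the new option, following the RegisterFlags pattern.
// The struct and field names are taken from this proposal, not existing code.
package sizeretention

import "flag"

type Config struct {
	SizeBasedRetentionPercentage float64 `yaml:"size_based_retention_percentage"`
}

func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.Float64Var(&cfg.SizeBasedRetentionPercentage,
		"store.size_based_retention_percentage", 0,
		"Prune the oldest chunks once the chunk filesystem exceeds this percentage of used space. 0 disables size-based retention.")
}
```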
Possible Implementation
Loki’s chunk-tool already contains much of the logic necessary to determine the “real” storage usage of given chunks, and the datetimes for each. This could be extended and integrated into an experimental argument to Loki which runs a Ticker. The Ticker can check the disk usage for block volumes, if present in the configuration, and if they exceed the threshold, calculate the date/time ranges to delete via the endpoint to bring usage back under the threshold.

This Ticker could potentially expose a new metric on how fast disk usage is growing as well (loki_chunk_store_bytes_minutes), and emit logs from Loki itself if the log growth rate is sufficiently high to repeatedly trigger pruning. Optionally, behavior similar to Ceph (to determine how aggressive pruning should be) may be possible.
This argument should be configurable to “fail” (and log errors or another mechanism) if it is in conflict with the retention_period.

Tentative deletion process
- Estimated growth rate: (actual_disk_usage - last_disk_usage) / minutes_between_reads
- Amount of data to delete: (actual_disk_usage_percentage - size_based_retention_percentage) * disk_size
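A rough sketch of how the Ticker, the growth-rate metric, and the two formulas above could fit together; all hooks (usage sampling, pruning, metric export) are placeholders, and percentages are treated on a 0-100 scale, hence the division by 100:

```go
// Sketch tying the proposal together: a Ticker samples disk usage, derives the
// growth rate from the first formula, and when the threshold is exceeded
// computes the bytes to reclaim from the second formula. Hooks are placeholders.
package sizeretention

import "time"

// growthBytesPerMinute implements
// (actual_disk_usage - last_disk_usage) / minutes_between_reads.
func growthBytesPerMinute(actual, last float64, interval time.Duration) float64 {
	return (actual - last) / interval.Minutes()
}

// bytesToReclaim implements
// (actual_disk_usage_percentage - size_based_retention_percentage) * disk_size.
func bytesToReclaim(usedPct, thresholdPct, diskSizeBytes float64) float64 {
	return (usedPct - thresholdPct) / 100 * diskSizeBytes
}

// runLoop periodically re-evaluates usage and triggers pruning when needed.
func runLoop(stop <-chan struct{}, interval time.Duration, thresholdPct float64,
	sample func() (usedBytes, totalBytes float64, err error),
	prune func(bytesToFree float64) error,
	reportGrowth func(bytesPerMinute float64)) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var lastUsed float64
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			used, total, err := sample()
			if err != nil || total == 0 {
				continue // a real implementation would log here
			}
			if lastUsed > 0 {
				reportGrowth(growthBytesPerMinute(used, lastUsed, interval))
			}
			lastUsed = used

			if usedPct := used / total * 100; usedPct > thresholdPct {
				_ = prune(bytesToReclaim(usedPct, thresholdPct, total))
			}
		}
	}
}
```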
Describe alternatives you've considered
We thought about implementing an external process to delete old logs based on the size of the volume, like this and this, but it seems too hacky for a professional use case.
Additional context
This is a WIP feature proposal, please feel free to share your thoughts!