splitice opened this issue 2 years ago
@liguozhong since you are on the performance drive, perhaps this would be of interest to you. It's due to your great work that chunk fetching latency is now the weakest link for us. :)
Maybe your workload is similar
Second this. What we have found time and time again is that, no matter how overpowered your storage is (up to and including fancy all-flash storage attached via InfiniBand), tiny files always give bad performance.
I haven't noticed too many small files in S3 before. If this feature can be done by someone, we will definitely use it. I am looking forward to the Loki team implementing a chunk compactor.
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
Go away stalebot
I was observing Mimir's bucket, hoping that Loki would have the same functionality as Mimir. 🥺
Mimir has successfully solved a similar problem: bucket_compactor.go reorganizes the bucket objects every 12 or 24 hours, so the number of objects and the bucket bytes stay very compact.
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
Get lost stale bot
+1
+1
Hi, are there any plans to provide chunk compaction?
This feature is much needed.
+1
Hello,
my apps can't produce enough logs to create a good-sized chunk before pod/node rotation. E.g. I've set `max_chunk_age` and `chunk_idle_period` to 12 hours, but my app rotates after 2 hours (autoscaling in Kubernetes).
This is why, IMO, this feature is a really good idea.
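For context, both settings mentioned above live under the `ingester` block of the Loki config; a minimal sketch of the situation described (the values are taken from the comment, everything else is illustrative):

```yaml
# Sketch of the ingester settings mentioned above (illustrative only).
ingester:
  max_chunk_age: 12h       # force-flush a chunk after 12h regardless of size
  chunk_idle_period: 12h   # flush a chunk once it has seen no new entries for 12h
# With pods living only ~2h, each stream stops receiving data long before
# either limit matters, so the flushed chunks stay small.
```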
has this been implemented?
No, this does not exist.
```
$ ll chunks/fake/ | wc -l
1323523
```
We have an insane number of subdirectories in the chunks directory.
I think Loki could use a database-like file structure, keeping the index and chunks in one single file or a few files, rather than one stream per file.
any update here?
Solved this problem on my side. Anyone facing this problem should first of all check their labels and streams and carefully read https://grafana.com/docs/loki/latest/get-started/labels/bp-labels/ (the "Use dynamic labels sparingly" section).
You can check your current labels using `logcli series --analyze-labels '{}'`.
The Prometheus metric `rate(loki_ingester_chunks_flushed_total[1m])` is also a good way to understand what's happening.
Here is my before:
Total Streams: 2240
Unique Labels: 17
| Label Name | Unique Values | Found In Streams |
| --- | --- | --- |
| thread | 103 | 2087 |
| task_name | 80 | 2170 |
| job_name | 68 | 2170 |
| filename | 50 | 97 |
| host | 38 | 99 |
| namespace | 7 | 2170 |
| service | 6 | 72 |
| project | 6 | 72 |
| dc | 6 | 2170 |
| logger | 5 | 2143 |
| severity | 3 | 2137 |
| allocation_id | 2 | 3 |
| source | 2 | 29 |
| environment | 2 | 41 |
| role | 2 | 41 |
| app | 1 | 2137 |
| agent | 1 | 68 |
And after moving the `thread` label into the message instead:
Total Streams: 255
Unique Labels: 16
| Label Name | Unique Values | Found In Streams |
| --- | --- | --- |
| task_name | 80 | 188 |
| job_name | 68 | 188 |
| filename | 49 | 94 |
| host | 38 | 96 |
| namespace | 7 | 188 |
| project | 6 | 69 |
| dc | 6 | 188 |
| service | 6 | 69 |
| logger | 5 | 161 |
| severity | 3 | 154 |
| source | 2 | 29 |
| role | 2 | 38 |
| allocation_id | 2 | 4 |
| environment | 2 | 38 |
| agent | 1 | 65 |
| app | 1 | 154 |
The amount of logs has not changed.
Also some tuning of the Loki config (changes only):
```yaml
compactor:
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 20m
ingester:
  chunk_idle_period: 60m
  chunk_retain_period: 60m
  max_chunk_age: 4h  # I think this is too big but it is OK for me now
  chunk_target_size: 54857600
```
And the results for the number of files in MinIO storage (before and after all this tuning):
+1
The absence of this makes S3 less suitable as the Loki store.
> Solved this problem on my side. Anyone facing this problem should first of all check their labels and streams...
Good for you man. But this problem is present in correctly configured (at least I think they are) Loki instances as well. For example, when you have hundreds or thousands of short-lived pods daily (easy when you have batch jobs), you can rack up files on S3 quickly. Why? Because the pod name itself is a great example of Loki's anti-pattern! Most people want at least namespace, pod and container in Loki's index, mostly because you have no guarantee of any other metadata being present.
+1. We have a lot of small batch jobs that generate very few logs and then go idle, but they're logged in different tenants so can't be combined easily. And even if we did solve this, the 100+ million existing small chunk files would still be there forever.
I've worked around this issue by using structured metadata. Structured metadata + bloom filters works really nice.
> I've worked around this issue by using structured metadata. Structured metadata + bloom filters works really nice.
Can you describe your method in more detail?
You need to define labels that are not indexed but instead moved to structured metadata in your alloy/agent config file. Below is a fragment of my config file that does that and also removes all labels that I don't really need.
```
loki.process "pods" {
  forward_to = [loki.write.default.receiver]

  stage.structured_metadata {
    values = {
      pod       = "",
      container = "",
    }
  }

  stage.label_drop {
    values = [ "filename", "stream", "pod", "container" ]
  }
}
```
As I mentioned above, I have a lot of small pods created daily that don't write that much log data. With this change, all my data is indexed only behind the namespace label, but I can query individual pods and containers like this:
`{namespace="$namespace"} | pod="$pod"`
Thanks to this, the number of streams basically collapsed. Here's how many files per second were created before and after this change (ignore 22/04/2024 - I was playing around and Loki was down; look after 24/04/2024):
This alone will work well for most small clusters, but if you have a lot of data you may want to enable the experimental bloom filter queries described in the documentation. They will make these queries fly by reading only the lines containing the pods you queried for, instead of reading everything from the queried namespace and only then filtering by pod.
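For reference, structured metadata requires a v13 schema on the TSDB index and must be allowed in the limits. A minimal sketch of the Loki side, assuming an S3 object store (the schema date is illustrative; check the documentation for your Loki version):

```yaml
# Illustrative prerequisites for using structured metadata in Loki.
schema_config:
  configs:
    - from: 2024-04-01       # example rollover date, pick your own
      store: tsdb            # structured metadata needs the TSDB index
      object_store: s3
      schema: v13            # and schema v13 or newer
      index:
        prefix: index_
        period: 24h
limits_config:
  allow_structured_metadata: true
# The experimental bloom filter components are configured separately;
# see the Loki docs for the options available in your release.
```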
Is your feature request related to a problem? Please describe.
Too many small chunks in S3; this cannot be solved by continuing to increase the idle timeout, due to the huge memory increase that setting results in.
With some queries needing to fetch 90,000 chunks (50-100 big chunks, 89,900+ smaller ones), these smaller chunks can be the bottleneck for many queries. Quite often these smaller chunks exist because their source has infrequent bursts of activity. It would be far more ideal if instead <1,000 good-sized chunks (still enough to parallelize over multiple cores, and closer to the number of streams) were queried.
Describe the solution you'd like
A utility similar to the compactor (or built into it?) that is able to create new chunks by merging small chunks (i.e. <10KB, which is 95%+ of our dataset) that had been flushed due to the idle period (but where matching data arrived later).
Fetching these chunks is particularly expensive, and most of the query time is spent downloading chunks. It might also improve compression ratios (if blocks are rebuilt).
Placing this in the compactor might be a good idea since the index is already being updated at that time.
This compactor should get a setting like `sync_period` to bound the combine search; for most people this should be the same value as the indexer's `sync_period`. Chunk max size would still need to be honoured, of course, so the result would be larger chunks, not just one chunk. Something like:
- New chunks should be entirely new (new IDs), and old chunks should be removed `index_cache_validity` after the index containing only the new chunks is updated (to prevent cached indexes from accessing the now non-existent chunks).
- If the chunk compactor exits uncleanly (or has any similar issue), unreferenced chunks may end up in the chunk store. AFAIK this is possible currently regardless and is probably a separate matter.
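To make the proposal concrete, here is a purely hypothetical configuration sketch; none of these options exist in Loki today, and the block and option names are invented for illustration only:

```yaml
# Hypothetical options for the proposed chunk compactor (not real Loki settings).
chunk_compactor:
  sync_period: 15m        # bound the search for chunks to combine;
                          # ideally the same value as the indexer's sync_period
  min_chunk_size: 10KB    # only merge chunks smaller than this
  max_chunk_size: 1536KB  # the existing chunk size limit is still honoured
  delete_delay: 15m       # remove old chunks index_cache_validity after the
                          # index referencing only the new chunks is updated
```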
Describe alternatives you've considered
Increasing `chunk_idle_period` (currently 6m) further. 10m was tested, however it resulted in too much memory being consumed.
Screenshot showing the issue with 1 week retention:
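For reference, this is the knob in question; a minimal sketch showing only the values mentioned above (the rest of the ingester config is omitted):

```yaml
ingester:
  chunk_idle_period: 6m    # current setting: flush a chunk after 6m with no new entries
  # chunk_idle_period: 10m # tested, but ingester memory usage grew too much
```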
May resolve #1258, #4296