charris-ca opened 1 year ago
Sounds like your ingesters aren't part of the memberlist ring. What does /ring look like from your ingesters?
/ring looks fine,
Another thing I noticed: query_ingesters_within is set to 3h by default, so you will probably want to adjust chunk_idle_period accordingly. In general you should prefer chunks to be flushed closer to their target size, so look to align chunk_idle_period with query_ingesters_within at a reasonable value (I'd say the default 3h is fine for most use cases).
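For reference, a minimal sketch of where those two settings live, assuming the Loki 2.x config layout (values shown are the documented defaults, not taken from this thread):

```yaml
querier:
  # Queriers ask ingesters for any data newer than this window (default 3h).
  query_ingesters_within: 3h

ingester:
  # Chunks with no new entries for this long are flushed to object storage (default 30m).
  chunk_idle_period: 30m
```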
I had chunk_idle_period at the default (30m), and at 2 hours previously during testing, and was seeing the same errors. Could I set query_ingesters_within to a very low number so that queries go to the object store sooner? That still doesn't explain why the ingesters don't all have the same data, or why their results aren't combined when the query returns.
Perhaps someone with more experience can comment on this as well, but I try to line up query_ingesters_within with both chunk_idle_period and max_chunk_age. You can make the querier go to object storage sooner (you'd want to adjust both the idle period and the max age, of course), but I wouldn't recommend it, because it's both faster and more efficient not to have to query the object store that early. That would also partly explain why your ingesters return inconsistent logs: you have 5 ingesters, each of them flushes idle chunks at a different time, so it's conceivable that results differ every time you query as well. I'd say give it a try by lining up those configurations and see if it helps.
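A minimal sketch of that alignment, assuming the Loki 2.x config layout (using 3h across the board purely as an illustration):

```yaml
querier:
  query_ingesters_within: 3h   # queriers only ask ingesters for data newer than this

ingester:
  chunk_idle_period: 3h        # flush chunks that have received no new data for this long
  max_chunk_age: 3h            # force-flush any chunk older than this, idle or not
```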
I tested setting those to 3 hours, but I'm still getting the same problem. I also tested setting query_store_only=true, and the logs did not show up at all for 10-15 minutes, then popped up.
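For anyone reproducing that test, query_store_only sits under the querier block; a minimal sketch, assuming the Loki 2.x layout:

```yaml
querier:
  # When true, queriers skip ingesters entirely and only read chunks already flushed to object storage.
  query_store_only: true
```

With ingesters skipped, logs only become visible once their chunks have been flushed and uploaded, which is consistent with the 10-15 minute delay described above.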
I think you're on to something with the ring communication though. Is it possible that the queries are hitting the memberlist for a specific ingester but being load-balanced back to the same ingester, rather than collecting the logs from all ingesters?
What does /ring look like from your querier?
Looks normal; those are all the correct internal IPs.
@charris-ca I had a possibly related issue with Loki 2.9 + Promtail 2.9 using the local filesystem storage. After a version upgrade, logs from all feeds started to disappear in the window roughly between 2 hours and 2 days old. I tried various config tweaks, alas to no avail. Eventually I reverted back to Loki 2.6 and the issue disappeared.
Describe the bug
Logs are being ingested correctly and uploaded to our GCP bucket correctly, but at query time the results are inconsistent. Generally I see about ~50% of the logs show up for the first 25 minutes. After that, the same query returns 100% of the logs. In the window between 15-25 minutes, a query will vary between the ~50% and 100% result on each execution.
I have increased all of the timeout/query-related values that I have found. I have also reduced chunk_idle_period to 1m, but I'm still seeing the issue. I have also turned off all memcached instances and am not using the results cache for testing.
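For context, the timeout settings usually involved here live in a couple of different blocks; a minimal sketch assuming the Loki 2.x config layout (values are illustrative, not the reporter's actual settings):

```yaml
server:
  http_server_read_timeout: 5m    # raise if long-running queries are cut off at the HTTP layer
  http_server_write_timeout: 5m

limits_config:
  query_timeout: 5m               # per-query timeout enforced on the query path
```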
If I restart all Loki pods, logs will completely disappear, but they re-appear 15-25 minutes after the log was ingested.
I have tracing set up, and things appear to be breaking at /logproto.Querier/QuerySample on the querier component. I'm getting a context cancelled error (screenshot below).
To Reproduce
Steps to reproduce the behavior:
requestedCardBrand_seq14
Expected behavior
Logs should be directly queryable right after they are ingested.
Environment:
Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.
Log sample of subsequent queries returning different results is below. The first log line returns total_entries=3, the third returns total_entries=10, and the fifth returns total_entries=3 again.
Loki config:
Showing 6 entries:
Same query 20 minutes later, showing 11 entries: