Based on your report, I think there are two distinct issues to investigate:
According to the stack trace you shared, the SIGBUS is triggered by this call:
```
github.com/grafana/mimir/pkg/util/activitytracker.(*ActivityTracker).Insert(0xc000997270, 0xc00a9b9420)
```
What we do here is write to the activity tracker log file. The file is mmap-ed: it's a file on disk that gets mapped into memory, and we write to it each time a query is run.
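For the curious, here's a minimal sketch of the failure mode (this is not Mimir's actual activity tracker code, and the path and size are made up): the file is extended with Truncate, which creates a sparse file, so disk blocks are only allocated when a mapped page is first written; if that allocation fails because the disk is full, there is no error return path and the kernel delivers SIGBUS.

```go
package main

import (
	"fmt"
	"os"
	"syscall" // Linux-specific mmap constants; Mimir ingesters run on Linux
)

func main() {
	// Create the log file and extend it to the desired size. Truncate makes a
	// sparse file: no disk blocks are allocated for the new range yet.
	f, err := os.OpenFile("/data/activity.log", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const size = 1 << 20 // 1 MiB, arbitrary for the example
	if err := f.Truncate(size); err != nil {
		panic(err)
	}

	// Map the file into memory. From here on, writes to `data` are plain
	// memory writes, so there is no error value to check.
	data, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	// Touching a page of the sparse file for the first time forces the kernel
	// to allocate a disk block. If the filesystem is full, that allocation
	// fails and the process receives SIGBUS -- which is what surfaces in the
	// ActivityTracker.Insert stack trace above.
	copy(data, []byte("query started"))
	fmt.Println("write succeeded (the disk still had space)")
}
```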
You mentioned that the issue doesn't happen if you started Mimir just recently. Mimir keeps the last 24h of data on the ingesters' disk (by default), which makes me think you have actually exhausted the disk space when this issue happens, so I would start investigating from (2).
Have you had a chance to look inside the disk and see what is actually taking up the space there?
@pracucci Thanks for the detailed explanation! I took a further look into the disk usage - it was in fact the WAL taking up much more space than I originally anticipated. I have two follow-up questions on this:
Although I can understand why it crashed, I would have hoped Mimir would handle the full-disk error gracefully, similar to the WAL. In my setup, once an ingester panics, it won't come back up due to the insufficient disk space, and it was a bit cumbersome to investigate the disk usage. The panic log was a clear indication that something went completely off, but I feel a panic may be a bit too extreme.
If this could be a potential enhancement, I'd be keen to take a stab at the actual implementation as well ☺️
I can see blocks_storage.tsdb.retention_period is set to 24h by default, but is this the right one, and are there other flags I should be adjusting?
For my use case, I would like to send the data to S3 or other object storage rather quickly, but I couldn't figure out from the docs which set of configurations should be updated... It would be great if you could provide some pointers on what to look for / be aware of 🙏
- Should Mimir panic when there is no disk space for the activity tracker logs?
Ideally no. The activity tracker is not an essential feature, so it shouldn't crash the process if there's no space left on disk. Because of this, I'm open to a fix, unless it significantly complicates the code (see below).
That being said, is an exhausted disk really different from an out-of-memory issue? Generally speaking, the system has exhausted a non-compressible resource (e.g. memory or disk) and the process can't work as expected anymore. The process crashes if memory is exhausted, and I personally prefer processes that make it very loud when an essential resource has been exhausted, so I don't even think it's that bad.
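For the sake of discussion, one possible shape for such a fix (a rough sketch only, not how the activity tracker is implemented today; openActivityLog is a hypothetical helper) would be to preallocate the file's blocks before mmap-ing it, so a full disk surfaces as an ordinary error the tracker can react to, e.g. by disabling itself, instead of a SIGBUS at write time:

```go
package main

import (
	"log"
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

// openActivityLog is a hypothetical helper, not Mimir code. It reserves the
// file's disk blocks up front, so disk exhaustion is reported as an error
// here rather than as a SIGBUS later, when the mmap-ed pages are written.
func openActivityLog(path string, size int64) ([]byte, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// Fallocate with mode 0 allocates real blocks, unlike Truncate which only
	// creates a sparse file. On a full disk this returns ENOSPC.
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		return nil, err
	}
	return syscall.Mmap(int(f.Fd()), 0, int(size),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
}

func main() {
	data, err := openActivityLog("/data/activity.log", 1<<20)
	if err != nil {
		// The caller could fall back to a no-op tracker instead of crashing.
		log.Printf("activity tracker disabled: %v", err)
		return
	}
	defer syscall.Munmap(data)
	copy(data, []byte("query started"))
}
```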
- What are the flags to be adjusted for data retention setup?
The short answer is to reduce blocks_storage.tsdb.retention_period to a value not lower than 13h.
If you want to reduce it to a value lower than 13h, there are other configuration values to fine-tune to get the system working correctly on the read path (otherwise you will see gaps in your queries, because by default we query only the last 12h of data from ingesters). We generally strongly recommend running Mimir with the default config (which has been tuned based on Grafana Labs' experience) and just increasing the disk size if possible :)
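To sketch what that looks like in YAML (treat the exact field names and their location as version-dependent, and double-check them against the configuration reference for your Mimir release):

```yaml
blocks_storage:
  tsdb:
    retention_period: 13h      # how long ingesters keep blocks on local disk

# Only needed if you go below 13h: the read-path cut-offs that decide when
# queries hit ingesters vs. object storage must be lowered together with
# retention_period, otherwise recent time ranges show gaps.
querier:
  query_ingesters_within: 13h  # how far back queries are still sent to ingesters
  query_store_after: 12h       # how recent the data queried from store-gateways can be
```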
Sorry for my delayed response, thanks for all the details!
I agree that the disk space error is similar to OOM, and the error being loud and clear makes sense. While I do understand services failing to become "ready", failing with a panic here seems a bit too crude IMHO. It was quite difficult to pinpoint the cause of the error when it panicked the first time, and I could only figure out the reproduction steps days later 😅
For the retention period, I appreciate those details; I wouldn't have known if you hadn't pointed them out. I may have missed it, but is there some documentation on which values can be tweaked without affecting other dependent settings? While the configuration flexibility gives us a lot of power, I am probably missing a lot of the nuances of each field made available for user configuration...
For this specific use case, though, I am opting to go with the disk size update as you suggested. So please go ahead and close the ticket if no further actions are needed based on our conversation. If there is anything I can help with around error/panic handling and/or documentation, please let me know, I'd be happy to contribute back 🥰
Thanks for your follow up (and sorry for this very late reply).
I may have missed it, but is there some documentation on which values can be tweaked without affecting other dependent settings?
I'm not sure I understand this question. If you're referring to the fact that some configuration parameters need to be set on multiple components, then we recommend configuring Mimir via YAML rather than CLI flags, and sharing the same YAML config across all components.
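As a rough illustration of that recommendation (the bucket name and values are made up, and field availability varies by Mimir version), a single mimir.yaml passed to every component via -config.file keeps parameters that several components read, such as the blocks_storage section, defined in exactly one place:

```yaml
# mimir.yaml -- the same file is passed to every component with -config.file.
blocks_storage:
  backend: s3
  s3:
    endpoint: s3.us-east-1.amazonaws.com
    bucket_name: example-mimir-blocks   # hypothetical bucket
  tsdb:
    retention_period: 24h

querier:
  query_store_after: 12h
```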
Describe the bug
I have been testing Mimir with Istio in the same cluster, but Ingester pods get killed when queried, which seems to happen only after Mimir has been running for a while (around 24 hours). After the SIGBUS, the Ingester fails to come back up due to no disk space left on the PV (it seems as if the PV gets completely filled up when the SIGBUS happens?).
To Reproduce
Steps to reproduce the behavior:
At this point, somehow the Ingesters' PVs become completely full, and the Ingesters cannot start up anymore, failing with a "no disk space" error. It looks as if the SIGBUS error also causes the PVs to be filled up completely.
Note that simply deploying Mimir and querying it shortly after works just fine. The error is only reproducible after ingesting metrics for a while.
Expected behavior
Ingester should keep running healthily.
Environment
Additional Context
I'm testing Mimir with Istio in the same cluster. There are a few modifications I made to the Helm chart, such as port names, but nothing major. Judging from the logs below, I don't believe Istio is causing the error, but any help is appreciated!