So this is something we do experience from time to time. The workaround for now is to forget the leaving ingester.
I want to also point out that the helm chart wasn't meant to be scaled; for that we recommend the tanka/ksonnet production setup.
Also let's ask our ring expert @pstibrany.
Are you running ingesters on those preemptible nodes? From what I'm reading, I would strongly recommend not doing that. Ingesters take a while to shut down correctly (during a rollout of a new version, a new ingester starts while another one is stopping, and the stopping ingester then tries to hand over its data to the starting one). You can disable handover by using `-ingester.max-transfer-retries=0` (or the `max_transfer_retries` field under `ingester` in YAML). This will tell the ingester to flush data to storage only. Note that this can still take a while.
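For reference, a minimal sketch of what that looks like in the Loki config file, assuming the field placement described above:

```yaml
# Minimal sketch: disable handover so a stopping ingester flushes its data
# to storage instead of waiting for a new ingester to take it over.
ingester:
  max_transfer_retries: 0
```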
If an ingester is left in the ring in LEAVING state, it means that it failed to perform its shutdown correctly, and then it's best to find out why. When running on preemptible nodes, it is very likely that the ingester was killed before it was able to hand over its data or flush it properly, since pods on preemptible nodes are not shut down gracefully. This is bad and can lead to data loss.
@pstibrany yep, it's a preemptible pool. But:
Anyway, skipping the argument about preemptible pros and cons: do you think staying in LEAVING forever is normal behavior? If I delete the pod which is in LEAVING, it shuts down gracefully and comes back ACTIVE after restart, so why does it not shut itself down after some meaningful timeout while in LEAVING? I don't see the need for manual action here.
In other words, I'm OK with partial data loss (logs from a single node due to an unexpected shutdown) - that's why I use preemptible nodes. I'm not OK with the application requiring a manual restart to heal - the data loss already happened, why wait further?
> So this is something we do experience from time to time. The workaround for now is to forget the leaving ingester.
Forget does not work for me; the ingester just reappears as LEAVING in a matter of seconds.
If an ingester is OOMKilled, that's bad and can result in loss of data. That's also why we use replication. Other Loki components are not stateful (well, query frontends keep a queue of queries, but if you lose queries, it's not a big deal). Ingesters keep a lot of data in memory and need a proper shutdown.
It's interesting that you point out that the ingester reappears in LEAVING state... it means it's running. I think that's because when the new ingester started, it simply picked up the state that was in the ring (LEAVING) and is now running with that state. You can confirm by checking for an `existing entry found in ring` entry in the log.
I think the best you can do is to 1) stop this ingester, 2) forget it from the ring if it is still there after stopping, 3) restart it again.
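A hypothetical sketch of those three steps for a StatefulSet-based deployment (the pod name is illustrative, and the ring page path may differ per setup):

```sh
# 1) stop the stuck ingester (the StatefulSet controller will recreate it)
kubectl delete pod loki-0
# 2) while it is down, open the ring status page (e.g. /ring) on another
#    replica and use "Forget" on the entry if it is still listed as LEAVING
# 3) the recreated pod then registers in the ring again on startup
```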
In the upcoming Loki release, this is handled automatically -- if an ingester finds its entry in LEAVING state and the number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.
> it means it's running
It is. The thing is that I'm using the 'all in one' image; I'm not sure how this will work in the case of a single ingester process.
> In the upcoming Loki release, this is handled automatically -- if an ingester finds its entry in LEAVING state and the number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.
@pstibrany thanks Peter! Looking forward to it.
For reference, here is the relevant change: https://github.com/cortexproject/cortex/pull/1921.
> In the upcoming Loki release, this is handled automatically -- if an ingester finds its entry in LEAVING state and the number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.
Sounds good, any date for the next release?
> Sounds good, any date for the next release?
real soon now, I think 😉
/cc @slim-bean
UPDATE: unfortunately, I was testing it on non-preemptible nodes by mistake. On preemptible nodes it keeps failing even with `-ingester.max-transfer-retries=0` :( I'm moving this to non-preemptible nodes.
@pstibrany, I've been testing the workaround below for a few days now, and it looks like it works - I haven't seen an ingester in the LEAVING state since changing the configuration. It looks like 30 seconds (the preemption notice time) is enough for the ingester to flush data to storage.
> You can disable handover by using `-ingester.max-transfer-retries=0` (or the `max_transfer_retries` field under `ingester` in YAML). This will tell the ingester to flush data to storage only. Note that this can still take a while.
This behavior can also be forced by sending a SIGKILL or deleting pods with `--grace-period=0`.
Maybe it makes sense to test this in a chaoskube way?
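For example, a chaoskube-style test could force-delete a random ingester pod with no grace period to approximate an abrupt preemption (the label selector here is illustrative):

```sh
# Pick a random Loki pod and kill it without a grace period, skipping
# graceful shutdown entirely -- roughly what a preemption looks like.
POD=$(kubectl get pods -l app=loki -o name | shuf -n 1)
kubectl delete "$POD" --grace-period=0 --force
```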
The problem is still here (non-preemptible machines aren't a fix).
So I would like to add something regarding SIGTERM / SIGKILL with GKE preemptible nodes. The explanation in the docs about preemptibles applies to 'normal' instances, not to GKE workers.
It seems GKE does not deliver this signal to terminate pods gracefully. So what happens is that the pod is killed instantly, which makes `-ingester.max-transfer-retries=0` unusable: there is no time for any retry whatsoever.
To be able to notify your pods of termination, you will need a DaemonSet such as k8s-node-termination-handler running on your preemptibles.
I just installed it and didn't do any tests yet, but just wanted to inform you 😄
I've experimented with k8s-node-termination-handler; it's better, but it still needs some tests (and time) to ensure it "solves" the problem.
Just a few notes to make this "work":
- Run k8s-node-termination-handler with `--system-pod-grace-period=0s` -> this gives the pods the full 30 seconds for a graceful shutdown.
- Set `server.graceful_shutdown_timeout=5s`, `ingester.lifecycler.final_sleep=0s` and `ingester.max_transfer_retries=20` (see the sketch below).
- Why `max_transfer_retries` > 20? Because it is faster to transfer chunks to the new ingester than to flush to the store (on my side, it was impossible to flush in less than 30 seconds).
- `final_sleep` -> be careful here: if you expose Prometheus metrics, you can lose the final scrape.
- FYI, when SIGTERM happens, the ingester runs its shutdown steps in order, ending with a sleep of `ingester.lifecycler.final_sleep` seconds.

But it's important to mention that this works only in the best-case scenario; if the node crashes or the shutdown takes longer than 30 seconds, the ingester will be stuck in an unhealthy state.
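Pulled together, the settings from the notes above would look roughly like this in the Loki config (values as reported here; the nesting follows the field paths named in the notes):

```yaml
server:
  graceful_shutdown_timeout: 5s
ingester:
  max_transfer_retries: 20   # transferring to the new ingester is faster than flushing
  lifecycler:
    final_sleep: 0s          # careful: you may lose the final metrics scrape
```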
EDIT: in the last hour, 4 ingesters got stuck in an unhealthy state, so I guess this isn't a viable solution...
Describe the bug
After/during a node reboot (preemptible node in GKE, reboot time is limited to ~30 seconds), an ingester pod goes into LEAVING state forever, until the pod is deleted manually.
To Reproduce
Deploy with helm on preemptible nodes. Every time a node is preempted, one of the ingesters goes to LEAVING and keeps running in this state (`/ready` returns Ready, so the pod is marked as healthy and ready).
Expected behavior
The ingester should either heal itself or exit after an unclean restart upon finding its previous state to be LEAVING.
Environment:
Deployment is done from loki helm chart 0.25.1
Loki version string: version=v1.3.0, branch=HEAD, revision=5e527a03
Helm values file:
Promtail (DaemonSet), Cassandra (StatefulSet) and etcd (StatefulSet) are deployed separately to the same preemptible nodes and experience no issues, except Promtail not being able to push logs to Loki after 3 of 5 ingesters go to LEAVING.
Screenshots, Promtail config, or terminal output
Discussion in slack: https://grafana.slack.com/archives/CEPJRLQNL/p1584242451454400
Logs for starting Loki instance after node reboot:
The problem might come from Cortex not having a path out of the LEAVING state on startup: https://github.com/cortexproject/cortex/blob/3374e33bde0e805f9955fc8ba4caf0a4b96e04ca/pkg/ring/lifecycler.go#L363