tamalsaha closed this issue 1 year ago
This means the node has seen the metaleader but has not caught up yet. This should be a transient condition under normal circumstances.
It seems to have resolved on its own. One thing we have noticed is that the `startupProbe` and the `livenessProbe` can cause problems when the pod is in a situation like this for too long. The probes can fail the health check and the containers restart.
On restart, does the nats container resume from the last run or from some checkpoint state?
We have seen that in the case of databases, if a db container restarts while WAL recovery is running, on restart the database container will start WAL recovery again from the beginning. If probes are used in such cases and restart db containers before the WAL recovery completes, the containers never recover. Does nats have any such issue? I removed the probes to avoid the nats containers restarting while this recovery was going on. During nats recovery, I saw a lot of logs like the ones below:
```
[8] 2022/10/17 12:58:26.368413 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368417 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368419 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368423 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368426 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368429 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.375907 [WRN] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] Wrong index, ae is &{leader:dVlUzYpg term:53978 commit:23922 pterm:53978 pindex:23922 entries: 2}, index stored was 27970, n.pindex is 23922
[8] 2022/10/17 12:58:26.375930 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] Truncating and repairing WAL
[8] 2022/10/17 12:58:26.375950 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] Canceling catchup subscription since we are now up to date
[8] 2022/10/17 12:58:26.375992 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry did not match 53978 23924 with 53978 23922
[8] 2022/10/17 12:58:26.376080 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.376095 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
```
It depends on how the NATS system was shut down; if it was shut down cleanly, the restart time should be very quick.
Looping in @wallyqs and @caleblloyd who may have more insight etc.
What version of the Helm chart are you running?
We are using nats chart 0.17.0
Let's see what the others have to say.
I am also seeing this issue while trying to upgrade to chart version 0.18.1... and after reverting back to 0.18.0 I still see this error.
edit 1: after a few more retries of restarts and scaling the stateful set up/down, upgrading to chart 0.18.1 now works... so I have no idea lol
edit 2: it seems that whenever this error occurs, if I scale the statefulset to 1, then back up to 2, then again back up to 3... it works. I am running clustering with 3 replicas.
Will see what @wallyqs or @caleblloyd have to say; I am not up on the exact changes in the Helm charts.
Hi @tamalsaha, do you have an idea of roughly how long it stayed stuck during startup after disabling the health checks? These are the current defaults in the Helm charts, which total about 5 minutes before causing a restart during the startup phase: https://github.com/nats-io/k8s/blob/main/helm/charts/nats/values.yaml#L83-L90
Not completely sure. node-0 took the longest and was in that state for 10 minutes or more. I also saw the reloader container restart during that time.
I too am seeing this on 2.9.3 and Helm chart 0.18.0. I had a large built-up stream that needed flushing; it took over an hour of pod restarts before it finally came up.
Was this a cascading restart, where all pods were restarted at once? Ideally each pod should be running on a different host, and only one pod should be restarted at a time.
I had added more resources to the stateful set; it stalled on the one box, hitting the liveness check issues. It feels like if the server is still syncing, it should pass liveness but fail readiness. If the thing wasn't bouncing, I have a feeling it would have come up much faster.
The best solution to prevent that would be to tune the startup probe's `initialDelaySeconds`, `periodSeconds`, and `failureThreshold` so that it has enough time to catch up on startup. The total startup time allowed will be `initialDelaySeconds + (periodSeconds * failureThreshold)` before the liveness and readiness probes start.
Everyone's will be a little different depending on hardware and stream sizes, so it would be best to benchmark with your specific use case and then build in some headroom.
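For illustration, a tuned `startupProbe` might look like the sketch below. The probe type, endpoint, and timing values here are assumptions for the example, not the chart's actual defaults; tune them to your own measured catch-up times:

```yaml
# Hypothetical values -- benchmark your own catch-up time and add headroom.
startupProbe:
  httpGet:
    path: /healthz
    port: 8222          # NATS monitoring port (assumed enabled)
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 90
  # Total window before liveness/readiness take over:
  # 10 + (10 * 90) = 910 seconds (~15 minutes)
```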
@caleblloyd, I think the better option would be to disable these probes, since the startup time is unpredictable and can be hours.
Fwiw, our JetStream streams were way backed up due to application issues, with over 175m pending messages, and I was hitting OOMs. In the past (and ever since) I have had no problems similar to this; once this one finally completed, all the other nodes restarted just fine, well within the liveness probe limits. I am not sure this was a situation I could have tested against. I do believe live/ready checks are a good thing and should not be disabled, but in the emergency change case I got bit. Both suggestions would have done the trick, but of course that is in hindsight... I am all for doing the best/right thing by default if possible, which is why I commented, to add to this discussion. It feels like there is something that can be improved upon, to make NATS even more pain-free, auto-magical awesomeness than it already is.
Why were they so backed up? Did the applications not wait on publish acks from the system?
You can scale it down to 1 if `nats stream info` responds, meaning it has a leader.
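A sketch of that scale-down/scale-up workaround; the StatefulSet name, namespace, and stream name below are placeholders, not values from this thread:

```
# Check that the stream still has a leader before scaling down.
nats stream info my-stream

# Scale the StatefulSet down to one replica, wait for it to settle,
# then scale back up so each new pod can catch up in turn.
kubectl -n nats scale statefulset nats --replicas=1
kubectl -n nats rollout status statefulset nats
kubectl -n nats scale statefulset nats --replicas=3
```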
Was there an event that preceded the log entries?
We're seeing something very similar on our side and planned to rollout https://github.com/awesomenix/drainsafe but the project doesn't seem to be maintained so we decided against it.
Do you know when a fix will be available for this?
I believe this is resolved, and we are releasing 2.9.11 today, which has a proper healthz endpoint that can be used not only as a startup probe but also as a liveness probe with the latest operator.
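For reference, a liveness probe against the server's healthz endpoint could be sketched as below. The port and timings are assumptions for illustration (8222 is the conventional NATS monitoring port), and newer chart versions wire this up for you:

```yaml
# Minimal sketch, assuming the monitoring port is enabled on 8222.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8222
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
```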
Closing for now but feel free to re-open as needed.
How do we debug `JetStream is not current with the meta leader`? We are using nats-server:2.9.3-alpine images. The issue is that the nats-server-0 pod will not go into the ready state.