nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

JetStream is not current with the meta leader #3563

Closed: tamalsaha closed this issue 1 year ago

tamalsaha commented 1 year ago

How do we debug "JetStream is not current with the meta leader"? We are using nats-server:2.9.3-alpine images. The issue is that the nats-server-0 pod will not go into the ready state.

# http://localhost:8222/healthz

{
  "status": "unavailable",
  "error": "JetStream is not current with the meta leader"
}
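
For reference, a readiness probe pointed at this endpoint looks roughly like the sketch below (assumed values, not the chart's exact defaults); until /healthz returns 200, the pod stays unready:

containers:
  - name: nats
    image: nats:2.9.3-alpine
    ports:
      - containerPort: 8222   # monitoring port serving /healthz
        name: monitor
    readinessProbe:
      httpGet:
        path: /healthz        # the endpoint shown above
        port: 8222
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3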
derekcollison commented 1 year ago

This means the node has seen the metaleader but has not caught up yet. This should be a transient condition under normal circumstances.

tamalsaha commented 1 year ago

It seems to have resolved on its own. One thing we have noticed is that the startupProbe and the livenessProbe can cause problems when the pod is in a situation like this for too long. The probes can fail the health check and the containers restart.

On restart, does the nats container resume from the last run or from some checkpoint state?

We have seen with databases that if a db container restarts while WAL recovery is running, the database container will start WAL recovery again from the beginning. If probes are used in such cases and restart the db containers before WAL recovery finishes, the containers never recover. Does NATS have any such issue? I removed the probes to keep the nats containers from restarting while this recovery was going on. During the NATS recovery, I saw a lot of logs like the ones below:

[8] 2022/10/17 12:58:26.368413 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368417 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368419 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368423 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368426 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.368429 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.375907 [WRN] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] Wrong index, ae is &{leader:dVlUzYpg term:53978 commit:23922 pterm:53978 pindex:23922 entries: 2}, index stored was 27970, n.pindex is 23922
[8] 2022/10/17 12:58:26.375930 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] Truncating and repairing WAL
[8] 2022/10/17 12:58:26.375950 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] Canceling catchup subscription since we are now up to date
[8] 2022/10/17 12:58:26.375992 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry did not match 53978 23924 with 53978 23922
[8] 2022/10/17 12:58:26.376080 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
[8] 2022/10/17 12:58:26.376095 [DBG] RAFT [qcDOUgvA - S-R3F-ZGvNLjJE] AppendEntry ignoring old entry from previous catchup
derekcollison commented 1 year ago

Depends on how the NATS system was shut down; if shut down cleanly, restart time should be very quick.

Looping in @wallyqs and @caleblloyd who may have more insight etc.

What version of the Helm chart are you running?

tamalsaha commented 1 year ago

We are using nats chart 0.17.0

derekcollison commented 1 year ago

Let's see what the others have to say.

thiDucTran commented 1 year ago

I am also seeing this issue when trying to upgrade to chart version 0.18.1... reverting back to 0.18.0 and I still see this error.

edit 1: after a few more retries of restarts and scaling the stateful set up/down... upgrading to chart 0.18.1 now works... so I have no idea lol

edit 2: it seems that whenever this error occurs, if I scale the statefulset to 1, then back up to 2, then again back to 3... it works... I am running clustering with 3 replicas

derekcollison commented 1 year ago

Will see what @wallyqs or @caleblloyd have to say; I am not up on the exact changes in the Helm charts.

wallyqs commented 1 year ago

Hi @tamalsaha, do you have an idea of roughly how long it stayed stuck during startup after disabling the health checks? These are the current defaults in the Helm charts, which total about 5 minutes before causing a restart during the startup phase: https://github.com/nats-io/k8s/blob/main/helm/charts/nats/values.yaml#L83-L90

tamalsaha commented 1 year ago

Not completely sure. node-0 took the longest and was in that state for 10 minutes or more. I also saw the reloader container restart during that time.

cbuckley01 commented 1 year ago

I too am seeing this on 2.9.3 and Helm chart 0.18.0. I had a large built-up stream that needed flushing; it took over an hour of pod restarts before it finally came up.

caleblloyd commented 1 year ago

Was this a cascading restart, where all pods were restarted at once? Ideally each pod should be running on a different host, and only one pod should be restarted at a time.
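
A minimal sketch of how that can be enforced in Kubernetes, assuming the pods carry an app: nats label (illustrative fragments, not the chart's built-in settings):

# Under the StatefulSet pod template spec: require one pod per host
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nats
        topologyKey: kubernetes.io/hostname
---
# Allow at most one pod to be taken down at a time during voluntary
# disruptions such as node drains
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nats
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: nats

The StatefulSet's default RollingUpdate strategy already replaces one pod at a time; the PodDisruptionBudget extends that guarantee to node drains.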

cbuckley01 commented 1 year ago

I had added more resources to the stateful set; it stalled on the one box, hitting the liveness check issues. It feels like if the server is still syncing, it should pass liveness but fail readiness. If the thing wasn't bouncing, I have a feeling it would have come up much faster.

caleblloyd commented 1 year ago

The best solution to prevent that would be to tune the startup probe's initialDelaySeconds, periodSeconds and failureThreshold so that it has enough time to catch up on startup. The total startup time allowed will be initialDelaySeconds + (periodSeconds * failureThreshold) before the liveness and readiness probes start.

https://github.com/nats-io/k8s/blob/145f4662a4935e8326c8b55b5e585bde8bc62327/helm/charts/nats/values.yaml#L92-L99
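
For example, a sketch with illustrative numbers (not the chart's defaults): initialDelaySeconds=10, periodSeconds=30 and failureThreshold=20 give 10 + (30 * 20) = 610 seconds, roughly 10 minutes of catch-up time before the container is restarted:

startupProbe:
  httpGet:
    path: /healthz
    port: 8222
  initialDelaySeconds: 10   # wait before the first check
  periodSeconds: 30         # check every 30 seconds
  failureThreshold: 20      # tolerate 20 consecutive failures
  # total window: 10 + (30 * 20) = 610s before the kubelet restarts the container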

Everyone's settings will be a little different depending on hardware and stream sizes, so it would be best to benchmark with your specific use case and then build in some headroom.

tamalsaha commented 1 year ago

@caleblloyd, I think the better option would be to disable these probes, since the startup time is unpredictable and can be hours.

cbuckley01 commented 1 year ago

FWIW, our JetStream streams were way backed up due to application issues, with over 175m pending messages, and I was hitting OOM. In the past (and ever since) I have had no problems similar to this; once this one node finally completed, all the other nodes restarted just fine, well within the liveness probe limits. I am not sure this was a situation I could have tested against. I do believe live/ready checks are a good thing and should not be disabled, but in the emergency change case I got bit. Both suggestions would have done the trick, but of course that is in hindsight... I am all for doing the best/right thing by default if possible, which is why I commented, to add to this discussion. It feels like there is something that can be improved upon, to make NATS even more pain-free, auto-magical awesomeness than it already is.

derekcollison commented 1 year ago

Why were they so backed up? Did the applications not wait on publish acks from the system?

jocrau commented 1 year ago

We are seeing log entries on our development cluster that are similar to the ones tamalsaha posted. Is there a way to recover without disruptive measures, like deleting and recreating the stream or scaling down and up again?

Helm chart: 0.18.1, NATS: 2.9.4-beta

nats.log

derekcollison commented 1 year ago

You can scale it down to 1 if nats stream info responds, meaning it has a leader.

Was there an event that preceded the log entries?

kansarav commented 1 year ago

We're seeing something very similar on our side and planned to roll out https://github.com/awesomenix/drainsafe, but the project doesn't seem to be maintained so we decided against it.

Do you know when a fix will be available for this?

derekcollison commented 1 year ago

Believe this is resolved, and we are releasing 2.9.11 today, which has a proper healthz that can be used not only as a startup probe but also as a liveness probe with the latest operator.
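
Assuming the new healthz query parameters are available in your server version (for example js-enabled-only, which only checks that JetStream is enabled and skips the meta-leader catch-up check), the probes can be split roughly along these lines; treat the parameter name and the numbers as a sketch to verify against the release notes:

# Startup: full check, with a generous window for catching up with the meta leader
startupProbe:
  httpGet:
    path: /healthz
    port: 8222
  periodSeconds: 10
  failureThreshold: 90
# Liveness: only checks that the server is up and JetStream is enabled, so a
# node that is still catching up is not killed and restarted mid-recovery
livenessProbe:
  httpGet:
    path: /healthz?js-enabled-only=true   # assumed parameter; verify for your version
    port: 8222
  periodSeconds: 30
  failureThreshold: 3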

Closing for now but feel free to re-open as needed.