nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

[Jetstream] NATS chart deployment seems to break quorum "randomly" after about a week #4085

Closed kevin-lindsay-1 closed 1 year ago

kevin-lindsay-1 commented 1 year ago

Defect

We are users of OpenFaaS, and NATS JetStream is part of our stack; OpenFaaS handles async requests via JetStream.

We have noticed that after some time has passed, roughly once every week or two, quorum appears to break, even though pods created by the NATS StatefulSet are not being brought up or down. The PVCs are stable as well.

Load does not appear to be particularly high during this time either: roughly ~1,000 items added and removed from the queue in the span of about 10-15 minutes, so ~2,000 mutations overall.

Eventually, we see this log start to repeat from one or more of the pods:

[26] 2023/04/12 14:15:07.186881 [WRN] JetStream cluster consumer '$G > faas-request > faas-workers' has NO quorum, stalled.
[26] 2023/04/12 14:15:07.248099 [WRN] JetStream cluster stream '$G > faas-request' has NO quorum, stalled

Quorum breaks, and this stream effectively "jams". We then just go in and restart the offending pod, and things seem to come back online; no data appears to be lost.

Make sure that these boxes are checked before submitting your issue -- thank you!

Maybe I'm not creating this issue in the correct repo, because nats-server -dv doesn't appear to be the command I would expect to use here. I have server logs, but they're in Datadog. There are ways to get them to the NATS team, though. Feel free to move this issue or guide me to what you need.

As for the MCVE, this issue is pretty hard to reproduce reliably in a timely manner, because quorum just kinda eventually breaks.

Versions of nats-server and affected client libraries used:

OS/Container environment:

AWS EKS cluster @ k8s:1.26

Steps or code to reproduce the issue:

Expected result:

If no pods are moved from the StatefulSet, I would generally not expect quorum to break.

Actual result:

alexellis commented 1 year ago

> not sure about the client libraries; OpenFaaS would be able to answer that.

@kevin-lindsay-1 - from go.mod - github.com/nats-io/nats.go v1.24.0 - assuming you're on the latest build of the queue worker / gateway.

derekcollison commented 1 year ago

Grab latest server release as well. 2.9.16.

kevin-lindsay-1 commented 1 year ago

> Grab latest server release as well. 2.9.16.

Done.

derekcollison commented 1 year ago

Quorum should not break, so something else was going on. If you see this happen again ping us on Slack or here and let's jump on a Zoom/GH call.

derekcollison commented 1 year ago

You using latest helm charts as well yes?

kevin-lindsay-1 commented 1 year ago

> You using latest helm charts as well yes?

0.19.13

kevin-lindsay-1 commented 1 year ago

> Quorum should not break, so something else was going on. If you see this happen again ping us on Slack or here and let's jump on a Zoom/GH call.

I also removed Istio sidecars from everything in the OF namespace and verified that everything continues to work as expected, in order to remove additional moving parts.

kevin-lindsay-1 commented 1 year ago

I've noticed a new behavior today wherein quorum doesn't explicitly break, but certain streams just get "stuck". The consumer (the OF Queue Worker) seems to have been interrupted while working on messages, and for some reason rebooting the NATS instances "un-jams" the system, causing messages to be processed again.

It's not clear to me what's going on here, as neither OF QW nor NATS really appear to be logging anything of interest related to "bad state" or something going wrong, but in the last 24 hours I've had 2 full "jams" of the streams, where certain jobs stop getting processed for hours at a time.

derekcollison commented 1 year ago

We would need to look at your application code. Also, nats consumer info is your friend here, along with nats stream info.

kevin-lindsay-1 commented 1 year ago

> We would need to look at your application code. Also, nats consumer info is your friend here, along with nats stream info.

OF is directly touching NATS; my code is not.

derekcollison commented 1 year ago

What is OF again?

kevin-lindsay-1 commented 1 year ago

OpenFaaS

derekcollison commented 1 year ago

Ah, apologies. Then someone needs to look at the OpenFaaS code base and how it is handling getting messages from the work queue. The server signals, even a new server that you might reconnect to, that the pull request is no longer valid, etc.

Do you work with OpenFaaS, or are you a user/customer of theirs?

kevin-lindsay-1 commented 1 year ago

OpenFaaS uses NATS, and somewhat recently NATS JetStream; we use it by that association. We are a customer of OpenFaaS.

derekcollison commented 1 year ago

ok, I am not familiar with the OpenFaaS code and how they interact with NATS JetStream, so we probably need to loop in @alexellis

kevin-lindsay-1 commented 1 year ago

We (Alex and I) are investigating the issue. It's proving hard to reproduce, but it happens reliably over the long term; it seems like messages are completing successfully in OF, after which you would expect an ACK to have occurred, yet messages appear to be retried even though their workload appears to have succeeded.

Additionally, it seems like certain messages also simply aren't redelivered when their ACK_WAIT expires. It's hard to tell exactly what's going on, because neither system is producing logs that tell me what is happening, and I see no errors from either system.

At this time, it seems to me like NATS JetStream, in certain bursts and for specific messages, is neither responding to ACKs on its side nor redelivering messages when ACK_WAIT expires, leading me to believe that certain messages in NATS are "getting stuck", and that nothing on the OF side of the relationship has any way to determine that something is "wrong".

Restarting the NATS pods seems to immediately clear this issue up, leading me to believe that something in NATS is getting into a bad state.

derekcollison commented 1 year ago

We would need to spend time triaging the issue and understanding the OpenFaaS codebase tbh.

Restarting the server is fixing it, but not, IMO, because the state in the server is being fixed; I think that is probably fine. I think the restart is causing OpenFaaS to reconnect and reset its state.

What server version? What Go version of the client is OpenFaaS using?

derekcollison commented 1 year ago

Upgrade to 2.9.16 for the server if you could so we know you are on the latest.

kevin-lindsay-1 commented 1 year ago

> Upgrade to 2.9.16 for the server if you could so we know you are on the latest.

I already did that and told you so.

kevin-lindsay-1 commented 1 year ago

> What server version? What Go version of the client is OpenFaaS using?

I already told you both of these.

derekcollison commented 1 year ago

ok apologies, lots going on for us so hard to keep track of all GH conversations at once.

Have you pinged @alexellis for support?

kevin-lindsay-1 commented 1 year ago

> Have you pinged @alexellis for support?

Yes, as I said, he and I are investigating and trying to produce a specific repro, but it's proving difficult because the issue seems to just happen, without much consistency in the underlying infrastructure or anything else. It's almost like NATS just randomly breaks, but I can't tell you much more because I don't have any logs from either side showing an issue. Both sides say nothing is wrong, yet something is wrong.

kevin-lindsay-1 commented 1 year ago

> Restarting the server is fixing it, but not, IMO, because the state in the server is being fixed; I think that is probably fine. I think the restart is causing OpenFaaS to reconnect and reset its state.

I will need to specifically target this exact case, but in previous troubleshooting attempts restarting the OF QW (the NATS client) didn't seem to fix the issue. Disconnecting and reconnecting doesn't seem to be the problem here; resetting NATS state does seem to resolve it, although I might expect some messages to get stuck again and need multiple restarts if there are enough messages.

In other words, it currently seems like either server-side state is getting screwed up, or the client is doing something wrong and assuming the server should be doing something that it shouldn't. Considering that OF used the previous implementation of NATS just fine, the client being at fault seems somewhat less likely to me; a new client is less complicated than a new server.

derekcollison commented 1 year ago

Again, nats stream info and nats consumer info are your friends, as well as, from a system user perspective, nats server ls and nats server report jetstream.
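For reference, against the stream and consumer named in the logs above, that looks roughly like the following (the last two commands typically need system-account credentials):

```sh
# stream and consumer state (names taken from the logs in this issue)
nats stream info faas-request
nats consumer info faas-request faas-workers

# cluster-wide view; run with a system-account context
nats server ls
nats server report jetstream
```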

alexellis commented 1 year ago

Hi @derekcollison we've had a look at the stream and consumer info :+1:

The other two commands aren't working for me on a port-forwarded nats server.

We use a pull subscriber fetching one message at a time in its own goroutine. Pull a message off, invoke the function via our gateway, then either ack or nack it depending on the result, meanwhile sending in-progress acks on the message during execution to extend the ack window. The default ack window is set to 3m.
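Roughly, that loop looks like the sketch below. This is a minimal illustration using nats.go, not our actual queue-worker code: the stream and durable names are taken from the logs earlier in this issue, and invoke() plus the 30s progress interval are placeholders.

```go
// Minimal sketch of the pull-subscriber flow described above, using nats.go.
// "faas-request" and "faas-workers" come from the logs in this issue;
// invoke() and the 30s progress interval are illustrative placeholders.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer with a 3 minute ack window.
	sub, err := js.PullSubscribe("faas-request", "faas-workers", nats.AckWait(3*time.Minute))
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(1, nats.MaxWait(30*time.Second))
		if err != nil {
			// nats.ErrTimeout simply means no message was available; try again.
			continue
		}
		msg := msgs[0]

		// Keep extending the ack window while the invocation runs.
		done := make(chan struct{})
		go func() {
			t := time.NewTicker(30 * time.Second)
			defer t.Stop()
			for {
				select {
				case <-done:
					return
				case <-t.C:
					_ = msg.InProgress() // resets the server's ack-wait timer
				}
			}
		}()

		err = invoke(msg.Data) // invoke the function via the gateway (placeholder)
		close(done)

		if err != nil {
			_ = msg.Nak() // negative ack: ask JetStream to redeliver
		} else {
			_ = msg.Ack()
		}
	}
}

// invoke stands in for the HTTP call to the gateway.
func invoke(payload []byte) error {
	return nil
}
```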

I'm happy to screen share with someone and walk through the code to see if there's something we've got misconfigured.

Alex

kevin-lindsay-1 commented 1 year ago

I have been testing various NATS helm chart configuration options, and I will report back on my findings after conducting further tests in order to attempt to pinpoint this issue.

derekcollison commented 1 year ago

Let's do a screen share, I have some time tomorrow or next week.

kevin-lindsay-1 commented 1 year ago

I'm still evaluating where the error (or errors) is. There are multiple layers: AWS, K8s, NATS, OF, and my functions. I found a small "bug" in OF that confounds things. I next need to work around it and evaluate whether this still occurs without the confounding factor.

I am also testing things such as cluster vs no cluster, storage vs memory, and so on.

I will screen share if my report is inconclusive, but until then I'm not far enough along to have conclusive evidence that the issue is indeed NATS.

derekcollison commented 1 year ago

ok keep us posted.

kevin-lindsay-1 commented 1 year ago

After working on producing reproduction steps for this, I identified a "bug" in OpenFaaS (really a missing feature). Once I worked around it, I have been unable to reproduce this issue.

I'm closing this issue for now, because at the moment it appears that the issue lives on the OF side.

AntPAllen commented 1 year ago

Hi @kevin-lindsay-1, would you mind elaborating on what the bug was on the OF side? We have also experienced similar issues and would like to rule out the usage on the client side.

alexellis commented 1 year ago

There was no bug in OpenFaaS.

Kevin had told our queue-worker not to retry 500 status codes from functions, so it looked like invocations were "going missing", when in fact it was working as designed.

The quorum error is still unknown and unrelated to OpenFaaS.

From what I understood, Kevin was using spot instances for NATS, so the various 3/3 replicas of the NATS server could go down, potentially at the same time.

Hope this helps @AntPAllen

Thanks @derekcollison for the input too.