nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Resetting WAL state errors #4033

Open juan131 opened 1 year ago

juan131 commented 1 year ago

Defect

NATS server logs are filled with warnings such as the one below, and I'm unable to find any documentation about what causes them:

[37] [WRN] RAFT [cnrtt3eg - C-R3F-7al3bC6S] Resetting WAL state

Versions of nats-server and affected client libraries used:

OS/Container environment:

Linux containers running on a K8s cluster using a 3-replica StatefulSet (pods named nats-0, nats-1 & nats-2). Each replica has its own PVC.

NATS servers running in clustered mode with JetStream enabled. The configuration can be found below:

server_name: $NATS_SERVER_NAME
listen: 0.0.0.0:4222
http: 0.0.0.0:8222

# Logging options
debug: false
trace: false
logtime: false

# Some system overrides
max_connections: 10000
max_payload: 65536

# Clustering definition
cluster {
  name: "nats"
  listen: 0.0.0.0:6222
  routes = [
    nats://nats:6222
  ]
}

# JetStream configuration
jetstream: enabled
jetstream {
  store_dir: /data/jetstream
  max_memory_store: 3G
  max_file_store: 20G
}

There are 4 different streams configured in the cluster, with ~50 subjects on each stream. Stream configuration:

             Replicas: 3
              Storage: File

Options:

            Retention: WorkQueue
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 12h0m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 12h0m0s
 Maximum Message Size: 1.0 KiB
    Maximum Consumers: unlimited

Steps or code to reproduce the issue:

The issue seems to start after one of the NATS servers gets restarted and, once it begins, it doesn't stop (I can see 20K logs like this one in the last 12 hours, for instance).

Expected result:

According to the JetStream documentation, the system should tolerate the loss of one NATS server, given we're using a replication factor of 3.

Actual result:

Some streams are totally unusable when this happens (publishers can't add new messages & subscribers don't receive new messages), while other streams seem to be working as expected.

derekcollison commented 1 year ago

We have made many improvements in this area in the upcoming 2.9.16 release. It should be released next week.

In the meantime, you have two general approaches to repairing the raft layer.

  1. Have the leader of the asset in question step down. You can track down the mapping from the raft layer name to the named asset using /jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true. Then issue nats consumer cluster step-down <stream> <consumer>
  2. Scale the owning stream down to 1 replica and back up to the previous replica count: nats stream update --replicas=1 <stream>, then update it again with the original replica count.
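
As an illustration of option 2, here is a hedged sketch of the same scale-down/scale-up done programmatically with the nats.go JetStreamManager API; the connection URL and the stream name "FOO-queue" are assumptions, not details from this issue.

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    // Connect and obtain a JetStream management context (URL is an assumption).
    nc, err := nats.Connect("nats://nats:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Look up the current stream configuration ("FOO-queue" is a hypothetical name).
    si, err := js.StreamInfo("FOO-queue")
    if err != nil {
        log.Fatal(err)
    }

    cfg := si.Config
    previous := cfg.Replicas

    // Scale down to a single replica...
    cfg.Replicas = 1
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }

    // ...then back up to the previous replica count.
    cfg.Replicas = previous
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }
}

In practice you would wait for the stream to report healthy between the two updates rather than issuing them back to back.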
juan131 commented 1 year ago

Thanks so much for the information @derekcollison !! I hope the new release helps mitigate these issues 🤞🏼

A couple of follow-up questions:

derekcollison commented 1 year ago

Feel free to test the RC.7 candidate for 2.9.16; it is under Synadia's Docker Hub for nats-server.

In general for rolling updates, we suggest lame ducking a server; once it shuts down, restart it, wait for /healthz to return ok, then move to the next server.
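
A small sketch of the "wait for /healthz" step: poll the restarted server's monitoring endpoint (port 8222, as configured above) until it returns HTTP 200 before moving on to the next server. The hostname and retry budget below are assumptions.

package main

import (
    "fmt"
    "net/http"
    "time"
)

// waitHealthy polls a server's /healthz monitoring endpoint until it
// returns HTTP 200, or gives up after the configured number of attempts.
func waitHealthy(url string) error {
    for attempt := 0; attempt < 60; attempt++ {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("server at %s did not become healthy in time", url)
}

func main() {
    // e.g. the monitoring port of the restarted pod (hypothetical hostname).
    if err := waitHealthy("http://nats-0.nats:8222/healthz"); err != nil {
        fmt.Println(err)
    }
}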

juan131 commented 1 year ago

Thanks @derekcollison ! I'll give the new version a try.

One odd thing that I noticed is that, sometimes, a consumer gets "stuck" when one of the NATS servers is restarted (usually when the restarted server is the leader). I mean, no messages are delivered to its subscribers even though there are millions of messages pending in the stream.

Running nats consumer cluster step-down <stream> <consumer> doesn't help. However, removing the consumer (nats consumer rm <stream> <consumer>), creating it again & recreating the subscriptions does help.

Note: I use explicit-ack push consumers with a queue group of subscribers. The subscriptions are "channel-based", created using the ChanQueueSubscribe function.
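
A minimal sketch of that kind of subscription with nats.go, using the explicit-ack options described in this thread; the connection URL, subject and queue-group/durable names are illustrative assumptions rather than the actual application code.

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Channel-based push subscription in a queue group with explicit acks.
    // Subject, queue group and limits here are illustrative only.
    msgs := make(chan *nats.Msg, 128)
    sub, err := js.ChanQueueSubscribe("FOO.BAR", "FOO-BAR-workers", msgs,
        nats.Durable("FOO-BAR-workers"),
        nats.ManualAck(),
        nats.AckExplicit(),
        nats.AckWait(10*time.Second),
        nats.MaxDeliver(10),
        nats.MaxAckPending(100),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer sub.Unsubscribe()

    for m := range msgs {
        // ... process the message, then ack it explicitly ...
        if err := m.Ack(); err != nil {
            log.Println("ack failed:", err)
        }
    }
}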

Do you have any explanation for this behaviour? Is there anything I can do to detect a consumer is "stuck"?

juan131 commented 1 year ago

Related to the behaviour I mentioned above: I noticed in the Grafana dashboard I use for monitoring JetStream (based on this dashboard) that, when this happens, the number of messages with ACK pending soars for that consumer, see:

(Screenshot from 2023-04-11 09:16:49: Grafana panel showing the consumer's ack-pending message count spiking)

How can this happen if I'm setting Max Ack Pending: 100 on the consumer?

$ nats consumer info FOO-queue FOO-BAR-workers
Configuration:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers
          Ack Policy: Explicit
            Ack Wait: 10s
       Replay Policy: Instant
  Maximum Deliveries: 10
     Max Ack Pending: 100
        Flow Control: false

Cluster Information:

                Name: nats
              Leader: nats-0
             Replica: nats-1, current, seen 0.00s ago
             Replica: nats-2, current, seen 0.00s ago
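
Regarding the earlier question about detecting a "stuck" consumer, one possible heuristic (an assumption, not an official recipe) is to poll the consumer info programmatically and flag the consumer when its ack-pending count sits at the configured maximum while undelivered messages keep accumulating, i.e. the same fields shown in the output above:

package main

import (
    "fmt"
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Stream and consumer names taken from the consumer info output above.
    ci, err := js.ConsumerInfo("FOO-queue", "FOO-BAR-workers")
    if err != nil {
        log.Fatal(err)
    }

    // Heuristic: ack-pending pinned at MaxAckPending while undelivered
    // messages keep accumulating suggests the consumer is not making progress.
    if ci.NumAckPending >= ci.Config.MaxAckPending && ci.NumPending > 0 {
        fmt.Printf("consumer %s looks stuck: %d ack pending (max %d), %d pending\n",
            ci.Name, ci.NumAckPending, ci.Config.MaxAckPending, ci.NumPending)
    }
}
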
derekcollison commented 1 year ago

We would have to triage your specific setup to gain more insight.

You do have the ability to set max ack pending, which we highly recommend doing for production systems.

ggarri commented 1 year ago

Hi @derekcollison , we experienced the same issue on our side, but in our case we use 5 replicas for the workqueue stream, and we are on the latest version, 2.9.19.

In addition, we got one message "lost" from the queue during the period when our cluster was unavailable.

derekcollison commented 1 year ago

So you had log messages about resetting WAL? For which servers? What did stream info report for the stream?

ggarri commented 1 year ago

(screenshot attached)

Our WorkQueue stream's leader was moved to a different instance right after that, but messages were lost in the transition.

derekcollison commented 1 year ago

Meaning messages that were in the work queue were lost on a leader transfer?

ggarri commented 1 year ago

How can I make sure of that? I can only tell that the message was successfully sent by our NATS client, and that after the leader restart the message was gone from every available replica.

derekcollison commented 1 year ago

OK, and the message had not yet been properly delivered (and acknowledged), yes?

jnmoyne commented 1 year ago

Could it be that the problem is the consumer running out of its maximum number of deliveries because it is a push consumer and there is no one listening on the delivery subject at the time (while one of the servers is being restarted, some client applications get disconnected and need to reconnect to other servers)? And could the warning messages you are seeing on the server be related to the server restarting, i.e. just warnings?

I am saying this because of your consumer info:

                Name: FOO-BAR-workers
    Delivery Subject: _INBOX.5PwawLU8od7pwjLEwcfTPW
      Filter Subject: FOO.BAR
      Deliver Policy: All
 Deliver Queue Group: FOO-BAR-workers

Is it a push consumer with a queue group and your delivery subject is an _INBOX subject? That would not be the right thing to do.

I would recommend using pull consumers rather than push consumers on those work-queue streams.
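
A minimal sketch of the recommended pull-consumer pattern with nats.go; the connection URL, subject and durable name are assumptions.

package main

import (
    "errors"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Drain()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Durable pull consumer on the work-queue stream's subject.
    sub, err := js.PullSubscribe("FOO.BAR", "FOO-BAR-workers",
        nats.AckExplicit(),
        nats.MaxAckPending(100),
    )
    if err != nil {
        log.Fatal(err)
    }

    for {
        // Fetch a small batch; time out if nothing is available.
        msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
        if err != nil {
            if errors.Is(err, nats.ErrTimeout) {
                continue
            }
            log.Println("fetch error:", err)
            continue
        }
        for _, m := range msgs {
            // ... process the message, then ack it ...
            if err := m.Ack(); err != nil {
                log.Println("ack failed:", err)
            }
        }
    }
}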

ggarri commented 1 year ago

we are already using pull consumers, since in the past we identified issues with push consumers, as you mentioned:

sub, err = s.dialer.JetStream.PullSubscribe(subject, s.subCfg.Name, s.subCfg.PullSubOpts()...)
if err != nil {
    break
}
jnmoyne commented 1 year ago

We would need help being able to reproduce this, because with an R3 consumer on an R3 stream you should be able to have one server down and still operate. We would need more details about the failure scenario: for example, is it just one of the servers going down and then coming back up later, or more than one server being down? Can it access its store_dir right away? (Each server must have its own independent storage directory and be able to access it.) If there's any chance that more than one pod is restarting and their outage times overlap, consider using a cluster of 5+ nodes and R5 for better fault tolerance.

The Resetting WAL state warning can happen for many reasons when you have servers going up and down and trying to recover their state through RAFT votes, but it should not result in lost messages.

For deployment over k8s we strongly recommend using the Helm chart (https://github.com/nats-io/k8s/releases/tag/nats-1.0.0-rc.1)

ro0NL commented 1 month ago

we experience the same using the official Helm chart; in k8s we don't see any recent restarts or useful logs either, so I'm not sure how to debug it.

I'm betting, though, that it's a memory/resource limit issue of some sort.