nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Re-deploying nats-jetstream fails with `NO quorum, stalled.` #2737

Closed duc2h closed 1 year ago

duc2h commented 2 years ago

Defect

Versions of nats-server and affected client libraries used:

2.6.6-alpine3.14

OS/Container environment:

Steps or code to reproduce the issue:

Re-deploy the NATS cluster.

Expected result:

The deployment succeeds.

Actual result:

2021-12-08 16:45:15.695 ICT[1] 2021/12/08 09:45:15.694914 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-sync-staff' has NO quorum, stalled.
2021-12-08 16:45:16.904 ICT[1] 2021/12/08 09:45:16.904527 [WRN] JetStream cluster stream 'A > studenteventlogs' has NO quorum, stalled.
2021-12-08 16:45:18.250 ICT[1] 2021/12/08 09:45:18.249982 [WRN] JetStream cluster consumer 'A > eurekastudentevent > durable-eureka-student-event-created' has NO quorum, stalled.
2021-12-08 16:45:18.414 ICT[1] 2021/12/08 09:45:18.414104 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-class' has NO quorum, stalled.
2021-12-08 16:45:18.565 ICT[1] 2021/12/08 09:45:18.564943 [WRN] JetStream cluster stream 'A > syncmasterregistration' has NO quorum, stalled.
2021-12-08 16:45:19.506 ICT[1] 2021/12/08 09:45:19.506590 [WRN] JetStream cluster stream 'A > activitylog' has NO quorum, stalled.
2021-12-08 16:45:19.717 ICT[1] 2021/12/08 09:45:19.717403 [WRN] JetStream cluster consumer 'A > learningobjectives > durable-learning-objectives-created' has NO quorum, stalled.
2021-12-08 16:45:20.278 ICT[1] 2021/12/08 09:45:20.278274 [WRN] JetStream cluster consumer 'A > studenteventlogs > durable-student-event-logs-created' has NO quorum, stalled.
2021-12-08 16:45:20.506 ICT[1] 2021/12/08 09:45:20.505963 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-log-payload' has NO quorum, stalled.
2021-12-08 16:45:20.687 ICT[1] 2021/12/08 09:45:20.686837 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-academic-year' has NO quorum, stalled.
2021-12-08 16:45:20.805 ICT[1] 2021/12/08 09:45:20.805577 [WRN] JetStream cluster consumer 'A > cloudconvertjobevent > durable-cloud-convert' has NO quorum, stalled.
2021-12-08 16:45:22.363 ICT[1] 2021/12/08 09:45:22.363649 [WRN] JetStream cluster consumer 'A > activitylog > durable-activity-log-created' has NO quorum, stalled.
2021-12-08 16:45:22.601 ICT[1] 2021/12/08 09:45:22.601333 [WRN] JetStream cluster stream 'A > learningobjectives' has NO quorum, stalled.
2021-12-08 16:45:23.257 ICT[1] 2021/12/08 09:45:23.257704 [WRN] JetStream cluster consumer 'A > assignstudyplan > durable-assign-study-plan' has NO quorum, stalled.
2021-12-08 16:45:24.043 ICT[1] 2021/12/08 09:45:24.043701 [WRN] JetStream cluster stream 'A > syncusercourse' has NO quorum, stalled.
2021-12-08 16:45:24.807 ICT[1] 2021/12/08 09:45:24.807032 [WRN] JetStream cluster stream 'A > chatmessage' has NO quorum, stalled.
2021-12-08 16:45:24.989 ICT[1] 2021/12/08 09:45:24.989169 [WRN] JetStream cluster stream 'A > assignstudyplan' has NO quorum, stalled.
2021-12-08 16:45:26.588 ICT[1] 2021/12/08 09:45:26.588135 [WRN] JetStream cluster consumer 'A > studentpackage > durable-student-package' has NO quorum, stalled.
2021-12-08 16:45:28.324 ICT[1] 2021/12/08 09:45:28.323815 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-course-class' has NO quorum, stalled.
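
With this many warnings it can help to group the stalled consumers under their streams. The following is a rough sketch only: it assumes the log format shown above (account, stream, and optional consumer separated by " > " inside single quotes), and the function name `stalled_assets` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the warning text seen in the log lines above.
STALLED = re.compile(
    r"JetStream cluster (?P<kind>stream|consumer) '(?P<path>[^']+)' has NO quorum, stalled"
)

def stalled_assets(log_lines):
    """Group stalled consumers under their stream, keyed by (account, stream)."""
    report = defaultdict(lambda: {"stream_stalled": False, "consumers": []})
    for line in log_lines:
        match = STALLED.search(line)
        if not match:
            continue  # skip unrelated lines
        parts = match.group("path").split(" > ")
        entry = report[(parts[0], parts[1])]
        if match.group("kind") == "stream":
            entry["stream_stalled"] = True
        else:
            entry["consumers"].append(parts[2])
    return dict(report)

sample = [
    "2021/12/08 09:45:15.694914 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-sync-staff' has NO quorum, stalled.",
    "2021/12/08 09:45:18.564943 [WRN] JetStream cluster stream 'A > syncmasterregistration' has NO quorum, stalled.",
]
for key, entry in stalled_assets(sample).items():
    print(key, entry)
```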

derekcollison commented 2 years ago

Can you give details about how many servers in your cluster? What is the replication factor of the streams and consumers?

rh2048 commented 2 years ago

I've run into this issue with a 5-node cluster. I can't reproduce it consistently, but it seemed to be correlated with an inconsistency in the cluster size reported by each server after restarting a node (see #2657).

We were unable to resolve this or to get every server to report the correct cluster size consistently. Instead, we moved to a 3-node cluster and haven't had issues since.

nvcnvn commented 2 years ago

We seem to be having the same issue with the same version.

We're not sure what we can provide to help debug this.

tommylp commented 2 years ago

Same problem?
Cluster size: 3
NATS: 2.7.4
Using NATS security with distributed JWTs

I updated from 2.6.6 to a new version (2.7.4). Before upgrading, a nats backup of the streams was performed, but now I am unable to restore the streams. I get this error:

[WRN] JetStream cluster stream 'AD2XXTUQI453QTLRZYHP4O2NGKPUMI6T22MGKKUWADO3IS6W226NQZX7 > <stream>' has NO quorum, stalled

Checking whether the stream exists:

nats -s <server> --creds <credsFile> stream report
Obtaining Stream stats
No Streams defined

If I try to create the same stream again after the failed restore I'm getting this:

nats -s <server> --creds <credsFile> stream create <streamName>
? Subjects to consume <topic>.>
? Storage backend file
? Retention Policy Limits
? Discard Policy Old
? Stream Messages Limit -1
? Message size limit -1
? Maximum message age limit 3M
? Maximum individual message size -1
? Duplicate tracking time window 5m
? Replicas 2
nats: error: could not create Stream: malformed or corrupt message

ripienaar commented 2 years ago

Please run your create command with --trace and show the output.

tommylp commented 2 years ago

12:20:23 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":2,"duplicate_window":300000000000}

12:20:23 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","error":{"code":500,"err_code":10049,"description":"malformed or corrupt message"}}

nats: error: could not create Stream: malformed or corrupt message

Setting the replication count to 1 does create the stream.

ripienaar commented 2 years ago

Your subject appears to end in foo.>>; it can only have one >.
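
A quick pre-flight check for this kind of typo could look like the sketch below. It is an illustrative helper, not the server's own validation: it assumes tokens are separated by "." and that ">" is only meaningful as a lone final token, and it simply flags anything else containing ">" as suspicious. The function name and the "orders" subjects are hypothetical.

```python
def suspicious_subject_tokens(subject: str) -> list[str]:
    """Return human-readable problems found in a subject filter (empty if none)."""
    problems = []
    tokens = subject.split(".")
    for position, token in enumerate(tokens):
        if token == "":
            problems.append(f"empty token at position {position}")
        elif ">" in token and token != ">":
            # e.g. ">>", which is easy to type by accident in an interactive prompt
            problems.append(f"token {token!r} contains '>' but is not a lone '>'")
        elif token == ">" and position != len(tokens) - 1:
            problems.append("'>' must be the last token")
    return problems

for candidate in ("orders.>", "orders.>>", "orders.>.eu"):
    print(candidate, suspicious_subject_tokens(candidate))
```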

Assuming you have a system account, use that and show the output of nats server list and nats report jsz.

tommylp commented 2 years ago

Reran with a single >:

nats -s <server> --creds <credsFile> stream create <stream> --config stream.config --trace
12:25:47 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":2,"duplicate_window":300000000000}

12:25:47 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","error":{"code":500,"err_code":10049,"description":"malformed or corrupt message"}}

nats: error: could not create Stream: malformed or corrupt message

tommylp commented 2 years ago

Content of the config:

{
  "name": "<stream>",
  "subjects": [
    "<topic>.\u003e"
  ],
  "retention": "limits",
  "max_consumers": -1,
  "max_msgs": -1,
  "max_bytes": -1,
  "max_age": 7776000000000000,
  "max_msg_size": -1,
  "storage": "file",
  "discard": "old",
  "num_replicas": 2,
  "duplicate_window": 300000000000
}
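
For readers checking the numbers: the max_age and duplicate_window values appear to be durations in nanoseconds, which is consistent with the 90d0h0m0s and 5m0s that the CLI prints for the same configuration later in this thread. A small sketch of the conversion, with a hypothetical helper name:

```python
from datetime import timedelta

NANOS_PER_SECOND = 1_000_000_000

def nanos_to_timedelta(nanos: int) -> timedelta:
    """Convert a nanosecond count to a timedelta (truncated to microseconds)."""
    seconds, remainder = divmod(nanos, NANOS_PER_SECOND)
    return timedelta(seconds=seconds, microseconds=remainder // 1000)

# Values copied from the stream config above.
config = {"max_age": 7776000000000000, "duplicate_window": 300000000000}
for key, value in config.items():
    print(key, nanos_to_timedelta(value))
```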

tommylp commented 2 years ago

server list

+-----------------------------------------------------------------------------------------------------------------------------------+
|                                                          Server Overview                                                          |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| Name        | Cluster    | IP        | Version | JS  | Conns | Subs | Routes | GWs | Mem    | CPU | Slow | Uptime   | RTT         |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| nats-core-1 | nats-core  | 0.0.0.0   | 2.7.4   | yes | 0     | 302  | 2      | 0   | 17 MiB | 0.0 | 0    | 27m22s   | 71.283698ms |
| nats-core-0 | nats-core  | 0.0.0.0   | 2.7.4   | yes | 14    | 366  | 2      | 0   | 26 MiB | 0.0 | 0    | 1h59m34s | 71.246344ms |
| nats-core-2 | nats-core  | 0.0.0.0   | 2.7.4   | yes | 5     | 328  | 2      | 0   | 17 MiB | 0.0 | 0    | 29m51s   | 71.193031ms |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
|             | 1 Clusters | 3 Servers |         | 3   | 19    | 996  |        |     | 60 MiB |     | 0    |          |             |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+

+------------------------------------------------------------------------------+
|                               Cluster Overview                               |
+-----------+------------+-------------------+-------------------+-------------+
| Cluster   | Node Count | Outgoing Gateways | Incoming Gateways | Connections |
+-----------+------------+-------------------+-------------------+-------------+
| nats-core | 3          | 0                 | 0                 | 19          |
+-----------+------------+-------------------+-------------------+-------------+
|           | 3          | 0                 | 0                 | 19          |
+-----------+------------+-------------------+-------------------+-------------+

tommylp commented 2 years ago

server report jsz:

+-------------------------------------------------------------------------------------------------------+
|                                           JetStream Summary                                           |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| Server       | Cluster   | Streams | Consumers | Messages | Bytes | Memory | File | API Req | API Err |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| nats-core-2* | nats-core | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 8       | 7       |
| nats-core-0  | nats-core | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 2       | 0       |
| nats-core-1  | nats-core | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 6       | 0       |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
|              |           | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 16      | 7       |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+

+--------------------------------------------------------+
|              RAFT Meta Group Information               |
+-------------+--------+---------+--------+--------+-----+
| Name        | Leader | Current | Online | Active | Lag |
+-------------+--------+---------+--------+--------+-----+
| nats-core-0 |        | true    | true   | 0.22s  | 0   |
| nats-core-1 |        | true    | true   | 0.22s  | 0   |
| nats-core-2 | yes    | true    | true   | 0.00s  | 0   |
+-------------+--------+---------+--------+--------+-----+

ripienaar commented 2 years ago

So if you just change your config to replicas 1, it works? (The config is valid now.)

tommylp commented 2 years ago

Yes, no problem.

nats -s <server> --creds <credsFile> stream create <stream> --config stream.config --trace
12:32:41 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":1,"duplicate_window":300000000000}

12:32:41 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","config":{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msgs_per_subject":-1,"max_msg_size":-1,"discard":"old","storage":"file","num_replicas":1,"duplicate_window":300000000000,"sealed":false,"deny_delete":false,"deny_purge":false,"allow_rollup_hdrs":false},"created":"2022-03-28T10:32:41.197592339Z","state":{"messages":0,"bytes":0,"first_seq":0,"first_ts":"0001-01-01T00:00:00Z","last_seq":0,"last_ts":"0001-01-01T00:00:00Z","consumer_count":0},"cluster":{"name":"nats-core","leader":"nats-core-1"},"did_create":true}

Stream <stream> was created

Information for Stream <stream> created 2022-03-28T12:32:41+02:00

Configuration:

             Subjects: <topic>.>
     Acknowledgements: true
            Retention: File - Limits
             Replicas: 1
       Discard Policy: Old
     Duplicate Window: 5m0s
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 90d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats-core
               Leader: nats-core-1

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 0
              LastSeq: 0
     Active Consumers: 0

tommylp commented 2 years ago

Unfortunately, setting num_replicas to 1 in the backup.json file did not solve the problem.

nats: error: restore failed: malformed or corrupt message

tommylp commented 2 years ago

Do I also need to update the value inside the base64 configuration? That will probably not work, because it would change the checksum value.


tommylp commented 2 years ago

How is the checksum created?

{
  "type": "stream",
  "time": "2022-03-28T08:27:07Z",
  "configuration": "<base64String>",
  "checksum": "31423daa92ee............"
}

ripienaar commented 2 years ago

I think you can pass --replicas when restoring rather than editing the file.

tommylp commented 2 years ago

There is no --replicas flag that I can see. The nats stream restore command has a --config flag that can take a config file, but that did not work either.

tommylp commented 2 years ago

The problem seems to be related to the JetStream leader: evicting the leader, or killing the leader pod in k8s so that leadership moves to another instance, makes it possible to create the stream with 2 replicas.

Doing the same for stream restore still does not work.

tommylp commented 2 years ago

Related issue: https://github.com/nats-io/nats-server/issues/2845

ajax-lizogubenko-s commented 2 years ago

We are also experiencing this problem on nats v2.7.4 with a 3-node cluster, 3 replicas per stream, 10k subjects, and 10k push consumers (one per subject) spread among 10 to 20 streams (the exact number doesn't matter).

After a cluster restart, a lot of consumers (but not all) have no quorum and become stalled.

nvcnvn commented 2 years ago

> The problem seems to be related to the JetStream leader: evicting the leader, or killing the leader pod in k8s so that leadership moves to another instance, makes it possible to create the stream with 2 replicas.
>
> Doing the same for stream restore still does not work.

We're facing the same error when we create a new node pool and evict nats streaming.

What is the procedure to work around this? When this issue happens, our application cannot connect to nats. The only fix we know is to delete the stream and re-create it, which causes some data loss.

derekcollison commented 2 years ago

@sergiilizo and @nvcnvn we would most likely need to jump on a Zoom call to diagnose the situation more thoroughly.

nvcnvn commented 2 years ago

> @sergiilizo and @nvcnvn we would most likely need to jump on a Zoom call to diagnose the situation more thoroughly.

Hi @derekcollison, thanks for your great support. How should we arrange this?

derekcollison commented 2 years ago

Shoot me an email, derek@nats.io.

cchatfield commented 2 years ago

@derekcollison

I received the same error for consumer stalled.

JetStream cluster consumer '$G > configuration > admin_CreateAdminUserCommand_firebase_CreateAdminUser' has NO quorum, stalled.
Healthcheck failed: "JetStream consumer '$G > configuration > admin_AdminUserCreatedEvent_firebase_AdminUserCreated' is not current"

This is a 5-node cluster and the stream was set to replicas 3.

Version 2.8.2 - k8s - attached pvc

I changed the replicas to 5 and the cluster became stable again.

Can you tell me how quorum is calculated for a consumer with a replica of 3 in a 5 node cluster? The only doc for quorum I could find was https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering#the-quorum.

If the quorum calculation is the same (half the nodes plus one), then I assume quorum won't be reached if a node that held the consumer's data (replicas 3) drops out of the 5-node cluster. Is this valid, or am I off base?

derekcollison commented 2 years ago

2.8.3 should be released tomorrow, which hopefully helps out here.

Quorum calculation is N/2+1 (integer division). So for R3 it's 2, for R5 it's 3.
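
As a worked example of that formula, here is a minimal sketch. It assumes N is the replication factor of the stream or consumer (as the R3 and R5 figures suggest) and that the division rounds down; the function names are illustrative only.

```python
def quorum(replicas: int) -> int:
    """Smallest number of peers that forms a majority: N/2 + 1 with integer division."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many peers can be unavailable while a quorum is still reachable."""
    return replicas - quorum(replicas)

for r in (1, 2, 3, 5):
    print(f"R{r}: quorum={quorum(r)}, tolerated failures={tolerated_failures(r)}")
```

Note that under this formula R2 needs both peers, so it tolerates no more failures than R1 does.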

derekcollison commented 1 year ago

Closing for now but feel free to re-open as needed.

osmanovv commented 1 year ago

> Unfortunately, setting num_replicas to 1 in the backup.json file did not solve the problem.

@tommylp, you should just change that property on the existing stream (without going through the backup/restore settings):

nats stream edit <STREAM_NAME> --replicas 3

> When this issue happens, our application cannot connect to nats. The only fix we know is to delete the stream and re-create it, which causes some data loss.

@nvcnvn, instead of deleting the entire stream, just try updating its replicas value:

nats stream edit <STREAM_NAME> --replicas 1
nats stream edit <STREAM_NAME> --replicas 3

This should re-create the replicas on the available servers.