nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

Stream didn't work after I shut down 2 nodes of a 5-node cluster #3642

Closed · SiminMu closed this issue 1 year ago

SiminMu commented 1 year ago

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

Versions of nats-server and affected client libraries used:

nats-server version: 2.8.4; client library: nats.go (the Go client)

OS/Container environment:

Redhat8

Steps or code to reproduce the issue:

I followed the instructions here and set up a cluster with 5 nodes. When I shut down one node, everything looked good. When I shut down a second node, the cluster still worked, but the stream didn't (the nats-server log said "JetStream cluster no metadata leader"). As I understand it, the stream should keep working as long as we shut down fewer than 3 nodes, right?

the stream info:

Configuration:

          Description: Stream used for persisting messages
             Subjects: config.>
     Acknowledgements: true
            Retention: File - Interest
             Replicas: 5
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: 5,000
  Maximum Per Subject: 10
        Maximum Bytes: 1.0 GiB
          Maximum Age: 1h0m0s
 Maximum Message Size: 10 MiB
    Maximum Consumers: 5000

Cluster Information:

                 Name: nats-cluster
               Leader: server0
              Replica: server1, current, seen 0.98s ago
              Replica: server2, current, seen 0.97s ago
              Replica: server3, current, seen 0.97s ago
              Replica: server4, current, seen 0.98s ago
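
For reference, here is a rough Go sketch of a nats.go StreamConfig mapped by hand from the configuration above (this is illustrative, not necessarily my exact creation code; the URL and credentials are placeholders, and the stream name "config" is the one I use later with nats str info config):

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        // Placeholder URL/credentials.
        nc, err := nats.Connect("nats://server0:4222", nats.UserInfo("user2", "secret"))
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // Field values mapped from the `nats str info` output above.
        _, err = js.AddStream(&nats.StreamConfig{
            Name:              "config", // stream name referenced later in this issue
            Description:       "Stream used for persisting messages",
            Subjects:          []string{"config.>"},
            Retention:         nats.InterestPolicy,
            Replicas:          5,
            Storage:           nats.FileStorage,
            Discard:           nats.DiscardOld,
            Duplicates:        2 * time.Minute,
            MaxMsgs:           5000,
            MaxMsgsPerSubject: 10,
            MaxBytes:          1024 * 1024 * 1024, // 1.0 GiB
            MaxAge:            time.Hour,
            MaxMsgSize:        10 * 1024 * 1024, // 10 MiB
            MaxConsumers:      5000,
            // "Allows Msg Delete/Purge: true" and "Allows Rollups: false" match the defaults.
        })
        if err != nil {
            log.Fatal(err)
        }
    }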

The config file of one of the nodes is:

[root@server ~]# cat nats.conf
server_name: server0
port: 4222
http_port: 8222

tls: {
  cert_file: "wildcard.pem"
  key_file: "wildcard.key"
}

authorization: {
    users = [
        {
            user: user1
            permissions: {
                publish: {
                    allow: ["$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>"]
                },
                subscribe: {
                    allow: ["config.>", "delivery.>", "$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>", "_INBOX.>"]
                }
            }
        }
        {
            user: user2
            password: "*****"
            permissions: {
                publish: {
                    allow: ["config.>", "delivery.>", "$JS.>"]
                },
                subscribe: {
                    allow: ["$JS.>", "_INBOX.>"]
                }
            }
        }
    ]
}

jetstream {
   store_dir: data
   max_memory_store: 1073741824
   max_file_store: 10737418240
}

cluster {
  name: nats-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://sever1:6222
    nats-route://sever2:6222
    nats-route://sever3:6222
    nats-route://sever4:6222
  ]
}

After I shut down 2 nodes, the nats command line reported: nats: error: could not pick a Stream to operate on: JetStream system temporarily unavailable (10008)

The nats-server log: JetStream cluster no metadata leader

Expected result:

The cluster and the stream should keep working when 1 or 2 of the 5 nodes are shut down.

Actual result:

The cluster kept working, but the stream stopped working after 2 of the 5 nodes were shut down.

derekcollison commented 1 year ago

The system and stream should work fine.

First let's get you upgraded to a recent server version; the latest is 2.9.6, but 2.9.7 is coming today.

If you still see the issue, post all server configs here, along with how you deploy the servers and how you shut them down.

If one of the servers you stopped was the stream leader, make sure you see a log entry on a still-running server showing that another server took over as the new stream leader.

SiminMu commented 1 year ago

@derekcollison Thanks for your reply. I upgraded to 2.9.7 but still got the same error. Here is the information you requested (I hid/changed some information per our company policy, hopefully that won't confuse you :) ):

Q: How do I deploy the servers? A: I used an Ansible script to deploy nats-server as a systemd service on 5 nodes; here is the service file for one of them (the others are similar):

# /etc/systemd/system/nats-server.service
[Unit]
Description=Nats server for message bus
Documentation=https://github.com/nats-io/nats-server
After=network.target

[Service]
ExecStart=/usr/sbin/nats-server -c /local/nats-server/conf/nats.conf

Group=nats-user
User=nats-user
Restart=always
RestartSec=15
TimeoutStopSec=20s
WorkingDirectory=/local/nats-server

TasksMax=4000
MemoryLimit=32G
CPUQuota=800%

IOSchedulingClass=best-effort
IOSchedulingPriority=0

PrivateTmp=yes
UMask=0077

[Install]
WantedBy=multi-user.target

Q: How do you shut them down? A: I used the systemctl command to shut down 2 nodes one by one; neither of the two nodes was the cluster leader (I am not sure if either was the stream leader, as I don't know how to check).

systemctl stop nats-server

After I shut down the first one, I observed that everything worked (nats str ls, nats str info, and my Go code that does the subscription (client side) and publish (server side)); the nats-server log said:

server3:58128 - rid:56 - Router connection closed: Client Closed
JetStream cluster new metadata leader: server2/cos-nats-cluster
Error trying to connect to route (attempt 257): dial tcp server4:6222: connect: connection refused

After I shut down the second node, nats str info config still worked and listed three alive nodes and two dead nodes, but nats str ls didn't work; it said:

nats: error: could not list streams: JetStream system temporarily unavailable (10008), try --help

My Go client still worked; it could receive NATS messages when my Go server published one. But after I restarted the Go client, it couldn't subscribe any longer, and its log said:

nats: no stream matches subject

For the subscription in the Go client, I was using this method:

nats.js.ChanSubscribe(subject, msgChan, nats.Durable(consumerName), nats.Description(description), nats.ManualAck(), nats.IdleHeartbeat(10*time.Second), nats.MaxDeliver(5), nats.DeliverNew(), nats.DeliverSubject(deliverSubject))
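
For context, this is roughly how that call is wired up in my client (a simplified sketch; the URL, credentials, consumer name, deliver subject, and handling loop here are placeholders, not my real code):

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        // Placeholder URL and credentials -- not the real ones.
        nc, err := nats.Connect("nats://server0:4222", nats.UserInfo("user1", "secret"))
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        msgChan := make(chan *nats.Msg, 64)
        sub, err := js.ChanSubscribe("config.>", msgChan,
            nats.Durable("my-consumer"),              // placeholder consumer name
            nats.Description("example subscription"), // placeholder description
            nats.ManualAck(),
            nats.IdleHeartbeat(10*time.Second),
            nats.MaxDeliver(5),
            nats.DeliverNew(),
            nats.DeliverSubject("delivery.config"), // placeholder deliver subject
        )
        if err != nil {
            log.Fatal(err)
        }
        defer sub.Unsubscribe()

        for msg := range msgChan {
            log.Printf("received %q on %s", msg.Data, msg.Subject)
            msg.Ack()
        }
    }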

The nats-server log said (I did not see in the log that a new stream leader was elected):

server4:40698 - rid:62 - Router connection closed: Client Closed
JetStream cluster no metadata leader
Error trying to connect to route (attempt 257): dial tcp server3:6222: connect: connection refused
Error trying to connect to route (attempt 257): dial tcp server4:6222: connect: connection refused

Q: All server configs A: I used an Ansible template to generate the config files, so the other 4 configs look like the one I pasted previously, and I didn't find any typo. I pasted them here for your reference:

=========================server0=========================
[root@server0 ~]# cat nats.conf
server_name: server0
port: 4222
http_port: 8222

tls: {
  cert_file: "wildcard.pem"
  key_file: "wildcard.key"
}

authorization: {
    users = [
        {
            user: user1
            permissions: {
                publish: {
                    allow: ["$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>"]
                },
                subscribe: {
                    allow: ["config.>", "delivery.>", "$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>", "_INBOX.>"]
                }
            }
        }
        {
            user: user2
            password: "*****"
            permissions: {
                publish: {
                    allow: ["config.>", "delivery.>", "$JS.>"]
                },
                subscribe: {
                    allow: ["$JS.>", "_INBOX.>"]
                }
            }
        }
    ]
}

jetstream {
   store_dir: data
   max_memory_store: 1073741824
   max_file_store: 10737418240
}

cluster {
  name: nats-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://<sever1_ip>:6222
    nats-route://<sever2_ip>:6222
    nats-route://<sever3_ip>:6222
    nats-route://<sever4_ip>:6222
  ]
}
=========================server1=========================
[root@server1 ~]# cat nats.conf
server_name: server1
port: 4222
http_port: 8222

tls: {
  cert_file: "wildcard.pem"
  key_file: "wildcard.key"
}

authorization: {
    users = [
        {
            user: user1
            permissions: {
                publish: {
                    allow: ["$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>"]
                },
                subscribe: {
                    allow: ["config.>", "delivery.>", "$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>", "_INBOX.>"]
                }
            }
        }
        {
            user: user2
            password: "*****"
            permissions: {
                publish: {
                    allow: ["config.>", "delivery.>", "$JS.>"]
                },
                subscribe: {
                    allow: ["$JS.>", "_INBOX.>"]
                }
            }
        }
    ]
}

jetstream {
   store_dir: data
   max_memory_store: 1073741824
   max_file_store: 10737418240
}

cluster {
  name: nats-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://<sever0_ip>:6222
    nats-route://<sever2_ip>:6222
    nats-route://<sever3_ip>:6222
    nats-route://<sever4_ip>:6222
  ]
}
=========================server2=========================
[root@server2 ~]# cat nats.conf
server_name: server2
port: 4222
http_port: 8222

tls: {
  cert_file: "wildcard.pem"
  key_file: "wildcard.key"
}

authorization: {
    users = [
        {
            user: user1
            permissions: {
                publish: {
                    allow: ["$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>"]
                },
                subscribe: {
                    allow: ["config.>", "delivery.>", "$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>", "_INBOX.>"]
                }
            }
        }
        {
            user: user2
            password: "*****"
            permissions: {
                publish: {
                    allow: ["config.>", "delivery.>", "$JS.>"]
                },
                subscribe: {
                    allow: ["$JS.>", "_INBOX.>"]
                }
            }
        }
    ]
}

jetstream {
   store_dir: data
   max_memory_store: 1073741824
   max_file_store: 10737418240
}

cluster {
  name: nats-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://<sever0_ip>:6222
    nats-route://<sever1_ip>:6222
    nats-route://<sever3_ip>:6222
    nats-route://<sever4_ip>:6222
  ]
}
=========================server3=========================
[root@server3 ~]# cat nats.conf
server_name: server3
port: 4222
http_port: 8222

tls: {
  cert_file: "wildcard.pem"
  key_file: "wildcard.key"
}

authorization: {
    users = [
        {
            user: user1
            permissions: {
                publish: {
                    allow: ["$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>"]
                },
                subscribe: {
                    allow: ["config.>", "delivery.>", "$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>", "_INBOX.>"]
                }
            }
        }
        {
            user: user2
            password: "*****"
            permissions: {
                publish: {
                    allow: ["config.>", "delivery.>", "$JS.>"]
                },
                subscribe: {
                    allow: ["$JS.>", "_INBOX.>"]
                }
            }
        }
    ]
}

jetstream {
   store_dir: data
   max_memory_store: 1073741824
   max_file_store: 10737418240
}

cluster {
  name: nats-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://<sever0_ip>:6222
    nats-route://<sever1_ip>:6222
    nats-route://<sever2_ip>:6222
    nats-route://<sever4_ip>:6222
  ]
}
==========================server4========================
[root@server4 ~]# cat nats.conf
server_name: server4
port: 4222
http_port: 8222

tls: {
  cert_file: "wildcard.pem"
  key_file: "wildcard.key"
}

authorization: {
    users = [
        {
            user: user1
            permissions: {
                publish: {
                    allow: ["$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>"]
                },
                subscribe: {
                    allow: ["config.>", "delivery.>", "$JS.ACK.>", "$JS.API.INFO", "$JS.API.STREAM.INFO.>", "$JS.API.STREAM.NAMES", "$JS.API.STREAM.MSG.GET.>", "$JS.API.CONSUMER.>", "$JS.FC.>", "_INBOX.>"]
                }
            }
        }
        {
            user: user2
            password: "*****"
            permissions: {
                publish: {
                    allow: ["config.>", "delivery.>", "$JS.>"]
                },
                subscribe: {
                    allow: ["$JS.>", "_INBOX.>"]
                }
            }
        }
    ]
}

jetstream {
   store_dir: data
   max_memory_store: 1073741824
   max_file_store: 10737418240
}

cluster {
  name: nats-cluster
  listen: 0.0.0.0:6222
  routes: [
    nats-route://<sever0_ip>:6222
    nats-route://<sever1_ip>:6222
    nats-route://<sever2_ip>:6222
    nats-route://<sever3_ip>:6222
  ]
}

Other info that may be useful: I observed that although all of the nodes worked, I still got some abnormal logs from nats-server:

......
  Recovering 1 consumers for stream - '$G > config'
    Missing consumer metafile "/local/nats-server/data/jetstream/$G/streams/config/obs/spiffe:/meta.inf"
......
server2:6222 - rid:9 - Route connection created
server0:6222 - rid:10 - Route connection created
server0:57038 - rid:12 - Route connection created
server0:57038 - rid:12 - Router connection closed: Duplicate Route
server4:6222 - rid:17 - Route connection created
server3:6222 - rid:18 - Route connection created
server3:40684 - rid:28 - Route connection created
server3:40684 - rid:28 - Router connection closed: Duplicate Route
server4:58102 - rid:30 - Route connection created
server4:58112 - rid:31 - Route connection created
server3:40686 - rid:32 - Route connection created
server4:58102 - rid:30 - Router connection closed: Duplicate Route
server4:58112 - rid:31 - Router connection closed: Duplicate Route
server3:40686 - rid:32 - Router connection closed: Duplicate Route
......
derekcollison commented 1 year ago

What does nats server report jetstream show? This has to be invoked with a system user context.

SiminMu commented 1 year ago

I executed the command "nats server report jetstream" as user2 but got an error:

nats: Permissions Violation for Publish to "$SYS.REQ.SERVER.PING.JSZ"

Even when I added "$SYS.>" to permissions.subscribe.allow and permissions.publish.allow for user2, it still didn't work:

nats: error: server request failed, ensure the account used has system privileges and appropriate permissions, try --help

Could you please guide me on how to change the config file so I can execute that command?

SiminMu commented 1 year ago

OK, I found the documentation and added a system account to my config file:

accounts: {
  MYSYS: {
    users: [{user: ***, password: ***}]
  }
}

system_account: MYSYS

I saw "server report jetstream" command could show the jetstream leader. I wanted to reproduce the issue but I found it disappeared! I randomly deleted the node, but nats worked every time(even I deleted the cluster leader or stream leader)... :)

I don't know why it always failed for me at the beginning... I will try more and see if I can reproduce this issue again...

Thanks for your support!

derekcollison commented 1 year ago

These commands need to be issued by a system account user. By default all nats-servers create a system account, $SYS, but they do not create any default users, since this could lead to security issues. If you have a server config setup (vs. operator mode), you can add a system user with the following addition to the configuration file.

    # For access to system account.
    accounts { $SYS { users = [ { user: "admin", pass: "s3cr3t!" } ] } }
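
With that in place you can run the report as that user, for example something like this (exact flags may differ slightly depending on your nats CLI version):

    nats server report jetstream -s nats://server0:4222 --user admin --password 's3cr3t!'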
derekcollison commented 1 year ago

What most likely happened is that you introduced a new peer into the system, so the system thought it had 6 peers vs 5. Quorum for a system is NumPeers/2+1, so with the addition of a 6th peer, the system needed 4 peers to elect a meta-leader. The stream itself was fine, but any API request to the meta layer would fail.
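
Concretely (with integer division): with 5 peers, quorum is 5/2+1 = 3, so stopping 2 servers still leaves 3 and a meta-leader can be elected; with 6 registered peers, quorum is 6/2+1 = 4, so the already-gone old peer plus the 2 servers you stopped left only 3 of 6 alive and no meta-leader could be elected.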

The command I listed above would have shown you that, and you could have removed the server/peer to return to 5 peers for the system.

SiminMu commented 1 year ago

Apologies for adding a new comment to this closed issue. Thanks @derekcollison, you are right, that may be why I hit this issue; I did indeed introduce a new peer to replace an existing one two weeks ago. Just one question: do we have an official guide for doing this kind of upgrade on an existing JetStream cluster?

derekcollison commented 1 year ago

No worries. Yes we have some guides but are always looking to improve. /cc @bruth

In general, if you are adding a truly new node and removing an old one:

  1. Start the new one and let it get added to the system; nats server report jetstream will show the cluster overview.
  2. Shut down the old machine.
  3. Remove its peer from the system.
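
A rough sketch of those steps with the nats CLI, using the system user from the config above (server name and credentials are placeholders; double-check the exact subcommand names and flags against your nats CLI version):

    # 1. Start the new node, then confirm it joined the meta group:
    nats server report jetstream --user admin --password 's3cr3t!'

    # 2. Shut down the old machine:
    systemctl stop nats-server

    # 3. Remove the old peer from the JetStream meta group:
    nats server raft peer-remove <old_server_name> --user admin --password 's3cr3t!'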