nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.49k stars 1.38k forks source link

SIGHUP reload loses JetStream availability #1736

Closed philpennock closed 3 years ago

philpennock commented 3 years ago

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

Versions of nats-server and affected client libraries used:

/usr/local/sbin/nats-server -c /etc/nats/nats.conf -DV ``` [56863] 2020/11/22 20:51:28.835595 [INF] Starting nats-server version 2.2.0-beta.33 [56863] 2020/11/22 20:51:28.835736 [DBG] Go build version go1.15.5 [56863] 2020/11/22 20:51:28.835744 [INF] Git commit [4d51a41d] [56863] 2020/11/22 20:51:28.835823 [WRN] Plaintext passwords detected, use nkeys or bcrypt [56863] 2020/11/22 20:51:28.835833 [INF] Using configuration file: /etc/nats/nats.conf [56863] 2020/11/22 20:51:28.836179 [INF] Starting JetStream [56863] 2020/11/22 20:51:28.836747 [WRN] _ ___ _____ ___ _____ ___ ___ _ __ __ [56863] 2020/11/22 20:51:28.836763 [WRN] _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ | [56863] 2020/11/22 20:51:28.836770 [WRN] | || | _| | | \__ \ | | | / _| / _ \| |\/| | [56863] 2020/11/22 20:51:28.836776 [WRN] \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_| [56863] 2020/11/22 20:51:28.836782 [WRN] [56863] 2020/11/22 20:51:28.836787 [WRN] _ _ [56863] 2020/11/22 20:51:28.836793 [WRN] | |__ ___| |_ __ _ [56863] 2020/11/22 20:51:28.836798 [WRN] | '_ \/ -_) _/ _` | [56863] 2020/11/22 20:51:28.836803 [WRN] |_.__/\___|\__\__,_| [56863] 2020/11/22 20:51:28.836809 [WRN] [56863] 2020/11/22 20:51:28.836814 [WRN] JetStream is a Beta feature [56863] 2020/11/22 20:51:28.836819 [WRN] https://github.com/nats-io/jetstream [56863] 2020/11/22 20:51:28.836825 [INF] [56863] 2020/11/22 20:51:28.836830 [INF] ----------- JETSTREAM ----------- [56863] 2020/11/22 20:51:28.836848 [INF] Max Memory: 32.00 MB [56863] 2020/11/22 20:51:28.836860 [INF] Max Storage: 16.00 GB [56863] 2020/11/22 20:51:28.836870 [INF] Store Directory: "/srv/nats/js1/store" [56863] 2020/11/22 20:51:28.836877 [DBG] Exports: [56863] 2020/11/22 20:51:28.836884 [DBG] $JS.API.INFO [56863] 2020/11/22 20:51:28.836894 [DBG] $JS.API.STREAM.TEMPLATE.CREATE.* [56863] 2020/11/22 20:51:28.836903 [DBG] $JS.API.STREAM.TEMPLATE.NAMES [56863] 2020/11/22 20:51:28.836912 [DBG] $JS.API.STREAM.TEMPLATE.INFO.* [56863] 2020/11/22 20:51:28.836921 [DBG] $JS.API.STREAM.TEMPLATE.DELETE.* [56863] 2020/11/22 20:51:28.836930 [DBG] $JS.API.STREAM.CREATE.* [56863] 2020/11/22 20:51:28.836939 [DBG] $JS.API.STREAM.UPDATE.* [56863] 2020/11/22 20:51:28.836948 [DBG] $JS.API.STREAM.NAMES [56863] 2020/11/22 20:51:28.836959 [DBG] $JS.API.STREAM.LIST [56863] 2020/11/22 20:51:28.836969 [DBG] $JS.API.STREAM.INFO.* [56863] 2020/11/22 20:51:28.836979 [DBG] $JS.API.STREAM.LOOKUP [56863] 2020/11/22 20:51:28.836989 [DBG] $JS.API.STREAM.DELETE.* [56863] 2020/11/22 20:51:28.836999 [DBG] $JS.API.STREAM.PURGE.* [56863] 2020/11/22 20:51:28.837012 [DBG] $JS.API.STREAM.SNAPSHOT.* [56863] 2020/11/22 20:51:28.837022 [DBG] $JS.API.STREAM.RESTORE.* [56863] 2020/11/22 20:51:28.837032 [DBG] $JS.API.STREAM.MSG.DELETE.* [56863] 2020/11/22 20:51:28.837043 [DBG] $JS.API.STREAM.MSG.GET.* [56863] 2020/11/22 20:51:28.837053 [DBG] $JS.API.CONSUMER.CREATE.* [56863] 2020/11/22 20:51:28.837064 [DBG] $JS.API.CONSUMER.DURABLE.CREATE.*.* [56863] 2020/11/22 20:51:28.837077 [DBG] $JS.API.CONSUMER.NAMES.* [56863] 2020/11/22 20:51:28.837089 [DBG] $JS.API.CONSUMER.LIST.* [56863] 2020/11/22 20:51:28.837099 [DBG] $JS.API.CONSUMER.INFO.*.* [56863] 2020/11/22 20:51:28.837110 [DBG] $JS.API.CONSUMER.DELETE.*.* [56863] 2020/11/22 20:51:28.837121 [INF] ---------------------------------------- [56863] 2020/11/22 20:51:28.837377 [DBG] Enabled JetStream for account "Foobar" [56863] 2020/11/22 20:51:28.837394 [DBG] Max Memory: -1 B [56863] 2020/11/22 20:51:28.837404 [DBG] Max Storage: -1 B [56863] 2020/11/22 20:51:28.837450 [INF] Recovering JetStream state for account "Foobar" [56863] 2020/11/22 20:51:28.837683 [INF] JetStream state for account "Foobar" recovered [56863] 2020/11/22 20:51:28.837906 [DBG] Enabled JetStream for account "Anonymous" [56863] 2020/11/22 20:51:28.837921 [DBG] Max Memory: -1 B [56863] 2020/11/22 20:51:28.837931 [DBG] Max Storage: -1 B [56863] 2020/11/22 20:51:28.837967 [INF] Recovering JetStream state for account "Anonymous" [56863] 2020/11/22 20:51:28.839761 [INF] Restored 7 messages for Stream "SVEN-GIT-MAINLOG" [56863] 2020/11/22 20:51:28.839946 [INF] Recovering 1 Consumers for Stream - "SVEN-GIT-MAINLOG" [56863] 2020/11/22 20:51:28.841803 [INF] Restored 1 messages for Stream "SVEN-GIT-PUSHES" [56863] 2020/11/22 20:51:28.843210 [INF] Restored 1 messages for Stream "SVEN-GIT-REFS" [56863] 2020/11/22 20:51:28.843418 [INF] JetStream state for account "Anonymous" recovered [56863] 2020/11/22 20:51:28.843577 [INF] Starting http monitor on 0.0.0.0:8222 [56863] 2020/11/22 20:51:28.843654 [INF] Listening for client connections on 0.0.0.0:4222 [56863] 2020/11/22 20:51:28.843665 [INF] TLS required for client connections [56863] 2020/11/22 20:51:28.843673 [INF] Server id is NBFQQQMBBS4JA4DXYSKM6UJMX4CEQKNN7YUVQJRYLL6YIY7Y6MCWDU4L [56863] 2020/11/22 20:51:28.843680 [INF] Server name is NBFQQQMBBS4JA4DXYSKM6UJMX4CEQKNN7YUVQJRYLL6YIY7Y6MCWDU4L [56863] 2020/11/22 20:51:28.843686 [INF] Server is ready ```
nats.conf, partially redacted ```conf port: 4222 monitor_port: 8222 client_advertise: nats.lan jetstream { store_dir: "/srv/nats/js1/store" max_memory_store: 32MiB max_file_store: 16GiB } tls { ca_file: /etc/ssl/cert.pem cert_file: /etc/nats/tls.crt key_file: /etc/nats/tls.key } accounts: { Anonymous: { jetstream: enabled users: [ {user: anonymous, password: "none"} ] } System: { users: [ {user: control, password: "$crypted-pw-censored"} ] } Foobar: { jetstream: enabled users: [ {user: foo1, password: "$crypted-pw-censored"} ] } } system_account: System no_auth_user: anonymous ```

OS/Container environment:

FreeBSD 11.4, but I think this causes a symptom I saw in an Alpine Linux 3.12.1 container too.

Steps or code to reproduce the issue:

  1. Start NATS server. See it load JS data store, have client able to query JS streams
  2. Send server a SIGHUP
  3. See the nats: error: could not create Stream: nats: no responders available for request error message
  4. Restart server, have nats client able to query streams
  5. Send server a SIGHUP, see that this is repeatable: once one or more SIGHUPs has been received by the NATS server, it stops responding to jetstream client requests

Expected result:

Server should work right after a reload.

Actual result:

Reload breaks Jetstream.

In the logs here, I hit enter for a blank line between the client request, the reload, and the second client request.

nats -DV output across the reload ``` [58811] 2020/11/22 21:02:27.007071 [DBG] 192.168.120.1:57228 - cid:9 - Client connection created [58811] 2020/11/22 21:02:27.007594 [DBG] 192.168.120.1:57228 - cid:9 - Starting TLS client connection handshake [58811] 2020/11/22 21:02:27.054567 [DBG] 192.168.120.1:57228 - cid:9 - TLS handshake complete [58811] 2020/11/22 21:02:27.054595 [DBG] 192.168.120.1:57228 - cid:9 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256 [58811] 2020/11/22 21:02:27.054733 [TRC] 192.168.120.1:57228 - cid:9 - <<- [CONNECT {"verbose":false,"pedantic":false,"tls_required":false,"name":"NATS CLI Version v0.0.19-27-g8d386 d1","lang":"go","version":"1.11.0","protocol":1,"echo":true,"headers":true,"no_responders":true}] [58811] 2020/11/22 21:02:27.054985 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- [PING] [58811] 2020/11/22 21:02:27.055004 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - ->> [PONG] [58811] 2020/11/22 21:02:27.056574 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- [SUB _INBOX.e2g0RbGKlGGqOaTYEqwBir.* 1] [58811] 2020/11/22 21:02:27.056629 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- [PUB $JS.API.INFO _INBOX.e2g0RbGKlGGqOaTYEqwBir.ZrEW2a tg 0] [58811] 2020/11/22 21:02:27.056647 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- MSG_PAYLOAD: [""] [58811] 2020/11/22 21:02:27.057170 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - ->> [MSG _INBOX.e2g0RbGKlGGqOaTYEqwBir.ZrEW2atg 1 223] [58811] 2020/11/22 21:02:27.061477 [DBG] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - Client connection closed: Client Closed [58811] 2020/11/22 21:02:27.061524 [TRC] 192.168.120.1:57228 - cid:9 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <-> [DELSUB 1] [58811] 2020/11/22 21:02:40.664207 [DBG] Trapped "hangup" signal [58811] 2020/11/22 21:02:40.877400 [INF] Reloaded: authorization users [58811] 2020/11/22 21:02:40.877431 [INF] Reloaded: accounts [58811] 2020/11/22 21:02:40.877442 [INF] Reloaded: tls = enabled [58811] 2020/11/22 21:02:40.877545 [TRC] ACCOUNT - <-> [DELSUB 1] [58811] 2020/11/22 21:02:40.877572 [DBG] ACCOUNT - Account connection closed: Internal Client [58811] 2020/11/22 21:02:40.877611 [TRC] ACCOUNT - <-> [DELSUB 23] [58811] 2020/11/22 21:02:40.877634 [TRC] ACCOUNT - <-> [DELSUB 2] [58811] 2020/11/22 21:02:40.877653 [TRC] ACCOUNT - <-> [DELSUB 8] [58811] 2020/11/22 21:02:40.877672 [TRC] ACCOUNT - <-> [DELSUB 13] [58811] 2020/11/22 21:02:40.877691 [TRC] ACCOUNT - <-> [DELSUB 14] [58811] 2020/11/22 21:02:40.877709 [TRC] ACCOUNT - <-> [DELSUB 22] [58811] 2020/11/22 21:02:40.877728 [TRC] ACCOUNT - <-> [DELSUB 21] [58811] 2020/11/22 21:02:40.877746 [TRC] ACCOUNT - <-> [DELSUB 3] [58811] 2020/11/22 21:02:40.877764 [TRC] ACCOUNT - <-> [DELSUB 5] [58811] 2020/11/22 21:02:40.877783 [TRC] ACCOUNT - <-> [DELSUB 9] [58811] 2020/11/22 21:02:40.877803 [TRC] ACCOUNT - <-> [DELSUB 11] [58811] 2020/11/22 21:02:40.877822 [TRC] ACCOUNT - <-> [DELSUB 20] [58811] 2020/11/22 21:02:40.877840 [TRC] ACCOUNT - <-> [DELSUB 6] [58811] 2020/11/22 21:02:40.877857 [TRC] ACCOUNT - <-> [DELSUB 12] [58811] 2020/11/22 21:02:40.877878 [TRC] ACCOUNT - <-> [DELSUB 16] [58811] 2020/11/22 21:02:40.877896 [TRC] ACCOUNT - <-> [DELSUB 18] [58811] 2020/11/22 21:02:40.877914 [TRC] ACCOUNT - <-> [DELSUB 19] [58811] 2020/11/22 21:02:40.877932 [TRC] ACCOUNT - <-> [DELSUB 17] [58811] 2020/11/22 21:02:40.877951 [TRC] ACCOUNT - <-> [DELSUB 1] [58811] 2020/11/22 21:02:40.877968 [TRC] ACCOUNT - <-> [DELSUB 4] [58811] 2020/11/22 21:02:40.877988 [TRC] ACCOUNT - <-> [DELSUB 7] [58811] 2020/11/22 21:02:40.878006 [TRC] ACCOUNT - <-> [DELSUB 10] [58811] 2020/11/22 21:02:40.878024 [TRC] ACCOUNT - <-> [DELSUB 15] [58811] 2020/11/22 21:02:40.878040 [DBG] ACCOUNT - Account connection closed: Internal Client [58811] 2020/11/22 21:02:40.878056 [DBG] ACCOUNT - Account connection closed: Internal Client [58811] 2020/11/22 21:02:40.878079 [TRC] ACCOUNT - <-> [DELSUB 15] [58811] 2020/11/22 21:02:40.878099 [TRC] ACCOUNT - <-> [DELSUB 18] [58811] 2020/11/22 21:02:40.878117 [TRC] ACCOUNT - <-> [DELSUB 23] [58811] 2020/11/22 21:02:40.878135 [TRC] ACCOUNT - <-> [DELSUB 11] [58811] 2020/11/22 21:02:40.878154 [TRC] ACCOUNT - <-> [DELSUB 20] [58811] 2020/11/22 21:02:40.878172 [TRC] ACCOUNT - <-> [DELSUB 22] [58811] 2020/11/22 21:02:40.878190 [TRC] ACCOUNT - <-> [DELSUB 2] [58811] 2020/11/22 21:02:40.878208 [TRC] ACCOUNT - <-> [DELSUB 3] [58811] 2020/11/22 21:02:40.878228 [TRC] ACCOUNT - <-> [DELSUB 8] [58811] 2020/11/22 21:02:40.878247 [TRC] ACCOUNT - <-> [DELSUB 10] [58811] 2020/11/22 21:02:40.878266 [TRC] ACCOUNT - <-> [DELSUB 12] [58811] 2020/11/22 21:02:40.878284 [TRC] ACCOUNT - <-> [DELSUB 14] [58811] 2020/11/22 21:02:40.878302 [TRC] ACCOUNT - <-> [DELSUB 16] [58811] 2020/11/22 21:02:40.878320 [TRC] ACCOUNT - <-> [DELSUB 17] [58811] 2020/11/22 21:02:40.878339 [TRC] ACCOUNT - <-> [DELSUB 1] [58811] 2020/11/22 21:02:40.878357 [TRC] ACCOUNT - <-> [DELSUB 5] [58811] 2020/11/22 21:02:40.878376 [TRC] ACCOUNT - <-> [DELSUB 7] [58811] 2020/11/22 21:02:40.878394 [TRC] ACCOUNT - <-> [DELSUB 9] [58811] 2020/11/22 21:02:40.878411 [TRC] ACCOUNT - <-> [DELSUB 19] [58811] 2020/11/22 21:02:40.878429 [TRC] ACCOUNT - <-> [DELSUB 4] [58811] 2020/11/22 21:02:40.878447 [TRC] ACCOUNT - <-> [DELSUB 6] [58811] 2020/11/22 21:02:40.878466 [TRC] ACCOUNT - <-> [DELSUB 13] [58811] 2020/11/22 21:02:40.878484 [TRC] ACCOUNT - <-> [DELSUB 21] [58811] 2020/11/22 21:02:40.878500 [DBG] ACCOUNT - Account connection closed: Internal Client [58811] 2020/11/22 21:02:40.878536 [INF] Reloaded server configuration [58811] 2020/11/22 21:02:46.959419 [DBG] 192.168.120.1:57230 - cid:11 - Client connection created [58811] 2020/11/22 21:02:46.959556 [DBG] 192.168.120.1:57230 - cid:11 - Starting TLS client connection handshake [58811] 2020/11/22 21:02:47.019747 [DBG] 192.168.120.1:57230 - cid:11 - TLS handshake complete [58811] 2020/11/22 21:02:47.019775 [DBG] 192.168.120.1:57230 - cid:11 - TLS version 1.3, cipher suite TLS_AES_128_GCM_SHA256 [58811] 2020/11/22 21:02:47.019910 [TRC] 192.168.120.1:57230 - cid:11 - <<- [CONNECT {"verbose":false,"pedantic":false,"tls_required":false,"name":"NATS CLI Version v0.0.19-27-g8d38 6d1","lang":"go","version":"1.11.0","protocol":1,"echo":true,"headers":true,"no_responders":true}] [58811] 2020/11/22 21:02:47.020014 [TRC] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- [PING] [58811] 2020/11/22 21:02:47.020033 [TRC] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - ->> [PONG] [58811] 2020/11/22 21:02:47.021745 [TRC] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- [SUB _INBOX.7sX7qtrb7vQnbJxaXwD9F2.* 1] [58811] 2020/11/22 21:02:47.021792 [TRC] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- [PUB $JS.API.INFO _INBOX.7sX7qtrb7vQnbJxaXwD9F2.GcNCP j2g 0] [58811] 2020/11/22 21:02:47.021814 [TRC] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <<- MSG_PAYLOAD: [""] [58811] 2020/11/22 21:02:47.024865 [DBG] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - Client connection closed: Client Closed [58811] 2020/11/22 21:02:47.024905 [TRC] 192.168.120.1:57230 - cid:11 - "v1.11.0:go:NATS CLI Version v0.0.19-27-g8d386d1" - <-> [DELSUB 1] ^C[58811] 2020/11/22 21:02:49.844273 [DBG] Trapped "interrupt" signal [58811] 2020/11/22 21:02:49.844379 [TRC] JETSTREAM - <-> [DELSUB 2] [58811] 2020/11/22 21:02:49.844408 [TRC] JETSTREAM - <-> [DELSUB 3] [58811] 2020/11/22 21:02:49.844512 [DBG] JETSTREAM - JetStream connection closed: Client Closed [58811] 2020/11/22 21:02:49.846995 [DBG] JETSTREAM - JetStream connection closed: Client Closed [58811] 2020/11/22 21:02:49.847203 [DBG] JETSTREAM - JetStream connection closed: Client Closed [58811] 2020/11/22 21:02:49.847342 [DBG] JETSTREAM - JetStream connection closed: Client Closed [58811] 2020/11/22 21:02:49.847474 [INF] Initiating Shutdown... [58811] 2020/11/22 21:02:49.847590 [DBG] Client accept loop exiting.. [58811] 2020/11/22 21:02:49.847609 [INF] Server Exiting.. ```
kozlovic commented 3 years ago

I believe I have fixed that in the mqtt branch (reload.go). I am not referencing the commit here since the mqtt branch is likely to be rebased, forced-pushed, etc.. today or so, but here is the change which I think fixes the issue:

    // We will double check all JetStream configs on a reload.
    if checkJetStream {
-       s.configAllJetStreamAccounts()
+       if s.globalAccountOnly() {
+           s.GlobalAccount().EnableJetStream(nil)
+       } else {
+           s.configAllJetStreamAccounts()
+       }
    }

    if res := s.AccountResolver(); res != nil {
derekcollison commented 3 years ago

Thanks @kozlovic

derekcollison commented 3 years ago

Trying to reproduce in a test now, do you change the config when you send it a signal?

kozlovic commented 3 years ago

As a reminder, this is fixed in the mqtt branch.

kozlovic commented 3 years ago

I don't seem to be able to reproduce from main branch.. let me see what test in MQTT had an issue that made be have to add the above code in reload.go to fix the issue...

derekcollison commented 3 years ago

I can't reproduce so far either..

kozlovic commented 3 years ago

Ok, you need to have a config reload try to reload auth. This is when the global account is recreated (but without JS). Then trying to say do a nats con info would fail.

derekcollison commented 3 years ago

Its the assignment of the system account that is causing the issue, I have recreated and have a test.