nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Consumer name list request returns "JetStream system temporarily unavailable" error #4913

Open liyancoding opened 8 months ago

liyancoding commented 8 months ago

Observed behavior

 "{\"type\":\"io.nats.jetstream.advisory.v1.api_audit\",\"id\":\"2umOADXZzhtp6yEmc0KZml\",\"timestamp\":\"2023-12-27T08:05:22.330372788Z\",\"server\":\"hdh-nats-v1-2\",\"client\":{\"acc\":\"$G\",\"rtt\":982011,\"server\":\"hdh-nats-v1-1\",\"cluster\":\"mss-nats\"},\"subject\":\"$JS.API.CONSUMER.NAMES.SMART_LIVE_START\",\"request\":\"{\\\"offset\\\":0}\",\"response\":\"{\\\"type\\\":\\\"io.nats.jetstream.api.v1.consumer_names_response\\\",\\\"error\\\":{\\\"code\\\":503,\\\"err_code\\\":10008,\\\"description\\\":\\\"JetStream system temporarily unavailable\\\"},\\\"total\\\":0,\\\"offset\\\":0,\\\"limit\\\":0,\\\"consumers\\\":[]}\"}"

Expected behavior

The error persisted after the cluster was restarted. The eventual workaround was to deploy NATS to other nodes, which worked without problems. Deploying back to the original node afterwards also worked. However, we never found out why the stream was temporarily unavailable.

Server and client version

nats-server: v2.9.20

Host environment

No response

Steps to reproduce

No response
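
The advisory under "Observed behavior" is JSON nested inside JSON: the whole advisory was logged as a quoted string, and its `response` field is again a JSON document stored as a string. The sketch below shows one way to dig the API error out of such a line. The helper name `extract_api_error` is made up for this illustration, and the sample is a shortened copy of the advisory above, rebuilt with `json.dumps` so the escaping is generated rather than typed by hand.

```python
import json


def extract_api_error(raw):
    """Return (code, err_code, description) from an api_audit advisory, or None."""
    doc = json.loads(raw)
    # The advisory in this report was logged as a JSON string literal, so the
    # first decode may yield a str that has to be decoded a second time.
    if isinstance(doc, str):
        doc = json.loads(doc)
    # The "response" field is itself a JSON document stored as a string.
    response = json.loads(doc["response"])
    err = response.get("error")
    if err is None:
        return None
    return err.get("code"), err.get("err_code"), err.get("description")


# Shortened copy of the advisory from this report (only a few fields kept).
_response = {
    "type": "io.nats.jetstream.api.v1.consumer_names_response",
    "error": {
        "code": 503,
        "err_code": 10008,
        "description": "JetStream system temporarily unavailable",
    },
    "total": 0,
    "offset": 0,
    "limit": 0,
    "consumers": [],
}
_advisory = {
    "type": "io.nats.jetstream.advisory.v1.api_audit",
    "subject": "$JS.API.CONSUMER.NAMES.SMART_LIVE_START",
    "request": json.dumps({"offset": 0}),
    "response": json.dumps(_response),
}
# Encode twice to mimic the quoted form seen in the report.
sample = json.dumps(json.dumps(_advisory))

print(extract_api_error(sample))
```

Filtering advisories this way makes it easier to see which API subjects return 503 / 10008 and how often, which is useful context when reporting the problem.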

derekcollison commented 8 months ago

Make sure you are running the latest release: for 2.9.x that is 2.9.24, and for 2.10.x it is 2.10.7.

liyancoding commented 8 months ago

What's the reason?

derekcollison commented 8 months ago

We have a tremendous number of users and a very fast-growing number of large customers. It's difficult for our team to support or triage issues for the OSS community if they are not on the latest version.

If you are a paying customer, that is different.

derekcollison commented 8 months ago

Any updates?

liyancoding commented 8 months ago

The version is v2.9.23, not the latest. However, NATS memory usage keeps increasing and never decreases. The stream is enabled. This is a serious problem.

derekcollison commented 8 months ago

Are you a Synadia customer?

liyancoding commented 8 months ago

What is a Synadia customer?

derekcollison commented 8 months ago

Synadia is the company behind the NATS.io ecosystem. We have customers and we prioritize them in terms of GH issues etc.

liyancoding commented 8 months ago

We're not Synadia customers.

derekcollison commented 8 months ago

OK, for our OSS users we ask that you upgrade to the latest server and clients. The server is at 2.10.9 now. If the issue persists, we would be happy to dig in and work with you on a solution.

liyancoding commented 8 months ago

ok, thanks.

kohlisid commented 1 month ago

@derekcollison I'm seeing a similar issue here. There are messages on the stream, but when I try to consume or create new consumers for the same stream, I get this error:

? Select a Consumer xxx-simple-pipeline-out-0
nats: error: could not load Consumer xxx-simple-pipeline-out-0 > xxx-simple-pipeline-out-0: JetStream system temporarily unavailable (10008)

The memory usage is also very high.

Using

[7] 2024/08/06 23:44:37.076574 [INF] Starting nats-server
[7] 2024/08/06 23:44:37.076714 [INF]   Version:  2.10.18
[7] 2024/08/06 23:44:37.076718 [INF]   Git:      [57d23ac]

kohlisid commented 1 month ago

              Subjects: xxx-simple-pipeline-out-0
              Replicas: 3
               Storage: File

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 1m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: 100,000
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: 3d0h0m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: default
                Leader: 
               Replica: isbsvc-default-js-0, outdated, seen 18m6s ago, 10,310 operations behind
               Replica: isbsvc-default-js-2, outdated, seen 26m47s ago, 10,310 operations behind
               Replica: isbsvc-default-js-4, outdated, seen 17m47s ago

State:

              Messages: 10,308
                 Bytes: 13 GiB
        First Sequence: 1 @ 2024-08-06 23:47:07 UTC
         Last Sequence: 10,308 @ 2024-08-06 23:55:47 UTC
      Active Consumers: 1
    Number of Subjects: 1

derekcollison commented 1 month ago

The NRG layer looks like it's struggling. This could be a network issue or a misconfiguration of the NATS system.
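
The cluster information above lists no leader and three replicas marked as outdated. A Raft group with R replicas needs floor(R/2) + 1 members to agree before it can elect a leader, so an R3 stream needs two. The sketch below parses the `Replica:` lines from the stream info in this thread and compares the number of current replicas against that quorum size. The regular expression is written against the output shown here only and may need adjusting for other CLI versions; `parse_replicas` and `quorum_size` are names invented for the example.

```python
import re

# Matches lines such as:
#   Replica: isbsvc-default-js-0, outdated, seen 18m6s ago, 10,310 operations behind
REPLICA_RE = re.compile(
    r"Replica:\s+(?P<name>[^,\s]+),\s+(?P<state>\w+),\s+seen\s+(?P<seen>\S+)\s+ago"
    r"(?:,\s+(?P<lag>[\d,]+)\s+operations behind)?"
)


def parse_replicas(info_text):
    """Extract name, state, last-seen and lag for each Replica line."""
    replicas = []
    for m in REPLICA_RE.finditer(info_text):
        lag = m.group("lag")
        replicas.append({
            "name": m.group("name"),
            "state": m.group("state"),
            "seen": m.group("seen"),
            # The lag is optional in the output; strip thousands separators.
            "lag": int(lag.replace(",", "")) if lag else None,
        })
    return replicas


def quorum_size(replica_count):
    """Majority needed by a Raft group with replica_count members."""
    return replica_count // 2 + 1


# Replica lines copied from the stream info in this thread.
SAMPLE = """
                Replica: isbsvc-default-js-0, outdated, seen 18m6s ago, 10,310 operations behind
                Replica: isbsvc-default-js-2, outdated, seen 26m47s ago, 10,310 operations behind
                Replica: isbsvc-default-js-4, outdated, seen 17m47s ago
"""

replicas = parse_replicas(SAMPLE)
current = sum(1 for r in replicas if r["state"] == "current")
needed = quorum_size(3)  # the stream reports "Replicas: 3"
print(f"{current} current replica(s), quorum needs {needed}")
```

The current/outdated labels are only the view of the server that answered the request, so treat the result as a rough signal rather than proof of lost quorum.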

kohlisid commented 1 month ago

I believe I was hitting a disk limit in this case. I have increased the limits and will try to reproduce the problem to see whether it occurs again. On another note, when the available disk space does not match the total_storage allowed for JetStream, what is the expected behaviour?

derekcollison commented 1 month ago

If a server encounters a quota issue with the underlying store, it will log it and shut down JetStream for that server. You will see that clearly in the logs.
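
For the mismatch asked about above, an operator-side check can flag the problem before the server hits it. The sketch below compares free space on the filesystem holding the JetStream store directory with the storage limit configured for JetStream. It is not something the server does: `check_store_dir`, `storage_headroom`, the path, and the 10 GiB limit are all assumptions made for this illustration, and the calculation ignores data JetStream has already written.

```python
import shutil


def storage_headroom(free_bytes, configured_limit_bytes):
    """Free bytes left over if JetStream were to fill its configured limit.

    A negative result means the configured limit is larger than the space
    actually available, so the disk could fill before the limit is reached.
    """
    return free_bytes - configured_limit_bytes


def check_store_dir(store_dir, configured_limit_bytes):
    # shutil.disk_usage reports total, used and free bytes for the
    # filesystem that contains store_dir.
    usage = shutil.disk_usage(store_dir)
    return storage_headroom(usage.free, configured_limit_bytes)


if __name__ == "__main__":
    # Hypothetical values: the current directory and a 10 GiB limit.
    headroom = check_store_dir(".", 10 * 1024**3)
    if headroom < 0:
        print(f"configured limit exceeds free space by {-headroom} bytes")
    else:
        print(f"{headroom} bytes of headroom beyond the configured limit")
```

Running a check like this on each node (for example, on each pod's volume) would show whether the configured limit can be satisfied by the disk underneath it.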

kohlisid commented 1 month ago

@derekcollison The tail of the server logs was as follows:

[7] 2024/08/07 00:33:14.149953 [DBG] RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:351 lastTerm:1 lastIndex:245 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:14.149992 [WRN] JetStream cluster stream 'js > KV_xxxx-simple-pipeline-in_SOURCE_OT' has NO quorum, stalled
[7] 2024/08/07 00:33:14.211314 [DBG] 10.214.110.10:6222 - rid:12 - Router Ping Timer
[7] 2024/08/07 00:33:14.316765 [DBG] 10.214.83.38:57342 - rid:11 - Router Ping Timer
[7] 2024/08/07 00:33:14.327738 [DBG] RAFT [JLAxTIGX - S-R3F-4E4Qm94d] Sending out voteRequest {term:347 lastTerm:2 lastIndex:10181 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:14.419642 [DBG] 10.214.110.10:6222 - rid:14 - Router Ping Timer
[7] 2024/08/07 00:33:14.427824 [DBG] 10.214.110.10:6222 - rid:13 - Router Ping Timer
[7] 2024/08/07 00:33:15.179397 [DBG] 10.214.76.98:6222 - rid:17 - Router Ping Timer
[7] 2024/08/07 00:33:15.195393 [DBG] 10.214.76.98:6222 - rid:16 - Router Ping Timer
[7] 2024/08/07 00:33:15.195410 [DBG] 10.214.110.10:6222 - rid:18 - Router Ping Timer
[7] 2024/08/07 00:33:15.200489 [DBG] 10.214.64.115:6222 - rid:19 - Router Ping Timer
[7] 2024/08/07 00:33:15.260139 [DBG] 10.214.64.115:6222 - rid:21 - Router Ping Timer
[7] 2024/08/07 00:33:15.269118 [DBG] 10.214.64.115:6222 - rid:23 - Router Ping Timer
[7] 2024/08/07 00:33:15.327054 [DBG] 10.214.64.115:6222 - rid:20 - Router Ping Timer
[7] 2024/08/07 00:33:15.364602 [DBG] 10.214.76.98:6222 - rid:22 - Router Ping Timer
[7] 2024/08/07 00:33:15.415052 [DBG] 10.214.76.98:6222 - rid:24 - Router Ping Timer
[7] 2024/08/07 00:33:17.817384 [DBG] RAFT [JLAxTIGX - C-R3F-LwsCNRjm] Sending out voteRequest {term:339 lastTerm:2 lastIndex:13861 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:21.063380 [DBG] RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:352 lastTerm:1 lastIndex:245 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:22.168353 [DBG] RAFT [JLAxTIGX - S-R3F-4E4Qm94d] Sending out voteRequest {term:348 lastTerm:2 lastIndex:10181 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:26.677069 [WRN] JetStream cluster consumer 'js > xxxx-simple-pipeline-out-2 > xxxx-simple-pipeline-out-2' has NO quorum, stalled.
[7] 2024/08/07 00:33:26.677145 [DBG] RAFT [JLAxTIGX - C-R3F-LwsCNRjm] Sending out voteRequest {term:340 lastTerm:2 lastIndex:13861 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:26.927723 [DBG] RAFT [JLAxTIGX - S-R3F-4E4Qm94d] Sending out voteRequest {term:349 lastTerm:2 lastIndex:10181 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:26.927763 [WRN] JetStream cluster stream 'js > xxxx-simple-pipeline-out-2' has NO quorum, stalled
[7] 2024/08/07 00:33:29.931302 [DBG] RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:353 lastTerm:1 lastIndex:245 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:30.932134 [INF] JetStream cluster new consumer leader for 'js > KV_xxxx-simple-pipeline-in_SOURCE_OT > ohrI01x9'
[7] 2024/08/07 00:33:30.935159 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2024/08/07 00:33:30.935181 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2024/08/07 00:33:30.936133 [DBG] RAFT [JLAxTIGX - _meta_] Installing snapshot of 11619 bytes
[7] 2024/08/07 00:33:33.552226 [DBG] RAFT [JLAxTIGX - C-R3F-LwsCNRjm] Sending out voteRequest {term:341 lastTerm:2 lastIndex:13861 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:33.552377 [DBG] RAFT [JLAxTIGX - S-R3F-4E4Qm94d] Sending out voteRequest {term:350 lastTerm:2 lastIndex:10181 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:35.777956 [DBG] RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:354 lastTerm:1 lastIndex:245 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:35.777997 [WRN] JetStream cluster stream 'js > KV_xxxx-simple-pipeline-in_SOURCE_OT' has NO quorum, stalled

derekcollison commented 1 month ago

Yes, no one is responding to the votes. So either the system is misconfigured or some servers have shut down the JetStream subsystem.
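
The pattern described here can be read straight from the debug log: in Raft, a candidate bumps its term each time it starts a new election, so terms that keep climbing for the same group without a leader being elected are consistent with vote requests going unanswered. The sketch below counts `voteRequest` lines per Raft group and reports the term range. The regular expression is written against the log lines in this thread only, and `summarize_vote_requests` is a name invented for the example.

```python
import re
from collections import defaultdict

# Matches e.g. "RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:351 ..."
VOTE_RE = re.compile(
    r"RAFT \[(?P<node>\S+) - (?P<group>\S+)\] Sending out voteRequest \{term:(?P<term>\d+)"
)


def summarize_vote_requests(log_text):
    """Per Raft group: number of vote requests and the first/last term seen."""
    terms = defaultdict(list)
    for m in VOTE_RE.finditer(log_text):
        terms[m.group("group")].append(int(m.group("term")))
    return {
        group: {"requests": len(ts), "first_term": min(ts), "last_term": max(ts)}
        for group, ts in terms.items()
    }


# A few lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
[7] 2024/08/07 00:33:14.149953 [DBG] RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:351 lastTerm:1 lastIndex:245 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:14.327738 [DBG] RAFT [JLAxTIGX - S-R3F-4E4Qm94d] Sending out voteRequest {term:347 lastTerm:2 lastIndex:10181 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:17.817384 [DBG] RAFT [JLAxTIGX - C-R3F-LwsCNRjm] Sending out voteRequest {term:339 lastTerm:2 lastIndex:13861 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:21.063380 [DBG] RAFT [JLAxTIGX - S-R3F-66I5xvhX] Sending out voteRequest {term:352 lastTerm:1 lastIndex:245 candidate:JLAxTIGX reply:}
[7] 2024/08/07 00:33:22.168353 [DBG] RAFT [JLAxTIGX - S-R3F-4E4Qm94d] Sending out voteRequest {term:348 lastTerm:2 lastIndex:10181 candidate:JLAxTIGX reply:}
"""

summary = summarize_vote_requests(SAMPLE_LOG)
for group, stats in sorted(summary.items()):
    print(group, stats)
```

Running the same summary on every server in the cluster would show whether all nodes are stuck campaigning or whether some have stopped participating altogether.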