nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

JetStream is enabled for the account, but when initiating the JetStream cluster with JWT authentication, JetStream is not enabled #4924

Open shankarmn94 opened 10 months ago

shankarmn94 commented 10 months ago

Observed behavior

operator: xxxxx
system_account: yyyy

Configuration of the NATS-based resolver:

resolver {
    type: full
    # Directory in which the account jwt will be stored
    dir: './jwt'
    # In order to support jwt deletion, set to true
    # If the resolver type is full, delete will rename the jwt.
    # This is to allow manual restoration in case of inadvertent deletion.
    # To restore a jwt, remove the added suffix .delete and restart or send a reload signal.
    # To free up storage you must manually delete files with the suffix .delete.
    allow_delete: false
    # Interval at which a nats-server with a NATS-based account resolver will compare
    # its state with one random NATS-based account resolver in the cluster and, if needed,
    # exchange JWTs and converge on the same set of JWTs.
    interval: "2m"
    # Timeout for lookup requests in case an account does not exist locally.
    timeout: "60s"
}

resolver_preload: {
    abcdefgh
}

server_name=n1-coordin1
listen=4222

jetstream {
  max_mem: 12G,
  max_file: 100G,
  store_dir="/natsdata"
}

http_port: 8222

cluster {
  name: lax
  listen: 0.0.0.0:6222
  routes: [
    nats://10.102.0.4:6222
    nats://10.102.0.5:6222
  ]
}

debug: false
trace:  false
logtime: true
max_connections: 5000000
max_payload: 8388608
write_deadline: "300s"

The same configuration, with the IPs and server name changed, is used on the other nodes. When I start the JetStream cluster as:

/usr/bin/nats-server -c /etc/nats_server/resolver.conf -js -l /var/log/nats_server/nats.log

[931351] 2024/01/03 17:20:21.946398 [INF] Starting nats-server
[931351] 2024/01/03 17:20:21.946480 [INF]   Version:  2.10.3
[931351] 2024/01/03 17:20:21.946486 [INF]   Git:      [12344]
[931351] 2024/01/03 17:20:21.946490 [INF]   Name:     n1-coordin1
[931351] 2024/01/03 17:20:21.946495 [INF]   Node:     xxx
[931351] 2024/01/03 17:20:21.946499 [INF]   ID:   xxxxx
[931351] 2024/01/03 17:20:21.946521 [INF] Using configuration file: /etc/nats_server/resolver.conf
[931351] 2024/01/03 17:20:21.946525 [INF] Trusted Operators
[931351] 2024/01/03 17:20:21.946529 [INF]   System  : ""
[931351] 2024/01/03 17:20:21.946533 [INF]   Operator: "secu"
[931351] 2024/01/03 17:20:21.946538 [INF]   Issued  : 2023-12-25 19:04:59 +0000 UTC
[931351] 2024/01/03 17:20:21.946554 [INF]   Expires : Never
[931351] 2024/01/03 17:20:21.947745 [INF] Starting http monitor on 0.0.0.0:8222
[931351] 2024/01/03 17:20:21.947849 [INF] Managing all jwt in exclusive directory /jwt
[931351] 2024/01/03 17:20:21.947870 [INF] Starting JetStream
[931351] 2024/01/03 17:20:21.947978 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[931351] 2024/01/03 17:20:21.947984 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[931351] 2024/01/03 17:20:21.947986 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[931351] 2024/01/03 17:20:21.947988 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[931351] 2024/01/03 17:20:21.947990 [INF]
[931351] 2024/01/03 17:20:21.947993 [INF]          https://docs.nats.io/jetstream
[931351] 2024/01/03 17:20:21.947995 [INF]
[931351] 2024/01/03 17:20:21.947997 [INF] ---------------- JETSTREAM ----------------
[931351] 2024/01/03 17:20:21.948003 [INF]   Max Memory:      11.18 GB
[931351] 2024/01/03 17:20:21.948007 [INF]   Max Storage:     93.13 GB
[931351] 2024/01/03 17:20:21.948009 [INF]   Store Directory: "/natsdata/jetstream"
[931351] 2024/01/03 17:20:21.948011 [INF] -------------------------------------------
[931351] 2024/01/03 17:20:22.036721 [INF]   Starting restore for stream 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > agent_assetinfo'
[931351] 2024/01/03 17:20:22.049547 [INF]   Restored 31,382 messages for stream 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > agent_assetinfo' in 13ms
[931351] 2024/01/03 17:20:22.049706 [INF]   Starting restore for stream 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > agent_jobdata'
[931351] 2024/01/03 17:20:22.049980 [INF]   Restored 0 messages for stream 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > agent_jobdata' in 0s

After some streams and messages have been restored:

[918701] 2024/01/03 16:23:55.827095 [INF]   Recovering 1 consumers for stream - 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > tenant_sync1'
[918701] 2024/01/03 16:23:55.827236 [INF]   Recovering 1 consumers for stream - 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > tenant_sync2'
[918701] 2024/01/03 16:23:55.827437 [INF]   Recovering 1 consumers for stream - 'AAWB4CNHGPWJVBDUMZIS4GSN4K5DT3XYNGOGN6EC5GDQBVW5BAYYGL3G > tenant_sync3'
[918701] 2024/01/03 16:23:55.827646 [INF] Starting JetStream cluster
[918701] 2024/01/03 16:23:55.827662 [INF] Creating JetStream metadata controller
[918701] 2024/01/03 16:23:55.827848 [INF] JetStream cluster recovering state
[918701] 2024/01/03 16:23:55.837987 [INF] Listening for client connections on 0.0.0.0:4222
[918701] 2024/01/03 16:23:55.838282 [INF] Server is ready
[918701] 2024/01/03 16:23:55.838347 [INF] Cluster name is lax
[918701] 2024/01/03 16:23:55.838409 [INF] Listening for route connections on 0.0.0.0:6222

On the other two nodes, the logs look like this:

[1351310] 2024/01/03 17:09:33.502213 [INF] Starting nats-server
[1351310] 2024/01/03 17:09:33.503181 [INF]   Version:  2.10.3
[1351310] 2024/01/03 17:09:33.503186 [INF]   Git:      [12344]
[1351310] 2024/01/03 17:09:33.503190 [INF]   Cluster:  lax
[1351310] 2024/01/03 17:09:33.503193 [INF]   Name:     n1-pod-102-worker-2
[1351310] 2024/01/03 17:09:33.503199 [INF]   Node:     xxxx
[1351310] 2024/01/03 17:09:33.503202 [INF]   ID:   xxxx
[1351310] 2024/01/03 17:09:33.503221 [INF] Using configuration file: /etc/nats_server/resolver.conf
[1351310] 2024/01/03 17:09:33.503226 [INF] Trusted Operators
[1351310] 2024/01/03 17:09:33.503230 [INF]   System  : ""
[1351310] 2024/01/03 17:09:33.503234 [INF]   Operator: "secu"
[1351310] 2024/01/03 17:09:33.503238 [INF]   Issued  : 2023-12-25 19:04:59 +0000 UTC
[1351310] 2024/01/03 17:09:33.503259 [INF]   Expires : Never
[1351310] 2024/01/03 17:09:33.503901 [INF] Starting http monitor on 0.0.0.0:8222
[1351310] 2024/01/03 17:09:33.503982 [INF] Managing all jwt in exclusive directory /jwt
[1351310] 2024/01/03 17:09:33.503992 [INF] Starting JetStream
[1351310] 2024/01/03 17:09:33.504091 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[1351310] 2024/01/03 17:09:33.504096 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[1351310] 2024/01/03 17:09:33.504098 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[1351310] 2024/01/03 17:09:33.504101 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[1351310] 2024/01/03 17:09:33.504103 [INF]
[1351310] 2024/01/03 17:09:33.504106 [INF]          https://docs.nats.io/jetstream
[1351310] 2024/01/03 17:09:33.504108 [INF]
[1351310] 2024/01/03 17:09:33.504111 [INF] ---------------- JETSTREAM ----------------
[1351310] 2024/01/03 17:09:33.504118 [INF]   Max Memory:      11.18 GB
[1351310] 2024/01/03 17:09:33.504121 [INF]   Max Storage:     93.13 GB
[1351310] 2024/01/03 17:09:33.504124 [INF]   Store Directory: "/natsdata/jetstream"
[1351310] 2024/01/03 17:09:33.504127 [INF] -------------------------------------------
[1351310] 2024/01/03 17:09:33.504213 [INF] Starting JetStream cluster
[1351310] 2024/01/03 17:09:33.504217 [INF] Creating JetStream metadata controller
[1351310] 2024/01/03 17:09:33.529539 [INF] JetStream cluster recovering state
[1351310] 2024/01/03 17:09:33.559466 [INF] Listening for client connections on 0.0.0.0:4222
[1351310] 2024/01/03 17:09:33.559678 [INF] Server is ready
[1351310] 2024/01/03 17:09:33.560381 [INF] Cluster name is lax
[1351310] 2024/01/03 17:09:33.560467 [INF] Listening for route connections on 0.0.0.0:6222
[1351310] 2024/01/03 17:09:33.605812 [INF] 10.102.0.5:6222 - rid:8 - Route connection created

Once all the messages are restored, a meta leader is elected. Later, when I run account info:

Account Information

User: xxxx
Account: xxx
Expires: never
Client ID: 380
Client IP: 127.0.0.1
RTT: 174µs
Headers Supported: true
Maximum Payload: 8.0 MiB
Connected Cluster: lax
Connected URL: nats://127.0.0.1:4222
Connected Address: 127.0.0.1:4222
Connected Server ID: xxxx
Connected Server Version: 2.10.3
Connected Server Name: n1-pod-102-worker-2
TLS Connection: no
No response from JetStream server

Could someone please help me with this? We currently have the following, and it will keep increasing:

Streams: 56,905 of Unlimited
Consumers: 40,195 of Unlimited
Further, we keep getting timeouts from JetStream.

Expected behavior

JetStream should be enabled and running.

Server and client version

I am using the binary package to run the NATS servers.

nats-server --version
nats-server: v2.10.3

nats --version
0.1.1

Host environment

Description:    Ubuntu 22.04.3 LTS
Codename:   jammy

RAM: 24 GB (12 GB allocated to nats-server)
CPU(s):                  12

Steps to reproduce

No response

Jarema commented 10 months ago

Thank you for filing the issue and providing the info.

Do I understand correctly that you have over 50k streams and consumers on your cluster? Are those streams and consumers R1 (replicas=1) or R3?

shankarmn94 commented 10 months ago

> Thank you for filing the issue and providing the info.
>
> Do I understand correctly that you have over 50k streams and consumers on your cluster? Are those streams and consumers R1 (replicas=1) or R3?

Yes, we do have 50k streams and consumers, and we are expecting it to reach more than 500k. Replicas is 1. Currently we are getting 'nats: error: context deadline exceeded' when we send a request.

Is there anything wrong with what we are doing or with the configs? What strategy should we adopt?

Jarema commented 10 months ago

May I ask the reason for such a big number of streams, and what the scale of your solution is? One stream can have multiple subjects as part of it, and subjects can contain wildcards. That allows for a much, much smaller number of streams and consumers on the cluster.
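For illustration, a minimal sketch (using the Python nats client; the AGENT_DATA stream name and agent.* subject scheme are made up for this example, not taken from your setup) of one wildcard stream standing in for many per-agent streams, with per-user permissions still able to limit each executable to its own subject:

import asyncio

import nats

async def main():
    nc = await nats.connect("nats://127.0.0.1:4222", user_credentials="admin.creds")
    js = nc.jetstream()

    # One stream captures every agent's subjects via wildcards,
    # instead of creating one stream per executable.
    await js.add_stream(name="AGENT_DATA",
                        subjects=["agent.*.assetinfo", "agent.*.jobdata"])

    # Each executable publishes only to its own subject within the shared stream.
    await js.publish("agent.1234.assetinfo", b"payload")

    await nc.close()

asyncio.run(main())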

shankarmn94 commented 10 months ago

We have a product where we deploy executables on machines which then communicate back to the backend server. All these different executables have to have different permissions.

We need to deploy about 1 million such executables, i.e. 1 million streams/queues. Also, there is multitenancy in this solution.

Will having a smaller number of streams and more queues help with this, in terms of performance and scalability?

wallyqs commented 10 months ago

@shankarmn94 could you share some sample requests that are timing out? Also make sure that you set GOMEMLIMIT to what was allocated to nats on the host, e.g. GOMEMLIMIT=12GiB. Do you use SSDs in this setup? Could you also share the slow_consumer counters from /varz? I see that write_deadline is set to 5 minutes (300s). Do you have a lot of readloop processing warnings in the logs? If so, could you share some of those lines?
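For reference, hedged sketches of both suggestions, reusing the startup command and monitoring port already shown in this issue (the 12GiB value is simply the amount originally allocated to nats-server):

# Start the server with a Go memory limit matching the allocation.
GOMEMLIMIT=12GiB /usr/bin/nats-server -c /etc/nats_server/resolver.conf -js -l /var/log/nats_server/nats.log

# Pull the slow consumer counters from the monitoring endpoint (http_port: 8222).
curl -s http://127.0.0.1:8222/varz | grep slow_consumer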

niknikhil025 commented 10 months ago

@wallyqs

import asyncio
import gzip
import json

import nats

async def send_data():
    # Connect with the same credentials the application uses.
    nc = await nats.connect("nats://127.0.0.1", user_credentials="admin.creds")
    js = nc.jetstream()

    # Subject to publish to; the stream under test listens on this subject.
    subject = "somestream"

    data = {"RandomKey": "ABC"}
    payload = gzip.compress(json.dumps(data).encode("utf-8"))
    for _ in range(100):
        res = await js.publish(subject, payload)
        print("Publish Response", res)

    await nc.close()

if __name__ == "__main__":
    asyncio.run(send_data())

This is what we have been using for testing while the server is under load.

shankarmn94 commented 10 months ago

> @shankarmn94 could you share some sample requests that are timing out? Also make sure that you set GOMEMLIMIT to what was allocated to nats on the host, e.g. GOMEMLIMIT=12GiB. Do you use SSDs in this setup? Could you also share the slow_consumer counters from /varz? I see that write_deadline is set to 5 minutes (300s). Do you have a lot of readloop processing warnings in the logs? If so, could you share some of those lines?

  1. The memory limit was kept at 12 GB; I later increased it to 24 GB due to the timeouts.

  2. NVMe storage.

  3. Readloop processing was going beyond 15 minutes; part of the log:

    [1733421] 2024/01/05 05:00:52.064196 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m4.035186948s
    [1733421] 2024/01/05 05:00:52.064213 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m4.035448362s
    [1733421] 2024/01/05 05:00:52.064221 [WRN] 155.138.199.8:5050 - cid:43020725 - Readloop processing time: 5m4.035521745s
    [1733421] 2024/01/05 05:00:52.092702 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m4.063669819s
    [1733421] 2024/01/05 05:00:52.092740 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m4.063713258s
    [1733421] 2024/01/05 05:00:52.092759 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m4.063937467s
    [1733421] 2024/01/05 05:00:52.092769 [WRN] 155.138.199.8:57400 - cid:43020722 - Readloop processing time: 5m4.063971071s
  4. /varz

    {
    "server_id": "xxxxx",
    "server_name": "n1-coordin1",
    "version": "2.10.3",
    "proto": 1,
    "git_commit": "1528434",
    "go": "go1.21.3",
    "host": "0.0.0.0",
    "port": 4222,
    "auth_required": true,
    "max_connections": 50000000,
    "ping_interval": 120000000000,
    "ping_max": 2,
    "http_host": "0.0.0.0",
    "http_port": 8222,
    "http_base_path": "",
    "https_port": 0,
    "auth_timeout": 2,
    "max_control_line": 4096,
    "max_payload": 8388608,
    "max_pending": 67108864,
    "cluster": {},
    "gateway": {},
    "leaf": {},
    "mqtt": {},
    "websocket": {},
    "jetstream": {
        "config": {
            "max_memory": 24000000000,
            "max_storage": 100000000000,
            "store_dir": "/natsdata/jetstream",
            "sync_interval": 120000000000,
            "compress_ok": true
        },
        "stats": {
            "memory": 0,
            "storage": 80203451597,
            "reserved_memory": 0,
            "reserved_storage": 0,
            "accounts": 1,
            "ha_assets": 0,
            "api": {
                "total": 1600952,
                "errors": 694
            }
        }
    },
    "tls_timeout": 2,
    "write_deadline": 300000000000,
    "start": "2024-01-04T12:18:25.534824007Z",
    "now": "2024-01-05T05:22:36.090021125Z",
    "uptime": "17h4m10s",
    "mem": 8418717696,
    "cores": 16,
    "gomaxprocs": 16,
    "cpu": 52,
    "connections": 13122,
    "total_connections": 44046795,
    "routes": 0,
    "remotes": 0,
    "leafnodes": 0,
    "in_msgs": 8418892,
    "out_msgs": 8770439,
    "in_bytes": 66940447093,
    "out_bytes": 176571876137,
    "slow_consumers": 10,
    "subscriptions": 114177,
    "http_req_stats": {
        "/": 1,
        "/varz": 2
    },
    "config_load_time": "2024-01-04T12:18:25.534824007Z",
    "trusted_operators_jwt": [
        "x.y.z-a"
    ],
    "trusted_operators_claim": [
        {
            "jti": "xyz",
            "iat": 1702645766,
            "iss": "xyz",
            "name": "xyz",
            "sub": "xyz",
            "nats": {
                "operator_service_urls": [
                    "nats://localhost:4222",
                    "nats://0.0.0.0:4222",
                    "nats://10.101.0.4:4222",
                    "nats://10.101.0.5:4222",
                    "nats://10.101.0.6:4222",
                    "nats://10.101.0.3:4222",                  
                ],
                "system_account": "xxxxx",
                "type": "operator",
                "version": 2
            }
        }
    ],
    "system_account": "xxxx",
    "ocsp_peer_cache": {},
    "slow_consumer_stats": {
        "clients": 10,
        "routes": 0,
        "gateways": 0,
        "leafs": 0
    }
    }

wallyqs commented 10 months ago

Thanks. I see that there are slow consumers even under a write deadline of 5 minutes, so I think that would be correlated with some of the warnings reported in the logs, e.g. cid:43020725 - Readloop processing time: 5m4.035521745s.

shankarmn94 commented 10 months ago

> Thanks. I see that there are slow consumers even under a write deadline of 5 minutes, so I think that would be correlated with some of the warnings reported in the logs, e.g. cid:43020725 - Readloop processing time: 5m4.035521745s.

How can we reduce these warnings and scale to a higher extent?

shankarmn94 commented 10 months ago

We are unable to delete the streams because of this. The NATS data directory has a per-stream directory layout for JetStream; if I delete a stream's folder on disk, will that remove the stream without affecting the running nats-server?

/jetstream/XXIASUHIAD/streams/{ streams }

shankarmn94 commented 9 months ago

[3449495] 2024/01/11 15:47:26.529650 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m22.022350708s
[3449495] 2024/01/11 15:47:26.531816 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m22.024523971s
[3449495] 2024/01/11 15:47:26.531842 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m22.024589769s
[3449495] 2024/01/11 15:47:26.552009 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m22.044614631s
[3449495] 2024/01/11 15:47:26.552092 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m22.04472295s
[3449495] 2024/01/11 15:47:26.552299 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 5m22.044846542s

Can I know why exactly this happens? @wallyqs @Jarema @krobertson

wallyqs commented 9 months ago

@shankarmn94 the write deadline you are using is far too long for the cluster to stay healthy, I'd say; it allows the readloop to block for that long, so the internal JetStream callbacks are being blocked for 5 minutes, which is also your configured write_deadline. I'd recommend setting it to a lower value (the default is 10s). Coincidentally, we made some improvements around this in v2.10.8 to avoid blocking in certain areas, which you could also try first.
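As a sketch, the corresponding change to the server config posted at the top of this issue would be along these lines (10s is the default mentioned above; tune to your clients):

# Bound how long the server blocks on a slow client write before
# treating the connection as a slow consumer (default 10s).
write_deadline: "10s"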

shankarmn94 commented 9 months ago

> @shankarmn94 the write deadline you are using is far too long for the cluster to stay healthy, I'd say; it allows the readloop to block for that long, so the internal JetStream callbacks are being blocked for 5 minutes, which is also your configured write_deadline. I'd recommend setting it to a lower value (the default is 10s). Coincidentally, we made some improvements around this in v2.10.8 to avoid blocking in certain areas, which you could also try first.

@wallyqs

We are running v2.10.3 in production. If I just swap the nats-server binary for v2.10.8, will it disturb the cluster? And since the version changes, will a folder with a different ID cause any problems?

I was running the cluster earlier with more than 60k streams created. Because of this error we stopped the cluster, ran a standalone server, cleared all the streams, got down to about 3k streams, and created subjects and subscriptions instead. Standalone, it was a bit better optimised.

So we brought the other nodes back up as a cluster, and we ran into the same issue again: the 3k streams grew back to 60k and JetStream is not initializing on the cluster.

Can you please suggest how to optimize this and remove the streams?

derekcollison commented 9 months ago

Please move to 2.10.9; we had a quick follow-up release for an issue we detected late, after we had already tagged 2.10.8.

shankarmn94 commented 9 months ago

@derekcollison I had a doubt...

I am using the binary package here. If I just replace the nats-server binary, will that work, or is there anything else I need to follow?

derekcollison commented 9 months ago

Yes it should just work!

shankarmn94 commented 9 months ago

[3901845] 2024/01/11 19:29:10.204129 [INF] Self is new JetStream cluster metadata leader
[3901845] 2024/01/11 19:29:10.355394 [INF] JetStream cluster no metadata leader
[3901845] 2024/01/11 19:29:12.460864 [ERR] 206.127.8.66:51810 - cid:602880 - Publish Violation - JWT User "UDBLGS3Q2DJGTM2O2IVKRFNIY5OL2V44MAC2Z5NBTX6XFJVSY4QNF7KI", Subject "queue_adaudit"
[3901845] 2024/01/11 19:29:13.789127 [ERR] 50.209.91.165:19591 - cid:603287 - Publish Violation - JWT User "UCZQDLYJNV5VKQKJ2LCEOBBNMN76WPMNEADIU2R7WGRLZ54ZNHMMFXCU", Subject "queue_adaudit"
[3901845] 2024/01/11 19:29:14.288962 [WRN] RAFT [vkCvov6n - meta] Request from follower for entry at index [10102] errored for state {Msgs:811 Bytes:86699925 FirstSeq:8941 FirstTime:2024-01-11 19:23:05.115863327 +0000 UTC LastSeq:13327 LastTime:2024-01-11 19:29:14.284155299 +0000 UTC NumSubjects:0 Subjects:map[] NumDeleted:3576 Deleted:[] Lost: Consumers:0} - no message found
[3901845] 2024/01/11 19:29:14.289020 [WRN] RAFT [vkCvov6n - meta] Could not find a starting entry for catchup request: no message found
[3901845] 2024/01/11 19:29:14.426006 [INF] Self is new JetStream cluster metadata leader
[3901845] 2024/01/11 19:29:14.573051 [INF] JetStream cluster no metadata leader

@derekcollison I am now continuously getting "JetStream cluster no metadata leader", and the metadata leader keeps changing across the 4 nodes.

wallyqs commented 9 months ago

@shankarmn94 do you have some logs showing how long it took the system to restart and recover the streams? Do you still see readloop processing errors?

derekcollison commented 9 months ago

What does nats server report jetstream show when run as the system user?

shankarmn94 commented 9 months ago

> @shankarmn94 do you have some logs showing how long it took the system to restart and recover the streams? Do you still see readloop processing errors?

@wallyqs nearly 1 hour and 30 mins,

Yes, I still see readloop warnings. Sometimes I also get [ERR] JetStream resource limits exceeded for server. What is the limit on the number of streams?

shankarmn94 commented 9 months ago

[3846184] 2024/01/12 07:58:06.610549 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.12294832s
[3846184] 2024/01/12 07:58:06.615553 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.127183807s
[3846184] 2024/01/12 07:58:06.615599 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.127234999s
[3846184] 2024/01/12 07:58:06.615627 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.128004851s
[3846184] 2024/01/12 07:58:06.620636 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.132252151s
[3846184] 2024/01/12 07:58:06.620685 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.132304514s
[3846184] 2024/01/12 07:58:06.620733 [WRN] Internal subscription on "$JS.API.STREAM.NAMES" took too long: 6m49.132849962s

How can I avoid this?

shankarmn94 commented 9 months ago

[610014] 2024/01/12 16:12:13.587715 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_2' in 2ms
[610014] 2024/01/12 16:12:13.587828 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_3'
[610014] 2024/01/12 16:12:13.589479 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_3' in 2ms
[610014] 2024/01/12 16:12:13.589597 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_4'
[610014] 2024/01/12 16:12:13.590799 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_4' in 1ms
[610014] 2024/01/12 16:12:13.590897 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_5'
[610014] 2024/01/12 16:12:13.592753 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_5' in 2ms
[610014] 2024/01/12 16:12:13.592888 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_6'
[610014] 2024/01/12 16:12:13.595048 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_6' in 2ms
[610014] 2024/01/12 16:12:13.595164 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_7'
[610014] 2024/01/12 16:12:13.597254 [INF] Restored 3,991 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_7' in 2ms
[610014] 2024/01/12 16:12:13.597368 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_8'
[610014] 2024/01/12 16:12:13.599201 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_8' in 2ms
[610014] 2024/01/12 16:12:13.599296 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_9'
[610014] 2024/01/12 16:12:13.601287 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > software_checker_new_9' in 2ms
[610014] 2024/01/12 16:12:13.601379 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync0'
[610014] 2024/01/12 16:12:13.603161 [INF] Restored 1,291,534 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync0' in 2ms
[610014] 2024/01/12 16:12:13.603262 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync1'
[610014] 2024/01/12 16:12:13.604970 [INF] Restored 886,427 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync1' in 2ms
[610014] 2024/01/12 16:12:13.605466 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync2'
[610014] 2024/01/12 16:12:13.607355 [INF] Restored 620,322 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync2' in 2ms
[610014] 2024/01/12 16:12:13.607463 [INF] Starting restore for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync3'
[610014] 2024/01/12 16:12:13.609105 [INF] Restored 0 messages for stream 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > tenant_sync3' in 2ms
[610014] 2024/01/12 16:12:13.609180 [INF] Recovering 2 consumers for stream - 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > agent_assetinfo'
[610014] 2024/01/12 16:12:13.609779 [WRN] Consumer 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > agent_assetinfo > agent_assetinfo_consumer' error on write store state from check redelivered: bad pending entry, sequence [17199533] out of range
[610014] 2024/01/12 16:12:13.609839 [WRN] Consumer 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > agent_assetinfo > agent_assetinfo_consumer' error on write store state from check redelivered: bad pending entry, sequence [17199312] out of range
[610014] 2024/01/12 16:12:13.610070 [WRN] Consumer 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > agent_assetinfo > agent_assetinfo_consumer' error on write store state from check redelivered: bad pending entry, sequence [17197095] out of range
[610014] 2024/01/12 16:12:13.610751 [INF] Recovering 1 consumers for stream - 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > agent_jobdata'
[610014] 2024/01/12 16:12:13.741538 [INF] Recovering 122842 consumers for stream - 'ACPA6I2TJJQHUXMH6NPOF3QX3ERUL4XJ7XVH4Q5UXPACXFNMI52WI6K3 > agent_stream'

I get these now after restarting the nats-server, and it is not initializing.

@derekcollison @wallyqs @Jarema

derekcollison commented 9 months ago

Have you upgraded your system to 2.10.9?

Have you considered a commercial agreement for support with Synadia?

shankarmn94 commented 9 months ago

@derekcollison @Jarema

Yes, I upgraded. Recovering these consumers takes very long. Why? What is it trying to do?

Also, is there any chance that nats-server restarts automatically? This has happened two times when we did not restart it ourselves.

Jarema commented 9 months ago

There is too little information about your system, architecture, and available resources to answer your questions about a high-load setup. The NATS Server will not restart on its own; that depends on your process/container orchestration.

shankarmn94 commented 9 months ago

@Jarema

RAM: 56 GB

CPU:

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           40 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  16
On-line CPU(s) list:     0-15
Vendor ID:               GenuineIntel
Model name:              Intel Core Processor (Skylake, IBRS)
CPU family:              6
Model:                   94
Thread(s) per core:      2
Core(s) per socket:      8
Socket(s):               1

derekcollison commented 8 months ago

Any updates on this issue?

shankarmn94 commented 8 months ago

I updated to nats-server v2.10.9 a month ago, and I am seeing this happen less often now.

Any idea what happens in between if I use a load balancer?

derekcollison commented 8 months ago

We generally do not recommend having a load balancer between clients and NATS servers.

Also note that the latest patch release is 2.10.11, and we recommend everyone update to the latest.

shankarmn94 commented 8 months ago

@derekcollison is there a way I can take a backup of the nsc users? Since we are using nsc keys, I need to keep a backup of these users.

derekcollison commented 8 months ago

Those are not kept within the server, they are actually never even sent to the server (the private keys). They are stored locally on the machine issuing nsc commands.

Looping in @aricart to assist.

aricart commented 8 months ago

@shankarmn94 nsc users are not sent nor stored in the server. They are all local to your nsc environment. You can see where assets are stored by doing:

nsc env
+----------------------------------------------------------------------------------------------------------------+
|                                                NSC Environment                                                 |
+--------------------------+-----+-------------------------------------------------------------------------------+
| Setting                  | Set | Effective Value                                                               |
+--------------------------+-----+-------------------------------------------------------------------------------+
| $XDG_CONFIG_HOME         | No  |                                                                               |
| $XDG_DATA_HOME           | No  |                                                                               |
| $NSC_CWD_ONLY            | No  | If set, default operator/account from cwd only                                |
| $NSC_NO_GIT_IGNORE       | No  | If set, no .gitignore files written                                           |
| $NKEYS_PATH (deprecated) | No  | ~/.local/share/nats/nsc/keys                                                  |
| $NSC_HOME (deprecated)   | No  | ~/.config/nats/nsc                                                            |
| $NATS_CA                 | No  | If set, root CAs in the referenced file will be used for nats connections     |
|                          |     | If not set, will default to the system trust store                            |
| $NATS_KEY                | No  | If set, the tls key in the referenced file will be used for nats connections  |
| $NATS_CERT               | No  | If set, the tls cert in the referenced file will be used for nats connections |
+--------------------------+-----+-------------------------------------------------------------------------------+
| From CWD                 |     | No                                                                            |
| Default Stores Dir       |     | ~/.local/share/nats/nsc/stores                                                |
| Current Store Dir        |     | ~/.local/share/nats/nsc/stores                                                |
| Current Operator         |     | O                                                                             |
| Current Account          |     | sys                                                                           |
| Root CAs to trust        |     | Default: System Trust Store                                                   |
+--------------------------+-----+-------------------------------------------------------------------------------+

In the above example keys and creds are stored in ~/.local/share/nats/nsc/keys, and operator and account configurations are in ~/.local/share/nats/nsc/stores. Those are the directories to back up:

/Users/aricart/.local/share/nats/nsc/keys
├── creds
│   ├── O
│   │   └── A
│   │       ├── U.creds
│   │       └── a.creds
│   ├── OO
│   │   └── A
│   │       ├── f.creds
│   │       └── foobar.creds
│   ├── OOOO
│   │   └── SYS
│   │       └── a.creds
│   └── oooooo
│       └── A
│           ├── U.creds
│           ├── test.creds
│           └── xxx.creds
└── keys
    ├── A
    │   ├── AL
    │   │   └── AALLRAZVBOIV6ASHLW3ETZINC742BNEVTSBAI6QXK34C2CE5FV2L7YUO.nk
    │   ├── AR
    │   │   └── AAR42PGSDTTBAPHWJPL7QVHCNCYJGDJ2K6VDFRF4SVUIKVUWBNIINWX6.nk
    │   ├── AT

Note that all user keys will be stored in the /Users/aricart/.local/share/nats/nsc/keys/keys/U subtree.

aricart commented 8 months ago

To load all those keys and creds into a new environment you can try:

nsc fix -h

shankarmn94 commented 8 months ago

How can I take a backup of the nsc users (keys, user tokens, creds and all)? @aricart

aricart commented 8 months ago

@shankarmn94 Just backup the ~/.local/share/nats/nsc/keys and ~/.local/share/nats/nsc/stores directories.
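A minimal sketch of such a backup, assuming a plain tar archive in the current directory is acceptable (the archive name is arbitrary):

# Archive the nsc key store and the operator/account store.
tar czf nsc-backup.tar.gz \
    ~/.local/share/nats/nsc/keys \
    ~/.local/share/nats/nsc/stores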

shankarmn94 commented 7 months ago

Can I know why I am getting context deadline timeouts for specific streams? Also, when I do a stream ls, it lists something like an inaccessible streams section. What does that tell me? @derekcollison

shankarmn94 commented 7 months ago

nats consumer report appl
? Select a Stream application_patching
nats: error: context deadline exceeded

nats consumer report agent_assetinfo
? Select a Stream agent_assetinfo
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Consumer report for agent_assetinfo with 3 consumers                                                                                  │
├──────────────────────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬───────────┬──────────────────────┤
│ Consumer                 │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor │ Cluster              │
├──────────────────────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼───────────┼──────────────────────┤
│ agent_assetinfo_consumer │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 965,553   │ n1-pod-500-worker-3  │
│ golangnats               │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 965,553   │ n1-pod-500-worker-3  │
│ security_report_card     │ Pull │ Explicit   │ 30.00s   │ 15          │ 0           │ 0           │ 965,538   │ n1-pod-500-worker-3* │
╰──────────────────────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴───────────┴──────────────────────╯