Closed cjbottaro closed 3 years ago
Some more info that might help track this down...
In 2.2.1, our stream seemed to be fine for a while, but then we'd start seeing this in the log...
[WRN] JetStream resource limits exceeded for account: "$G"
Then the NATS servers would stop responding to any request regarding the stream. For example, this request:
PUB $JS.API.STREAM.UPDATE.change-stream _INBOX.e14c4300.c9381997 332\r\n{\"discard\":\"old\",\"duplicate_window\":120000000000,\"host\":\"nats\",\"max_age\":86400000000000,\"max_bytes\":-1,\"max_consumers\":-1,\"max_msg_size\":-1,\"max_msgs\":-1,\"module\":\"Elixir.ChangeStream\",\"name\":\"change-stream\",\"num_replicas\":3,\"otp_app\":\"change_stream\",\"port\":4222,\"retention\":\"limits\",\"storage\":\"file\",\"subjects\":[\"change-stream.*\"]}\r\n
would simply never get a response. Even simpler requests, like to `$JS.API.STREAM.INFO`, went unanswered. We eventually just deleted and recreated the stream and it started working again.
The strange thing is that in 2.2.1, all the Grafana graphs looked completely healthy.
What is the configuration setup for the cluster and servers?
What does `nats s report` show?
What does `nats s server list <N> --user=system` show?
What happened just after 12 to server memory? What event was that?
We have a 3-node NATS server cluster set up via a Kubernetes StatefulSet. All three nodes are started like:
nats-server
-n $(SERVER_NAME)
--cluster_name nats-production
-m 8222
-js
-sd /jetstream-storage
--cluster nats://0.0.0.0:6222
--routes nats://nats-0.nats.production.svc.cluster.local:6222,nats://nats-1.nats.production.svc.cluster.local:6222,nats://nats-2.nats.production.svc.cluster.local:6222
Where `$SERVER_NAME` is one of `nats-0`, `nats-1`, `nats-2`.
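For reference, the same startup flags could also be expressed as a server config file (a sketch assuming the same ports and paths; `server_name` would still need to be set per pod):

```
server_name: nats-0   # one of nats-0 / nats-1 / nats-2, set per pod
http: 8222            # monitoring port, equivalent to -m 8222

jetstream {
  store_dir: "/jetstream-storage"
}

cluster {
  name: nats-production
  listen: 0.0.0.0:6222
  routes = [
    nats://nats-0.nats.production.svc.cluster.local:6222
    nats://nats-1.nats.production.svc.cluster.local:6222
    nats://nats-2.nats.production.svc.cluster.local:6222
  ]
}
```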
$ nats -s nats s report
Obtaining Stream stats
+------------------------------------------------------------------------------------------------------+
| Stream Report |
+---------------+---------+-----------+-----------+---------+------+---------+-------------------------+
| Stream | Storage | Consumers | Messages | Bytes | Lost | Deleted | Replicas |
+---------------+---------+-----------+-----------+---------+------+---------+-------------------------+
| change-stream | File | 0 | 1,064,200 | 4.1 GiB | 0 | 28 | nats-0, nats-1*, nats-2 |
+---------------+---------+-----------+-----------+---------+------+---------+-------------------------+
What does `nats s server list --user=system` show?
I can't figure out how to run that command. The subcommand `server` doesn't appear in `nats stream --help` (using the synadia/nats-box Docker image).
What happened just after 12 to server memory? What event was that?
Nothing that we can tell. None of the NATS nodes have been restarted in over 24 hours (long before that sharp drop in memory usage). The only things that use NATS are a few processes that read from PostgreSQL CDC and write the changes to the JetStream stream. Those processes have been running for 6 days now. They automatically reconnected after we deployed 2.2.2 and, according to our logs, are just happily chugging along...
01:12:33.909 [info] Published 4527 bytes to change-stream.postgres in 2ms
01:12:50.111 [info] Published 4553 bytes to change-stream.postgres in 2ms
01:13:32.890 [info] Published 4536 bytes to change-stream.postgres in 2ms
01:13:39.306 [info] Published 1075 bytes to change-stream.postgres in 1ms
01:13:39.308 [info] Published 1394 bytes to change-stream.postgres in 1ms
01:13:48.575 [info] Published 4578 bytes to change-stream.postgres in 2ms
...
If I make a request to `$JS.API.STREAM.INFO.change-stream`, here's the response I get:
%{
bytes: 727,
payload: %{
"cluster" => %{
"leader" => "nats-1",
"name" => "nats-production",
"replicas" => [
%{"active" => 6123810, "current" => true, "name" => "nats-2"},
%{"active" => 6144073, "current" => true, "name" => "nats-0"}
]
},
"config" => %{
"discard" => "old",
"duplicate_window" => 120000000000,
"max_age" => 86400000000000,
"max_bytes" => -1,
"max_consumers" => -1,
"max_msg_size" => -1,
"max_msgs" => -1,
"name" => "change-stream",
"num_replicas" => 3,
"retention" => "limits",
"storage" => "file",
"subjects" => ["change-stream.*"]
},
"created" => "0001-01-01T00:00:00Z",
"state" => %{
"bytes" => 3855794230,
"consumer_count" => 0,
"first_seq" => 10927198,
"first_ts" => "2021-05-01T01:18:59.086770855Z",
"last_seq" => 11926984,
"last_ts" => "2021-05-02T01:18:59.077810519Z",
"messages" => 999759,
"num_deleted" => 28
},
"type" => "io.nats.jetstream.api.v1.stream_info_response"
},
reply_to: nil,
sid: "1",
subject: "_INBOX.fb940b53.adae713e"
}
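As an aside, the duration fields in that response are in nanoseconds; converting them (plain Python, just unit conversion) confirms the 2-minute duplicate window and 1-day max age we configured:

```python
NS_PER_SEC = 1_000_000_000

duplicate_window_ns = 120_000_000_000   # "duplicate_window" from the response
max_age_ns = 86_400_000_000_000         # "max_age" from the response

print(duplicate_window_ns / NS_PER_SEC / 60)   # 2.0 (minutes)
print(max_age_ns / NS_PER_SEC / 3600)          # 24.0 (hours)
```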
Hope that helps and let me know what other info I can provide. Thanks!
Thanks, much appreciated.
Do you have monitoring on for the cluster? I am interested in seeing the output of /varz and /subsz?subs=1 for each server if possible. And if any of them seem to be using more cpu or memory, the /stacksz for those as well.
Could also jump on a Zoom or GH tomorrow.
Yes we have monitoring enabled. I can help out with requested info. 😄
Attached is the requested output for each node: `/varz`, `/subsz?subs=1`, and `/stacksz`. Happy to provide any additional details. nats-monitoring-output.txt
Going to close now but please let me know if you are having any other issues.
Hey @derekcollison, since upgrading to 2.2.3 our JetStream cluster is more stable in the sense that it responds to all requests, but it thinks it's out of resources.
Server logs are filled with:
[1] 2021/05/24 18:40:08.165541 [WRN] JetStream resource limits exceeded for account: "$G"
`/varz` reports usage over quota:
"jetstream": {
"config": {
"max_memory": 134217728,
"max_storage": 8589934592,
"store_dir": "/jetstream-storage/jetstream"
},
"stats": {
"memory": 0,
"storage": 8590556028,
"accounts": 1,
"api": {
"total": 57,
"errors": 1
}
}
}
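For what it's worth, the reported usage is only just over the configured limit (a quick check in Python with the numbers above):

```python
max_storage = 8_589_934_592   # config.max_storage: 8 GiB
storage     = 8_590_556_028   # stats.storage as reported by /varz

print(storage - max_storage)  # 621436 bytes, i.e. barely 0.6 MB over the limit
```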
But `nats s report` shows only 2.7 GB used for the stream:
# nats -s nats s info change-stream
Information for Stream change-stream created 2021-05-11T19:10:43Z
Configuration:
Subjects: change-stream.*
Acknowledgements: true
Retention: File - Limits
Replicas: 3
Discard Policy: Old
Duplicate Window: 2m0s
Maximum Messages: unlimited
Maximum Bytes: unlimited
Maximum Age: 1d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-production
Leader: nats-0
Replica: nats-2, current, seen 0.37s ago
Replica: nats-1, current, seen 0.37s ago
State:
Messages: 473,766
Bytes: 2.7 GiB
FirstSeq: 9,850,888 @ 2021-05-23T19:13:44 UTC
LastSeq: 10,324,760 @ 2021-05-24T08:52:27 UTC
Deleted Messages: 107
Active Consumers: 0
Our nats config looks like:
accounts { $SYS { users = [ { user: "admin", pass: "******" } ] } }
jetstream {
max_memory_store: 134217728
max_file_store: 8589934592
store_dir: "/jetstream-storage"
}
If we look at volume usage in AWS, it also shows what `nats s report` shows: about 2.7 GB used.
Thanks for the help!
You have 3 replicas, so 3 x 2.7GB I believe, plus some overhead for other stuff.
What does the following show on any one of the servers?
du -h /jetstream-storage/jetstream
$ kubectl exec nats-0 -- du -h /jetstream-storage/jetstream
4.0K /jetstream-storage/jetstream/$G/streams/change-stream/obs
2.7G /jetstream-storage/jetstream/$G/streams/change-stream/msgs
2.7G /jetstream-storage/jetstream/$G/streams/change-stream
2.7G /jetstream-storage/jetstream/$G/streams
2.7G /jetstream-storage/jetstream/$G
4.0K /jetstream-storage/jetstream/$SYS/_js_/S-R3F-hOvBLcBq/obs
8.0K /jetstream-storage/jetstream/$SYS/_js_/S-R3F-hOvBLcBq/snapshots
2.3M /jetstream-storage/jetstream/$SYS/_js_/S-R3F-hOvBLcBq/msgs
2.4M /jetstream-storage/jetstream/$SYS/_js_/S-R3F-hOvBLcBq
4.0K /jetstream-storage/jetstream/$SYS/_js_/_meta_/obs
8.0K /jetstream-storage/jetstream/$SYS/_js_/_meta_/snapshots
20.0K /jetstream-storage/jetstream/$SYS/_js_/_meta_/msgs
52.0K /jetstream-storage/jetstream/$SYS/_js_/_meta_
2.4M /jetstream-storage/jetstream/$SYS/_js_
2.4M /jetstream-storage/jetstream/$SYS
2.7G /jetstream-storage/jetstream
Interesting, I assumed that setting meant something like "logical size of a single stream". Does `max_memory_store` also have to take into account all replicas?
Also, am I wrong in thinking these settings apply to single streams? In other words, if I set `max_file_store` to 10 GB, is it OK to have two streams that are 10 GB each? Or do all the streams together have to be no more than 10 GB?
Thanks!
If the setting is on the server JetStream block, it applies to raw resources used by the server.
If in the account, it means total usage, so the $G account would need 2.7 x 3, i.e. 8.1GB; I would round up to 10GB for that account.
And pulling varz from same server says you are using way more than 2.7GB?
ok checking locally now on varz myself and see something similar. Let me dig in.
The JetStream usage underneath of a server's varz was reporting total system usage incorrectly.
The fix you made seems to only affect reporting (the /varz endpoint)? I'm still unsure why we're getting "resource limits exceeded for account" when trying to make publish requests to the stream.
If the setting is on the server JetStream block it applies to raw resources used by the server.
Our config seems to be setting 8gb at the server level:
accounts { $SYS { users = [ { user: "admin", pass: "******" } ] } }
jetstream {
max_memory_store: 134217728
max_file_store: 8589934592
store_dir: "/jetstream-storage"
}
So if I'm understanding that correctly, that means in a 3-node cluster a stream cannot exceed 24 GB?
As opposed to setting `max_file_store` at the account level, which says "a stream cannot exceed 8 GB, no matter how many nodes are in the cluster."
This is assuming stream size is defined as the sum of all its replicas.
So I'm still not sure why we're getting these errors. Is there documentation or any example config files to look at? All I could find was this one: https://github.com/nats-io/nats-server/blob/master/server/configs/js-op.conf
Thanks again.
Yes the change was around varz's JetStream subsection reporting global values for usage as monitored by all servers, not just for that server.
In your case above, 3x8GB of storage means you have 24GB to use anyway you want. If you have a 3x replicated stream that means that stream can house about 8GB since it will be using 24GB once replicated.
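That capacity math can be sketched as follows (plain Python, just arithmetic; assumes uniform per-server limits):

```python
servers = 3
per_server_file_store = 8 * 1024**3   # each server's max_file_store: 8 GiB

# Total raw storage available across the cluster
cluster_capacity = servers * per_server_file_store
print(cluster_capacity / 1024**3)     # 24.0 (GiB)

# An R3 stream writes every byte three times, once per replica, so its
# effective ceiling is the cluster capacity divided by the replica count
replicas = 3
max_r3_stream = cluster_capacity // replicas
print(max_r3_stream / 1024**3)        # 8.0 (GiB)
```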
In your setup, what does `nats account info` report?
Connection Information:
Client ID: 74
Client IP: 10.1.21.149
RTT: 533.533µs
Headers Supported: true
Maximum Payload: 1.0 MiB
Connected Cluster: nats-production
Connected URL: nats://nats:4222
Connected Address: 10.1.62.38:4222
Connected Server ID: NDX5ZSCHLP5YNDKOXENTOQKMNW4JGCZNQ7VGHRHHITDWOHARBNEMGQRC
Connected Server Name: nats-0
JetStream Account Information:
Memory: 0 B of 128 MiB
Storage: 8.0 GiB of 8.0 GiB
Streams: 1 of Unlimited
Consumers: 0 of Unlimited
Connection Information:
Client ID: 41
Client IP: 10.1.21.149
RTT: 411.379µs
Headers Supported: true
Maximum Payload: 1.0 MiB
Connected Cluster: nats-production
Connected URL: nats://nats:4222
Connected Address: 10.1.57.83:4222
Connected Server ID: NDYHV6IKT66BXM6IFLDFV6B2PYSQ7DS46YB5FG6OMD5F4LJDEWIWTIRS
Connected Server Name: nats-1
JetStream Account Information:
Memory: 0 B of 128 MiB
Storage: 8.0 GiB of 8.0 GiB
Streams: 1 of Unlimited
Consumers: 0 of Unlimited
Connection Information:
Client ID: 49
Client IP: 10.1.21.149
RTT: 472.198µs
Headers Supported: true
Maximum Payload: 1.0 MiB
Connected Cluster: nats-production
Connected URL: nats://nats:4222
Connected Address: 10.1.55.154:4222
Connected Server ID: NABQ5M7XKG3DG54RZFCSWOFD6DUUXQAAELL3VRSA7UX7E44LZVBPQ6M3
Connected Server Name: nats-2
JetStream Account Information:
Memory: 0 B of 128 MiB
Storage: 8.0 GiB of 8.0 GiB
Streams: 1 of Unlimited
Consumers: 0 of Unlimited
So our 3-node cluster has 24 GB to work with, but according to each node, they are full.
But according to `du`, each node is only using 2.7 GB of disk.
And according to `nats s report` and `nats s info`, the stream is only 2.7 GB.
Something's not adding up, yes?
You have the main account limited to use 8GB system wide, regardless of the underlying NATS system resources. Change that to something higher since that accounts for all storage including replicas for that account.
So for 2.7GB stream but with Replicas = 3 that is just over 8GB and you have hit the limit for that account.
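The arithmetic behind that (a quick check in Python):

```python
stream_gib = 2.7          # stream size reported by `nats s info`
replicas = 3              # num_replicas on the stream
account_limit_gib = 8.0   # max_file_store, which caps the $G account here

# Account usage counts every replica's copy of the data
usage_gib = stream_gib * replicas
print(round(usage_gib, 1))            # 8.1
print(usage_gib > account_limit_gib)  # True -> "resource limits exceeded"
```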
How do we do that? We couldn't find any config-file info in the JetStream documentation. Here's our config now:
accounts { $SYS { users = [ { user: "admin", pass: "******" } ] } }
jetstream {
max_memory_store: 134217728
max_file_store: 8589934592
store_dir: "/jetstream-storage"
}
Thanks!
If you want to simulate the global account with no credentials etc. but do want the jetstream block, you can do the following.
listen: HOST:4222
server_name: "S-1"
cluster {
name: "MY_CLUSTER"
listen: HOST:6222
routes = []
}
jetstream: {max_mem_store: 64GB, max_file_store: 10TB}
accounts: {
GLOBAL {
jetstream: {max_mem: 1GB, max_store: 1TB, max_streams: 10, max_consumers: 1k}
users: [ {user: default_user, password: pass} ]
},
$SYS { users = [ { user: "admin", pass: "s3cr3t!" } ] }
}
no_auth_user: default_user
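Adapted to the cluster in this thread, that might look like the following (a sketch; `default_user`/`pass` are placeholder credentials, and the 10GB account limit follows the earlier suggestion to round up above 2.7 GiB x 3):

```
jetstream {
  max_memory_store: 134217728   # 128 MiB per server (unchanged)
  max_file_store: 8589934592    # 8 GiB per server (unchanged; raw usage is ~2.7 GiB)
  store_dir: "/jetstream-storage"
}
accounts {
  GLOBAL {
    # account-level limit counts all replicas: 2.7 GiB x 3 is ~8.1 GiB,
    # so 10GB leaves headroom
    jetstream: {max_mem: 128MB, max_store: 10GB}
    users: [ {user: default_user, password: pass} ]
  }
  $SYS { users = [ { user: "admin", pass: "******" } ] }
}
no_auth_user: default_user
```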
We started seeing strange resource usage patterns after updating to 2.2.2. Our application code hasn't changed at all, and is very simple: it just creates a JetStream stream and publishes messages to it (via new-style requests). There are no consumers at all.
The memory usage of one of the servers kept growing despite it having zero connections. Also, it looks like the subscription count is growing indefinitely, though that might be a reporting error.
After upgrading to 2.2.2, you can see the subscriptions keep growing indefinitely, but we figured that was just a reporting issue due to metric names being changed, i.e. `gnatsd_varz_subscriptions` vs `gnatsd_subsz_num_subscriptions` (the latter doesn't grow indefinitely, but holds constant). But then seeing memory usage grow on one server, seemingly indefinitely until a sudden drop, caused us to open this issue. Something seems strange.
Also, all the server logs show this error happening a lot:
We have very few connections to the NATS cluster (as seen in the graphs). Each connection makes a subscription to `_INBOX.<inbox-uid>.*` and makes requests using `reply_to`s like `_INBOX.<inbox-uid>.<request-uid>`. Lastly, the data volumes attached to each node are only at 20% capacity. How can we debug this further? Thanks for the help.