nats cluster with jetstreams on hardware with slow disk is prone to OOM

tvojacek commented 1 year ago

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

[ ] Included nats-server -DV output [142] 2022/10/03 20:19:39.311095 [INF] Starting nats-server [142] 2022/10/03 20:19:39.311121 [INF] Version: 2.9.2 [142] 2022/10/03 20:19:39.311124 [INF] Git: [6d81dde] [142] 2022/10/03 20:19:39.311127 [DBG] Go build: go1.19.1 [142] 2022/10/03 20:19:39.311130 [INF] Name: NDEWXKWIMBLGMGDSTVGK4CA7RE2RX76SHFFQPRX3AMQK57KAUJ2COX5K [142] 2022/10/03 20:19:39.311135 [INF] ID: NDEWXKWIMBLGMGDSTVGK4CA7RE2RX76SHFFQPRX3AMQK57KAUJ2COX5K [142] 2022/10/03 20:19:39.311161 [DBG] Created system account: "$SYS"

Versions of `nats-server` and affected client libraries used:

2.8.4, 2.9.0, 2.9.2

OS/Container environment:

helm 0.18.0 + image 2.9.2-apline 3x aws 16GB instance, 12GB RAM limit

$kubectl get pods
NAME                                           READY   STATUS    RESTARTS   AGE
nats-0                     2/3     Running   12         63m
nats-1                     2/3     Running   9          65m
nats-2                     3/3     Running   2          80m
nats-box-dc8c89664-b8l5q   1/1     Running   0          10d

Steps or code to reproduce the issue:

start cluster,
create jetstream raw-data,
create 2x consumers jetstream config and consumers are in attachment
Expected result:

cluster start downloading messages from leafnodes jetstreams, at network speed until jetstream max_bytes is hit (1GiB) both consumer consume messages as fast as they can (one is much slower than other one) consumers consume messages from jetstream. jetstream sources lag goes to zero

Actual result:

after while,

nodes restart multiple time due to OOM,
consumers stops recieving messages
jetstream system is broken
nats consumer report SERVER_DATA repond with nats: error: JetStream system temporarily unavailable (10008), try --help

logs.zip

kozlovic commented 1 year ago

@TomasVojacek Thank you for the report. I see this worrisome trace in log1.log:

[45] 2022/10/03 19:57:00.185262 [WRN] Waiting for routing to be established...
[45] 2022/10/03 19:57:00.420945 [INF] 172.30.19.253:43932 - rid:15 - Route connection created
[45] 2022/10/03 19:57:00.421298 [INF] 172.30.19.253:43932 - rid:15 - Router connection closed: Duplicate Route
[45] 2022/10/03 19:57:00.429257 [INF] 172.30.41.131:37134 - rid:16 - Route connection created
[45] 2022/10/03 19:57:00.429613 [INF] 172.30.41.131:37134 - rid:16 - Router connection closed: Duplicate Route
[45] 2022/10/03 19:59:00.751026 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[45] 2022/10/03 19:59:00.751542 [WRN] 172.30.41.131:6222 - rid:14 - Readloop processing time: 2m0.329090184s
[45] 2022/10/03 19:59:00.751787 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[45] 2022/10/03 19:59:00.752023 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[45] 2022/10/03 19:59:00.752243 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"

Notice the route saying that it was stuck for more than 2 minutes! That could explain why the messages accumulate, so I wonder if there is not some kind of blocking situation, maybe not deadlock per-se but something wrong for sure. If you can reproduce and notice a server starting to build-up memory, the same way you collected nats1.heap, could you run the profile to collect CPU stats for few seconds, or even better, hit the "/stacksz" endpoint few times at several seconds intervals? That would help us a lot. Thanks!

tvojacek commented 1 year ago

I failed to get cluster to working state. Data on disk are causing some problem. Sending logs from this state. (May be data could have been corrupted by downgrade from 2.9.1 to 2.9.0) All Nodes restart periodically Node1 get completely separated from cluster nats sub ">" on node0 and node2 return data as expected on node1: nats sub ">" does nothing nats str report nats: error: setup failed: read tcp 127.0.0.1:56594->127.0.0.1:4223: i/o timeout

log from node1

[3586] 2022/10/04 10:35:26.653669 [INF] ---------------- JETSTREAM ----------------
[3586] 2022/10/04 10:35:26.653678 [INF]   Max Memory:      1.00 GB
[3586] 2022/10/04 10:35:26.653682 [INF]   Max Storage:     100.00 GB
[3586] 2022/10/04 10:35:26.653685 [INF]   Store Directory: "/data/jetstream"
[3586] 2022/10/04 10:35:26.653687 [INF]   Domain:          smarttx_cloud
[3586] 2022/10/04 10:35:26.653689 [INF] -------------------------------------------
[3586] 2022/10/04 10:35:26.657266 [INF]   Starting restore for stream '$G > STX_SERVER_DATA'
[3586] 2022/10/04 10:35:26.819008 [INF]   Restored 331,972 messages for stream '$G > STX_SERVER_DATA'
[3586] 2022/10/04 10:35:26.819371 [INF]   Recovering 2 consumers for stream - '$G > STX_SERVER_DATA'
[3586] 2022/10/04 10:35:26.825770 [INF] Starting JetStream cluster
[3586] 2022/10/04 10:35:26.825794 [INF] Creating JetStream metadata controller
[3586] 2022/10/04 10:35:26.827664 [INF] JetStream cluster recovering state
[3586] 2022/10/04 10:35:26.828924 [INF] Listening for leafnode connections on 0.0.0.0:7422
[3586] 2022/10/04 10:35:26.829459 [INF] Listening for client connections on 0.0.0.0:4222
[3586] 2022/10/04 10:35:26.829601 [INF] TLS required for client connections
[3586] 2022/10/04 10:35:26.829835 [INF] Server is ready
[3586] 2022/10/04 10:35:26.830500 [INF] Cluster name is nats
[3586] 2022/10/04 10:35:26.830549 [INF] Listening for route connections on 0.0.0.0:6222
[3586] 2022/10/04 10:35:26.835867 [INF] 172.30.41.131:6222 - rid:13 - Route connection created
[3586] 2022/10/04 10:35:26.836947 [INF] 172.30.19.253:6222 - rid:14 - Route connection created
[3586] 2022/10/04 10:35:26.930133 [WRN] Waiting for routing to be established...
[3586] 2022/10/04 10:35:27.481113 [INF] 172.30.19.253:46214 - rid:15 - Route connection created
[3586] 2022/10/04 10:35:27.481449 [INF] 172.30.19.253:46214 - rid:15 - Router connection closed: Duplicate Route
[3586] 2022/10/04 10:35:27.504002 [INF] 172.30.41.131:42562 - rid:16 - Route connection created
[3586] 2022/10/04 10:35:27.504294 [INF] 172.30.41.131:42562 - rid:16 - Router connection closed: Duplicate Route

node2.restart_it_self.log

node0.restart_it_self.log node1.1328.log log with trace enabled profile info: stuck_state_profile.tar.gz

derekcollison commented 1 year ago

Could you share the config files for the servers in the cluster?

tvojacek commented 1 year ago

nats.conf.txt

tvojacek commented 1 year ago

I have deleted /data/jetstream Restarted cluster Created jetstreams (config are in firts post)

nats --context=prod stream add STX_SERVER_DATA --config=raw-data-no-sources-prod.json
nats --context=prod consumer add STX_SERVER_DATA RAW_DATA_SERVICE --config=raw-data-service-consumer.json
nats --context=prod consumer add STX_SERVER_DATA DB_SERVICE --config=db-service-consumer.json
nats --context=prod stream edit STX_SERVER_DATA --config=raw-data-prod.json

Nats start as expected, for while it looked ok. Data were flowing. After while nats0 and nats2 start rising memory. Unit nats0 OOM at 12G. For unkown reason to me some leafnodes did not deliver data (jetstream report huge lag) All 12 leafnodes are right now test device simulators, running on same AWS machine, with shared config, only subjects are changing. Only exception is that on two leafnodes has jetstream max_bytes 20GiB while, rest 10 has only 5GB. Will sent logs and traces and stack in next post.

Before OOM crash on node0, node0 stopped respond to nats str info. consumer_report_nats0_04T174132.log

Status of jetstream are from node1 that keep responding. stx_info_04T174141.log consumer_report_nats1_04T173329.log

derekcollison commented 1 year ago

The server_name being POD_NAME, does that changes each time?

derekcollison commented 1 year ago

What does nats server ls report from the system account? How about nats server report jetstream?

And finally nats traffic, all from a system account user?

tvojacek commented 1 year ago

Profile info: profile.tar.gz logs.tar.gz

tvojacek commented 1 year ago

As can be seen in nats.conf. I do not have system account. It will take me while to change helmchart generated config or is there flag to create SYS account?

derekcollison commented 1 year ago

There is always a system account, you need to assign a user to it.

if you are server config, you can add this..

    # For access to system account.
    accounts { $SYS { users = [ { user: "admin", pass: "s3cr3t!" } ] } }

tvojacek commented 1 year ago

can I combine accounts{} with authorize{}?

kozlovic commented 1 year ago

Looking at some of the report, what I see from node1_3.stacksz is that this node is been busy recovering some state presumably for a stream:

goroutine 37 [runnable]:
encoding/binary.littleEndian.Uint32(...)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/encoding/binary/binary.go:81
github.com/nats-io/nats-server/v2/server.(*msgBlock).indexCacheBuf(0xc000a984e0, {0xc0022fe000, 0x3fffef, 0x400000?})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3439 +0x2d2
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc000a984e0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3721 +0x425
github.com/nats-io/nats-server/v2/server.(*msgBlock).fetchMsg(0xc000a984e0, 0x12cb0c7, 0x48?)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3751 +0xe5
github.com/nats-io/nats-server/v2/server.(*fileStore).msgForSeq(0xc0000f6a00, 0xa5dc20?, 0xc001dd4001?)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3899 +0x138
github.com/nats-io/nats-server/v2/server.(*fileStore).LoadMsg(0xc0000f6a00?, 0xc001dc4c00?, 0x1af3?)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:4013 +0x19
github.com/nats-io/nats-server/v2/server.(*raft).loadEntry(0xc000230600, 0xc001dd2480?)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2329 +0x4b
github.com/nats-io/nats-server/v2/server.(*Server).startRaftNode(0xc000138000, {0xa9d31e, 0x2}, 0xc000346e08)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:446 +0xeda
github.com/nats-io/nats-server/v2/server.(*jetStream).createRaftGroup(0xc00011e540, {0xa9d31e, 0x2}, 0xc0003e4960, 0x16)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1629 +0x845
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream(0xc00011e540, 0xc00009ab40, 0xc000174fc0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2933 +0x10f
github.com/nats-io/nats-server/v2/server.(*jetStream).processStreamAssignment(0xc00011e540, 0xc000174fc0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2696 +0x457
github.com/nats-io/nats-server/v2/server.(*jetStream).applyMetaSnapshot(0xc00011e540, {0xc000172046, 0x496, 0x49f})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1203 +0x75a
github.com/nats-io/nats-server/v2/server.(*jetStream).applyMetaEntries(0xc00011e540, {0xc000012150, 0x1, 0x0?}, 0xc000347a18)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1439 +0xad5
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorCluster(0xc00011e540)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:971 +0xb68
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3077 +0x85

this is done under the "js" lock that blocks all other go routines trying to get that lock, including from the routes:

goroutine 32 [semacquire, 4 minutes]:
sync.runtime_SemacquireMutex(0xc000138038?, 0x68?, 0xc00034ac28?)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/sync/rwmutex.go:71
github.com/nats-io/nats-server/v2/server.(*Server).sendStatsz(0xc000138000, {0xc000418140, 0x4b})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:691 +0x39e
github.com/nats-io/nats-server/v2/server.(*Server).processNewServer(0xc000138000, 0xc00034b0f8)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:1298 +0x287
github.com/nats-io/nats-server/v2/server.(*Server).updateRemoteServer(0xc000138000, 0xc00034b0f8)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:1270 +0x1d6
github.com/nats-io/nats-server/v2/server.(*Server).remoteServerUpdate(0xc000138000, 0xc0001c3200?, 0xc000379980, 0x4b?, {0x0?, 0x800?}, {0x0?, 0x0?}, {0xc00017905c, 0x45c, ...})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:1257 +0x396
github.com/nats-io/nats-server/v2/server.(*client).deliverMsg(0xc000379980, 0x0, 0xc000210540, 0x4fa?, {0xc00017900a, 0x4b, 0xff6}, {0x0, 0x0, 0x0}, ...)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3193 +0xb03
github.com/nats-io/nats-server/v2/server.(*client).processMsgResults(0xc000379980, 0xc00009ad80, 0xc00058c5a0, {0xc00017905c, 0x45e, 0xfa4}, {0x0, 0x0, 0x1?}, {0xc00017900a, ...}, ...)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4227 +0xb10
github.com/nats-io/nats-server/v2/server.(*client).processInboundRoutedMsg(0xc000379980, {0xc00017905c, 0x45e, 0xfa4})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/route.go:443 +0x159
github.com/nats-io/nats-server/v2/server.(*client).processInboundMsg(0xc000379980?, {0xc00017905c?, 0x45e?, 0xfa4?})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3505 +0x36
github.com/nats-io/nats-server/v2/server.(*client).parse(0xc000379980, {0xc000179000, 0x4ba, 0x1000})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:497 +0x210a
github.com/nats-io/nats-server/v2/server.(*client).readLoop(0xc000379980, {0x0, 0x0, 0x0})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1238 +0xf36
github.com/nats-io/nats-server/v2/server.(*Server).createRoute.func1()
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/route.go:1372 +0x25
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3077 +0x85

This shows that this has been blocked for 4+ minutes. All the /healthz are stuck too (there are several at different time, this one shows being blocked for 4 minutes too):

goroutine 90 [semacquire, 4 minutes]:
sync.runtime_SemacquireMutex(0xc000138000?, 0x40?, 0x0?)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/sync/rwmutex.go:71
github.com/nats-io/nats-server/v2/server.(*Server).healthz(0xb9dfd0?, 0xc0001b29c6)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/monitor.go:3022 +0x1c5
github.com/nats-io/nats-server/v2/server.(*Server).HandleHealthz(0xc000138000, {0xb9dfd0, 0xc00011e0e0}, 0x4de5e9?)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/monitor.go:2978 +0x114
net/http.HandlerFunc.ServeHTTP(0xc0001b2af0?, {0xb9dfd0?, 0xc00011e0e0?}, 0x0?)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x0?, {0xb9dfd0, 0xc00011e0e0}, 0xc000f05300)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/net/http/server.go:2487 +0x149
net/http.serverHandler.ServeHTTP({0xc000cce150?}, {0xb9dfd0, 0xc00011e0e0}, 0xc000f05300)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0000ec0a0, {0xb9e878, 0xc00019e900})
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/net/http/server.go:3102 +0x4db

So of course we need to have the route not blocked by that, but it seems that this server is taking quite a bit of time to recover things from disk. Will continue to investigate more.

tvojacek commented 1 year ago

server ls, nats server report jetstream,nats traffic

server_ls.txt

Do you need more information?

kozlovic commented 1 year ago

@TomasVojacek I am tracking down the route being blocked and am going to see if I can get a PR today to try address that. Assuming that I get a fix, will you be able to run a nightly docker image to verify that it helps in your environment? If not, there was a plan for a 2.9.3 release soon, so that would go into that one anyway.

tvojacek commented 1 year ago

@kozlovic thank you looking forward nightly docker image.

kozlovic commented 1 year ago

@TomasVojacek @derekcollison There are MANY places in the server where a routed message will be processed by a function that may try to get the "Jetstream" lock. So not sure if we will be able to really prevent that from happening. So I think that the bigger question here is why is it taking more than 4 minutes (and maybe more) to be in that same routine:

goroutine 37 [runnable]:
encoding/binary.littleEndian.Uint32(...)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/encoding/binary/binary.go:81
github.com/nats-io/nats-server/v2/server.(*msgBlock).indexCacheBuf(0xc000a984e0, {0xc0022fe000, 0x3fffef, 0x400000?})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3439 +0x2d2
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc000a984e0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3721 +0x425
...

under the Jetstream lock. It looks like there may be some issues accessing the storage (meaning it takes way to long to do those I/O operations)?

tvojacek commented 1 year ago

nats run in aws kubernetes EKS on r5a.large (2 CPU cores 16GB, instance data are stored persistent volumes backed by aws ebs gp2 volume 100GB. It is ssd with guaranteed 3000 iops. I'm not aware of problems with files system. But nats is most busy part of kubernetes. I can try run it on aws ec2 machine and check if results are the same. I started using nats 2.8.4 with 2 leafnodes, nats cluster 3 docker containers and it OOM similar way as in aws EKS. But I was limited by 16GB total memory of my notebook, it OOM more frequently. Any other suggestion I can try?

kozlovic commented 1 year ago

@TomasVojacek The processing of a stream assignment is recovering from disk under the jetstream lock that is then tried to be acquired while processing some protocol from the route:

goroutine 32 [semacquire, 4 minutes]:
sync.runtime_SemacquireMutex(0xc000138038?, 0x68?, 0xc00034ac28?)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
    /home/travis/.gimme/versions/go1.19.1.linux.amd64/src/sync/rwmutex.go:71
github.com/nats-io/nats-server/v2/server.(*Server).sendStatsz(0xc000138000, {0xc000418140, 0x4b})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:691 +0x39e
github.com/nats-io/nats-server/v2/server.(*Server).processNewServer(0xc000138000, 0xc00034b0f8)
...

This is blocking for 4 minutes in the example above, which means that it causes the route to not deliver any new message from the other servers, etc.. I am still trying to see how this can be improved, so at this point no action in your part, except maybe trying with different environment. If you do so, please use v2.9.2 (the latest).

kozlovic commented 1 year ago

Merged the PR https://github.com/nats-io/nats-server/pull/3519 that should avoid blocking the other go routines (especially when coming from routes) in PR https://github.com/nats-io/nats-server/pull/3519. I have kicked the production of the nightly so that you can give it a try and see if that resolves your issue. Pull the synadia/nats-server:nightly and make sure you get this:

[1] 2022/10/05 13:20:12.884436 [INF] Starting nats-server
[1] 2022/10/05 13:20:12.884639 [INF]   Version:  2.9.3-beta.1
[1] 2022/10/05 13:20:12.884642 [INF]   Git:      [37ca9722]

tvojacek commented 1 year ago

I have tested 2.9.3-beta.1. nats now OOM much faster than before. Screenshot from 2022-10-05 17-34-44

We have checked that during first event one of aws drive shows higher latencies. Screenshot from 2022-10-05 17-40-29 I was wrong our drive is gp2 with 300 iops. I thought that it is 3000. End of chart shows situation then 2.9.3-beta-1 try to recover.

2.9.3-beta1_logs.tar.gz 293_beta1_prof.tar.gz

kozlovic commented 1 year ago

@TomasVojacek Like I said, this would not solve the underlying problem is that the disk is too slow. Now without blocking the route, there is no "artificial" back pressure and data will accumulate even faster. I will look at the attached data, but unless you make your disk faster, you will always have problems.

tvojacek commented 1 year ago

What I do not understand is when stream report 209MiB data, we need 10GB of RAM to cache them 97.30% 97.30% 9.40GB 97.30% github.com/nats-io/nats-server/v2/server.(*msgBlock).msgFromBuf.

STX_SERVER_DATA │ File │ │ 2 │ 72,619 │ 209 MiB │ 0 │ 640131 │ smart-tx-production-nats-0!, smart-tx-production-nats-1!, smart-tx-production-nats-2!

I will try to get faster disks.

kozlovic commented 1 year ago

What I see from this new set of stacks is that some of the nodes are waiting on a stream's lock to unsubscribe when processing the response of setting a consumer of the source (from all your leafnode connections), but the stream lock is held when processing a leader change while that requires the RAFT's lock, which is again held while recovering messages from its log.

From leafnode connections:

sync.(*RWMutex).Lock(0x0?)
    /opt/hostedtoolcache/go/1.19.1/x64/src/sync/rwmutex.go:147 +0x36
github.com/nats-io/nats-server/v2/server.(*stream).unsubscribeUnlocked(0xc0001c4000, 0xc0003c6a80)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/stream.go:3213 +0x25
github.com/nats-io/nats-server/v2/server.(*stream).setSourceConsumer.func1(0xc000331228?, 0xc0011c1300, 0x9e47c0?, {0x0?, 0x0?}, {0xc000331268?, 0x41353f?}, {0xc042964063, 0x2d8, 0x3f9d})

This is the stream lock that is trying to be acquired, but held by this:

goroutine 550 [semacquire, 1 minutes]:
sync.runtime_SemacquireMutex(0xc0001c4000?, 0x0?, 0xc00002a4f8?)
    /opt/hostedtoolcache/go/1.19.1/x64/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
    /opt/hostedtoolcache/go/1.19.1/x64/src/sync/rwmutex.go:71
github.com/nats-io/nats-server/v2/server.(*raft).GroupLeader(0xc0001c4600?)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:1213 +0x5e
github.com/nats-io/nats-server/v2/server.(*stream).setLeader(0xc0001c4000, 0x1)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/stream.go:649 +0x13f
github.com/nats-io/nats-server/v2/server.(*jetStream).processStreamLeaderChange(0xc000154540, 0xc0001c4000, 0x1)

while itself try to get the lock from raft group that is busy recovering:

goroutine 548 [runnable]:
syscall.Syscall(0x4a4467?, 0xc000e9d4d8?, 0x800000?, 0x7ffff800000?)
    /opt/hostedtoolcache/go/1.19.1/x64/src/syscall/syscall_linux.go:68 +0x27
syscall.read(0xc0009267e0?, {0xc0321f4000?, 0x0?, 0x0?})
    /opt/hostedtoolcache/go/1.19.1/x64/src/syscall/zsyscall_linux_amd64.go:696 +0x45
syscall.Read(...)
    /opt/hostedtoolcache/go/1.19.1/x64/src/syscall/syscall_unix.go:183
internal/poll.ignoringEINTRIO(...)
    /opt/hostedtoolcache/go/1.19.1/x64/src/internal/poll/fd_unix.go:794
internal/poll.(*FD).Read(0xc0009267e0?, {0xc0321f4000?, 0x3fecac?, 0x400000?})
    /opt/hostedtoolcache/go/1.19.1/x64/src/internal/poll/fd_unix.go:163 +0x285
os.(*File).read(...)
    /opt/hostedtoolcache/go/1.19.1/x64/src/os/file_posix.go:31
os.(*File).Read(0xc11a4ca128, {0xc0321f4000?, 0x7d343c?, 0xee1d40?})
    /opt/hostedtoolcache/go/1.19.1/x64/src/os/file.go:118 +0x5e
io.ReadAtLeast({0xb9b6a0, 0xc11a4ca128}, {0xc0321f4000, 0x3fecac, 0x400000}, 0x3fecac)
    /opt/hostedtoolcache/go/1.19.1/x64/src/io/io.go:332 +0x9a
io.ReadFull(...)
    /opt/hostedtoolcache/go/1.19.1/x64/src/io/io.go:351
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadBlock(0x0?, {0x0, 0xc15b55ae00?, 0x0})
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/filestore.go:3656 +0x15d
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc000846ea0)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/filestore.go:3702 +0x2e5
github.com/nats-io/nats-server/v2/server.(*msgBlock).fetchMsg(0xc000846ea0, 0x602595, 0x48?)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/filestore.go:3751 +0xe5
github.com/nats-io/nats-server/v2/server.(*fileStore).msgForSeq(0xc000138780, 0xa5dc20?, 0x7f71e4251101?)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/filestore.go:3899 +0x138
github.com/nats-io/nats-server/v2/server.(*fileStore).LoadMsg(0x2030a3?, 0x88?, 0xc000e9d8e8?)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/filestore.go:4013 +0x19
github.com/nats-io/nats-server/v2/server.(*raft).loadEntry(0xc0001c4600, 0xc0016fc810?)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:2329 +0x4b
github.com/nats-io/nats-server/v2/server.(*raft).applyCommit(0xc0001c4600, 0x602595)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:2363 +0x125
github.com/nats-io/nats-server/v2/server.(*raft).trackResponse(0xc0001c4600, 0xc000266640)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:2486 +0x1c5
github.com/nats-io/nats-server/v2/server.(*raft).processAppendEntryResponse(0xc0001c4600, 0xc000266640)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:3028 +0x1da
github.com/nats-io/nats-server/v2/server.(*raft).runAsLeader(0xc0001c4600)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:2026 +0xb4f
github.com/nats-io/nats-server/v2/server.(*raft).run(0xc0001c4600)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/raft.go:1631 +0x2ba
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/server.go:3077 +0x85

I think that your cluster is very unstable and wonder if you not have some looping going on, or it is simply that the disk is WAY too slow.

tvojacek commented 1 year ago

kubernetes cluster with excpetion of nats does not show stability problems. What type of disk use ngs on aws nodes?

kozlovic commented 1 year ago

@TomasVojacek From our ops team:

300 I/O requests per second seems light for something hosting general workloads in K8S where a data-heavy jetstream volume is part of it.
I'd make sure they are using a Persistent Volume Claim for the JS store, provisioned from a pool where they control the characteristics, and adjust the performance of the PV as needed.  Nothing less than 1000 IOPS if they're doing any sort of real volume. Just ls -l is going to be quite a few I/O operations.

On NGS, IOPs have been measured at more than 10,000...

tvojacek commented 1 year ago

Do we want to keep this issues open, add warning to docs and close it? Does mean that nats jetstreams are not supported on hardware with magnetic disks?

kozlovic commented 1 year ago

We could certainly add a note in the documentation. As for "Does mean that nats jetstreams are not supported on hardware with magnetic disks?": no, it doesn't mean that, but they need to have enough speed. I don't think it is the disk performance itself, but likely some imposed limits by your environment you are running on.

Have you tried with better IOPs? Are you still experiencing the issue?

tvojacek commented 1 year ago

AWS gp2 (volume 100GB) (disk latency avg 3ms max 5ms) AWS gp3 (volume 100GB) 3000 iops similar to gp2 (disk latency 3ms) AWS gp3 (volume 100GB) 6000 iops better (disk latency avg 1,5ms) not sure if it resolve problem but price of disk is too high. I was not able to test disk with higher iops on friday. But it will be even more expensive. Conclusion is that I did not find in AWS portfolio suitable disk. I will try to use memory jetstream as alternative solution.

derekcollison commented 1 year ago

I think something else is going on, although IOPS does come into play, JetStream works with the kernel to be efficient here, so I don't think that is the direct issue.

kozlovic commented 1 year ago

@derekcollison Please have a look at the investigation of the last data provided by @TomasVojacek. Unless you think otherwise, I do believe that the disk is likely the root of the issues: https://github.com/nats-io/nats-server/issues/3517#issuecomment-1268662491

derekcollison commented 1 year ago

Would need to have access to the system directly to poke a bit, but not convinced just slow disk.

tvojacek commented 1 year ago

@derekcollison if you are interested i can allow you access to cluster, trough my notebook. I'm on nats slack [tvojacek@amp.energy]

derekcollison commented 1 year ago

Like @kozlovic said we want to understand if it was a true short write or other corruption. This is the first case we have seen that tripped the intentional panic inside the server. If the system has a > R1 replication factor we can recover by doing a reset.

Would it be possible to send us the storage directory (just the meta/js) securely?

kozlovic commented 1 year ago

@derekcollison I think your comment is more for the panic on decodeConsumerState(), which is in issue https://github.com/nats-io/nats-server/issues/3535

derekcollison commented 1 year ago

Yes apologies that is correct.

derekcollison commented 1 year ago

Will close the issue but feel free to reopen if needed. If this still presents with a current server, 2.9.17 or above let us know.

nats-io / nats-server