Jetstream can't restore after a network breakdown

wchajl commented 2 years ago

I’m deploying a Jetstream cluster with 3 nodes in k8s. It occurs that Jetstream doesn’t work after a network breakdown . The same issue occurred yesterday. And it can only resume after manual restarting the servers. ![Uploading image.png…]()

wchajl commented 2 years ago

MobaXterm_KFJ0WSUM46

ripienaar commented 2 years ago

Please try to copy and paste the logs or download or something. Best to see all the logs

wchajl commented 2 years ago

@ripienaar idms-dev-nats-2.log

ripienaar commented 2 years ago

Thanks that will help so so e look into it

derekcollison commented 2 years ago

Which version of the server?

kozlovic commented 2 years ago

@wchajl Few questions:

What type of consumers are you using? push or pull?
If push, did not notice a lot of pending messages on those? (with nats c report or nats c info on a particular one)
Would you be willing to capture some stack traces of the remote servers (not necessarily the one where you see the route slow consumer, but the other side, or both if they both complain)? I notice that you run v2.7.3 and have the monitoring port enabled, so that would mean hitting the http://host:port/stacksz endpoint. Alternatively, you could run with the -profile <port> enabled and then we can use go tool pprof http://host:<profile port>/debug/pprof/profile?seconds=5 and you could send us the output file or check it yourself.

If the slow down is due to ack processing on the other side, then PR https://github.com/nats-io/nats-server/pull/2898 will help, but it is not yet merged.

wchajl commented 2 years ago

@kozlovic I resumed this cluster by killing the pods yesterday, so I'm afraid that I cant get the stack information .

We are using push consumer.
After restarting the cluster , I got the info of consumers . There are losts of messages , but I can't be sure it was the same since jetstream api didn't work that time . I will save the stack log when it happens again.
I had tried to use natscli to sub a subject from one node and pub a msg from another in that cluster ,and nothing was received. It seems that the cluster connections wouldn't work that time. And when I tried to send msgs to a stream , it would report "no responder" errror, meanwhile stream in another account can be inspected.

joriatyBen commented 2 years ago

wchajl commented 2 years ago

@kozlovic It happens again. nats-log.log stacksz.log

kozlovic commented 2 years ago

@wchajl I wonder if the stacksz is complete or not, but it was enough to see that your system has a problem because there is a go routine stuck in a system call for more than 2 minutes:

goroutine 2798914 [syscall, 2 minutes]:
syscall.Syscall(0x4a, 0xd, 0x0, 0x0)
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/syscall/asm_linux_amd64.s:20 +0x5
syscall.Fsync(0x8)
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/syscall/zsyscall_linux_amd64.go:414 +0x30
internal/poll.(*FD).Fsync.func1(...)
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/internal/poll/fd_fsync_posix.go:19
internal/poll.ignoringEINTR(...)
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/internal/poll/fd_posix.go:75
internal/poll.(*FD).Fsync(0x413586)
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/internal/poll/fd_fsync_posix.go:18 +0xf4
os.(*File).Sync(0xc00022a330)
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/os/file_posix.go:169 +0x4e
github.com/nats-io/nats-server/v2/server.(*fileStore).syncBlocks(0xc00032c000)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:2925 +0x192
created by time.goFunc
    /home/travis/.gimme/versions/go1.17.7.linux.amd64/src/time/sleep.go:180 +0x31

This go routine is holding some lock that then causes other routines to be blocked waiting for this lock to be released.

I am not sure that explains the route "Write Error" or "Stale Connection", but instead I think that those are the result of the same issue of losing network connectivity. I assume that your storage is maybe network attached too?

kozlovic commented 2 years ago

Also, I would strongly recommend that you use the latest release, since version 2.7.3 had some issues.

wchajl commented 2 years ago

@kozlovic Thanks for your reply. We deployed nats-server in three k8s clusters with the storage class of ceph . And the same issue happened in all clusters. I've updated nats-server to 2.7.4. But the jetstream in one cluster seems to be corrupted , I'm not sure whether it's the same cause. One node in the cluster restores while the other 2 failed and keep restarting,and they cant get restored event after manually killing them.
The logs indicates that there was a panic: corrupt state file . I'm pretty confused,could you please give me some suggestions on how to restore the jetstream ? thanks! stacksz.log nats-1.log nats_0_stacksz.log nats0_previous.log

derekcollison commented 2 years ago

We need to verify that the ceph setup is correct and the system is not inadvertently sharing the same directory/file from multiple servers.

We have a few folks on our side who are more versed in ceph, and k8s than I.

/cc @philpennock @bwerthmann @caleblloyd

kozlovic commented 2 years ago

@wchajl Looks like the 2 stacks that you provided show that the server is actively recovering the state since the routine reading blocks has no indication that it is blocked:

goroutine 1 [syscall]:
syscall.Syscall(0x0, 0x7, 0xc004734000, 0xfff12b)
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/syscall/asm_linux_amd64.s:20 +0x5
syscall.read(0xc0003098c0, {0xc004734000, 0x0, 0x42c0a0})
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/syscall/zsyscall_linux_amd64.go:687 +0x4d
syscall.Read(...)
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/syscall/syscall_unix.go:189
internal/poll.ignoringEINTRIO(...)
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/internal/poll/fd_unix.go:582
internal/poll.(*FD).Read(0xc0003098c0, {0xc004734000, 0xfff12b, 0xfff12b})
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/internal/poll/fd_unix.go:163 +0x285
os.(*File).read(...)
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/os/file_posix.go:32
os.(*File).Read(0xc0000b00f0, {0xc004734000, 0xc00016b9b8, 0x44c512})
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/os/file.go:119 +0x5e
io.ReadAtLeast({0xaf18e0, 0xc0000b00f0}, {0xc004734000, 0xfff12b, 0xfff12b}, 0xfff12b)
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/io/io.go:328 +0x9a
io.ReadFull(...)
    /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/io/io.go:347
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadBlock(0xc0000001a0, {0x0, 0x0, 0x0})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3236 +0x136
github.com/nats-io/nats-server/v2/server.(*msgBlock).loadMsgsWithLock(0xc001571860)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:3282 +0x2e9
github.com/nats-io/nats-server/v2/server.(*msgBlock).generatePerSubjectInfo(0xc001571860)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:4415 +0xef
github.com/nats-io/nats-server/v2/server.(*msgBlock).readPerSubjectInfo(0xc001571860)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:4467 +0x394
github.com/nats-io/nats-server/v2/server.(*fileStore).recoverMsgBlock(0xc000392280, {0xaffff8, 0xc00028bba0}, 0x13c5)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:611 +0xff0
github.com/nats-io/nats-server/v2/server.(*fileStore).recoverMsgs(0xc000392280)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:881 +0x858
github.com/nats-io/nats-server/v2/server.newFileStoreWithCreated({{0xc000228750, 0x25}, _, _, _, _}, {{0xc000233370, 0x9}, {0x0, 0x0}, ...}, ...)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/filestore.go:318 +0x774
github.com/nats-io/nats-server/v2/server.(*stream).setupStore(0xc000392000, 0xc00016ca90)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/stream.go:2459 +0x2d5
github.com/nats-io/nats-server/v2/server.(*Account).addStreamWithAssignment(0xc000127d40, 0xc000380018, 0x0, 0x0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/stream.go:413 +0xdcb
github.com/nats-io/nats-server/v2/server.(*Account).addStream(0xc000233380, 0xc000232930)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/stream.go:264 +0x1d
github.com/nats-io/nats-server/v2/server.(*Account).EnableJetStream(0xc000127d40, 0xe0a3a0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream.go:1159 +0x2be5
github.com/nats-io/nats-server/v2/server.(*Server).configJetStream(0xc0001c6000, 0xc000127d40)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream.go:630 +0x9c
github.com/nats-io/nats-server/v2/server.(*Server).configAllJetStreamAccounts(0xc0001c6000)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream.go:677 +0x176
github.com/nats-io/nats-server/v2/server.(*Server).enableJetStreamAccounts(0xc00024e370)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream.go:563 +0xb1
github.com/nats-io/nats-server/v2/server.(*Server).enableJetStream(0xc0001c6000, {0x40000000, 0x3200000000, {0xc000232450, 0xf}, {0xc00019e99d, 0xb}})
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream.go:377 +0x919
github.com/nats-io/nats-server/v2/server.(*Server).EnableJetStream(0xc0001c6000, 0xc00016de38)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream.go:189 +0x40c
github.com/nats-io/nats-server/v2/server.(*Server).Start(0xc0001c6000)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:1759 +0xf65
github.com/nats-io/nats-server/v2/server.Run(...)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/service.go:22
main.main()
    /home/travis/gopath/src/github.com/nats-io/nats-server/main.go:118 +0x2fa

Between stacksz.log and nats_0_stacksz.log (which is from the same server), I see a 2 minutes difference in other go routines, but that specific go routine does not show being stuck in one system call, hence my assumption that it is actively recovering.

The nats-1.log shows that it is isolated from the other servers, or that there are routes that are no longer reachable:

[12] 2022/03/22 11:21:59.255414 [DBG] Trying to connect to route on idms-native-test-nats-2.idms-native-test-nats.idms-nats.svc.cluster.local:6222 (100.101.145.66:6222)
[12] 2022/03/22 11:21:59.256652 [ERR] Error trying to connect to route (attempt 3923): dial tcp 100.101.145.66:6222: connect: connection refused
[12] 2022/03/22 11:21:59.340148 [DBG] Trying to connect to route on idms-native-test-nats-0.idms-native-test-nats.idms-nats.svc.cluster.local:6222 (100.101.138.72:6222)
[12] 2022/03/22 11:21:59.340604 [ERR] Error trying to connect to route (attempt 255): dial tcp 100.101.138.72:6222: connect: connection refused

Regarding the logs, nats0_previous.log, which shows that it started and is a 2.7.4 server, took more than 20 minutes to recover the state for about 10 million messages and 5 consumers:

[7869] 2022/03/22 10:55:30.843584 [DBG] Recovering JetStream state for account "CEC"
[7869] 2022/03/22 11:17:41.990194 [INF]   Restored 9,902,004 messages for stream 'CEC > cecstream'
[7869] 2022/03/22 11:17:41.990948 [INF]   Recovering 5 consumers for stream - 'CEC > cecstream'

It also shows the panic decoding the consumer state:

[7869] 2022/03/22 11:17:42.773338 [DBG] Starting consumer monitor for 'CEC > cecstream > eventPersister2' [C-R3F-YzqQD7aI]
(...)
[7869] 2022/03/22 11:17:42.804898 [ERR] JetStream cluster could not decode consumer snapshot for 'CEC > cecstream > eventPersister2' [C-R3F-YzqQD7aI]
[7869] 2022/03/22 11:17:42.804939 [DBG] Exiting consumer monitor for 'CEC > cecstream > eventPersister2' [C-R3F-YzqQD7aI]
panic: corrupt state file

goroutine 4603 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyConsumerEntries(0xc0001306e0, 0xc000221000, 0x0, 0x0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3167 +0x79c
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorConsumer(0xc000174000, 0xc000221000, 0xc002226a20)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3134 +0x835
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer.func1()
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2923 +0x25
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3024 +0x87

I will let @derekcollison comment on that on what could be done to skip that (maybe delete the corresponding consumer directory?)

Finally, you mentioned that you have deployed in 3 different k8s clusters, are you saying that you have 3 NATS clusters connected with gateways, or is it that you have 1 NATS cluster of 3 nodes, each node running in a k8s cluster? If the former, you need to make sure that each node has a unique server name! The server detects duplicate server names within a cluster, but does not across gateways (this will be the case in the next release).

wchajl commented 2 years ago

@kozlovic I have 3 isolated nats clusters (dev , native-test , test) ,and each cluster consists of 3 nodes .

The logs I uploaded 15 days ago are from dev cluster. I've upgraded the nats-server to 2.7.3 after the issue occurs, and it worked for a while .
The log/stacks I uploaded 3 days ago are from test cluster . I've upgraded that cluster since then ,and there is now issue report so far.
The logs/stacks I uploaded yesterday are from native-test cluster whose nats version is 2.7.4 . Nodes in this cluster are restarting , and all nodes report "corrupt state file".
And now the dev cluster is corrupted again. Looks like exactly the problem happened 15 days ago. It could back to work after I manually delete pods . I've upgraded it to 2.7.4. And here is log/stacksz of each node. dev-0.log dev-0-stacksz.log dev-1.log dev-1-stacksz.log dev-2.log dev-2-stacksz.log

kozlovic commented 2 years ago

I've upgraded it to 2.7.4. And here is log/stacksz of each node.

From the all stack traces, I can tell that this is v2.7.3, not v2.74 (lines don't match for 2.7.4, but do for 2.7.3). Also, for all logs but dev-0.log that does not have the starting banner, they show that this is a v2.7.3:

[19] 2022/03/09 08:40:33.652384 [INF] Starting nats-server
[19] 2022/03/09 08:40:33.654083 [INF]   Version:  2.7.3
[19] 2022/03/09 08:40:33.654105 [INF]   Git:      [1712ee3]
[19] 2022/03/09 08:40:33.654127 [INF]   Name:     idms-dev-nats-1
[19] 2022/03/09 08:40:33.654144 [INF]   Node:     jD6ufzcE
[19] 2022/03/09 08:40:33.654152 [INF]   ID:       NA4TOYIIN7OIVBAA2EH3HJGUZIPSKRKNFDHHUDVUJTGERNQJGDVKHHRH
[19] 2022/03/09 08:40:33.654185 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[19] 2022/03/09 08:40:33.654227 [INF] Using configuration file: /nats.conf
[19] 2022/03/09 08:40:33.658270 [INF] Starting http monitor on 0.0.0.0:8222
[19] 2022/03/09 08:40:33.658426 [INF] Starting JetStream

I have 3 isolated nats clusters

Then why do you have gateways? If there is any chance that these "isolated" clusters are gateway connected, and nodes in a given cluster share the same name than nodes in another cluster, this would cause issues. So if they are really isolated, then you should remove the gateway definition on the config of each node. I see a lot of authentication failures on the gateway port, things like that:

[19] 2022/03/20 00:06:44.984063 [INF] 100.101.13.129:57698 - gid:141112 - Processing inbound gateway connection
[19] 2022/03/20 00:06:44.998957 [ERR] 100.101.13.129:57698 - gid:141112 - authentication error - User ""
[19] 2022/03/20 00:06:44.999356 [INF] 100.101.13.129:57698 - gid:141112 - Gateway connection closed: Authentication Failure

and even external processes (non NATS) trying to connect to the gateway (and clients) ports: Gateway:

[19] 2022/03/20 00:06:45.074055 [INF] 100.101.138.106:32922 - gid:141115 - Processing inbound gateway connection
[19] 2022/03/20 00:06:45.091028 [ERR] 100.101.138.106:32922 - gid:141115 - Gateway parser ERROR, state=10, i=1: proto='"lient string for PARC version 1 "...'
[19] 2022/03/20 00:06:45.091079 [INF] 100.101.138.106:32922 - gid:141115 - Gateway connection closed: Protocol Violation

Client:

[19] 2022/03/19 13:38:21.576847 [ERR] 100.101.145.15:33690 - cid:139625 - Client parser ERROR, state=10, i=1: proto='"lient string for PARC version 1 "...'

So although this is v2.7.3, and not the latest, I believe that stacks from dev-1-stacksz.log shows a deadlock. So that is a "good" thing in that it identifies a deadlock that will likely need fixing in the upcoming release (I think it may still be present in v2.7.4 - need to do more research).

We will work on that, in the meantime, it would be good that you remove gateways if not needed and make sure your servers have unique names (in case one day you need to interconnect them all). Note that if you do have to change the server name, it is likely that you will need to wipe the state (I understand that this is dev/test environment) because the server names is hash'ed and used in RAFT protocols.

wchajl commented 2 years ago

@kozlovic Thanks for your advice. But we intend to reserve the gateways so new clusters could join it without modifying current config. Maybe I should add some authorization to protect it ？

kozlovic commented 2 years ago

@wchajl You should add some protection for sure (unless this is only for dev/test env). Notice that you already see that there are authentication errors ("Gateway connection closed: Authentication Failure") because the process that tries to connect is not sending the proper CONNECT protocol. If you keep gateways, please make sure that your server names will be unique across the super-cluster.

nats-io / nats-server

Jetstream can't restore after a network breakdown #2912