nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Placed an entry at the wrong index #2391

Open mikluko opened 2 years ago

mikluko commented 2 years ago

Defect

Versions of nats-server and affected client libraries used:

NATS Server v2.3.2, deployed by Helm chart v0.8.4

OS/Container environment:

RKE 1.2.9, Kubernetes 1.20.8, Docker 20.10.7

Steps or code to reproduce the issue:

It's a 5-node cluster. The cluster had been in use for quite some time before it was restarted with kubectl rollout restart statefulset/nats.

Expected result:

All cluster pods restart one by one, in a rolling manner.

Actual result:

4 out of 5 pods restart successfully. One goes into a crash loop with the following log output tail:

[83] 2021/07/31 07:45:33.165840 [DBG] RAFT [cnrtt3eg - C-R3F-er29H0Re] Update peers from leader to map[bkCGheKT:0xc0005ce100 cnrtt3eg:0xc00013f120 yrzKKRBu:0xc0005ce0f0]
panic: [nats-2-S-R3F-VkruMIMt] Placed an entry at the wrong index, ae is &{leader:yrzKKRBu term:34 commit:208 pterm:34 pindex:208 entries: 1}, seq is 218, n.pindex is 208

goroutine 154 [running]:
github.com/nats-io/nats-server/server.(*raft).storeToWAL(0xc000b44000, 0xc07fab7100, 0xe99920, 0x8)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2774 +0x357
github.com/nats-io/nats-server/server.(*raft).processAppendEntry(0xc000b44000, 0xc07fab7100, 0xc08e06a0c0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2636 +0x8c5
github.com/nats-io/nats-server/server.(*raft).runAsFollower(0xc000b44000)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:1464 +0x1ce
github.com/nats-io/nats-server/server.(*raft).run(0xc000b44000)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:1416 +0xed
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2827 +0xc5

Full log
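
For context, the panic above is a consistency guard: the append entry claims it follows log index pindex (208 here), so storing it should land at sequence pindex+1, but the follower's WAL returned sequence 218. A minimal, illustrative sketch of that kind of check (not the actual nats-server code; the types and names below are simplified stand-ins):

    package main

    import "fmt"

    // appendEntry mimics the fields shown in the panic message above.
    type appendEntry struct {
        term, commit, pterm, pindex uint64
    }

    // wal is a stand-in for the follower's write-ahead log.
    type wal struct{ seq uint64 }

    func (w *wal) store() uint64 { w.seq++; return w.seq }

    // storeToWAL panics rather than let the local log diverge from what the
    // leader claims should precede this entry.
    func storeToWAL(w *wal, ae *appendEntry) {
        seq := w.store()
        if ae.pindex != seq-1 {
            panic(fmt.Sprintf("Placed an entry at the wrong index, pindex is %d, seq is %d", ae.pindex, seq))
        }
    }

    func main() {
        w := &wal{seq: 217}             // local log already holds entries up to 217
        ae := &appendEntry{pindex: 208} // leader says this entry follows index 208
        storeToWAL(w, ae)               // 208 != 218-1, so this panics, as in the report
    }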

derekcollison commented 2 years ago

Thanks, would it be possible to privately send us the complete storage directory for each server?

carr123 commented 2 years ago

I encountered the same problem today. I made thousands of MQTT connections to a NATS cluster, and minutes later one node crashed. Then I restarted the crashed node and noticed it takes too long (more than 60 seconds) to listen on TCP ports 4222, 8222, 1883 after the process starts. I stopped testing; no heavy workload any more, only a few NATS requests made by other programs. A few minutes later, another node crashed.

It seems something is wrong with the Raft logic.

Here is the log:

panic: [server226-S-R2F-ziqQhGPV] Placed an entry at the wrong index, ae is &{leader:1txmfIKU term:184 commit:0 pterm:0 pindex:0 entries: 1}, seq is 4, n.pindex is 0

goroutine 14561 [running]:
github.com/nats-io/nats-server/server.(*raft).storeToWAL(0xc000dee2c0, 0xc00673ce80, 0xf, 0xc0364cca40)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2774 +0x357
github.com/nats-io/nats-server/server.(*raft).processAppendEntry(0xc000dee2c0, 0xc00673ce80, 0xc0066c26c0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2636 +0x8c5
github.com/nats-io/nats-server/server.(*raft).runAsFollower(0xc000dee2c0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:1464 +0x1ce
github.com/nats-io/nats-server/server.(*raft).run(0xc000dee2c0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:1416 +0xed
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2839 +0xc5

derekcollison commented 2 years ago

Any other details on what led up to the first crash? This is a forced panic in our code due to state corruption. Were you killing terminals on Windows in this test as well?

carr123 commented 2 years ago

This is on Ubuntu 18.04, a 3-node cluster on 3 machines. I was testing MQTT connection pressure and didn't close any node on purpose before the panic happened. I am not sure what led to the first crash; maybe MQTT connections were made too frequently, or maybe the Raft state got messed up. The crashed node had received about 5000 TCP connections before it crashed. Then another node failed to accept more TCP connections, although it didn't crash. The last node continued to run well; it accepted 15K connections before I stopped the test.


derekcollison commented 2 years ago

And server version 2.4.0 correct?

carr123 commented 2 years ago

Yes, Linux amd64, server version 2.4.0.


derekcollison commented 2 years ago

Thanks.

carr123 commented 2 years ago

Maybe after the last issue I reported is fixed, this one could disappear as well. I will test it again after the new server version comes out. Thanks for your efforts.


carr123 commented 2 years ago

@derekcollison I left the office with no testing clients running. 12 hours later, I came back and one node had crashed. This time the panic log is different; it's worth a look. You can download it from http://101.200.84.208/nohup.txt. This download link expires in 3 days.

derekcollison commented 2 years ago

Seems the system you are running on has a hard limit of 1000 threads; that is shown at the top of the log. What system is this on?

derekcollison commented 2 years ago

How many consumers were you trying to create?

derekcollison commented 2 years ago

And how many streams? Seems like quite a large number of both...

derekcollison commented 2 years ago

What does the nats cli report via the following?

nats stream report

kozlovic commented 2 years ago

@derekcollison If the question is for @carr123, the report says that many MQTT connections were created. As you know, each MQTT session will have its own stream (max msgs of 1), so that may explain the number of streams that you are seeing?

carr123 commented 2 years ago

Ubuntu 18.04, amd64. Not 1000 but 10,000 threads. (Are these OS threads? If so, that seems like too many for a single process.)

I tested MQTT connections 12 hours ago. There are 5327 folders under the jetstream directory; it seems to be one folder per connection. Each folder is named like $MQTT_sess_zlNVCJTT.

nats stream report (this command was executed on a live node, not the crashed one; server227 crashed) results: http://101.200.84.208/stream.txt

carr123 commented 2 years ago

I restarted the crashed node server227 and found its thread count grows slowly from less than 10 to near 200, and continues to grow.

http://101.200.84.208/threads.png

carr123 commented 2 years ago

server227 threads grew up to 428 and stopped growing, but then server226 began to crash. I noticed this yesterday: server226 and server227 take turns crashing. Another node, server225, is always healthy.

server226 panic log: panic: [server226-S-R2F-TM5kP7eZ] Placed an entry at the wrong index, ae is &{leader:1txmfIKU term:182 commit:0 pterm:179 pindex:2 entries: 1}, seq is 5, n.pindex is 2

goroutine 21454 [running]:
github.com/nats-io/nats-server/server.(*raft).storeToWAL(0xc00a48d080, 0xc36268c080, 0xf, 0xc36e659fa0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2774 +0x357
github.com/nats-io/nats-server/server.(*raft).processAppendEntry(0xc00a48d080, 0xc36268c080, 0xc16002fbc0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:2636 +0x8c5
github.com/nats-io/nats-server/server.(*raft).runAsFollower(0xc00a48d080)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:1464 +0x1ce
github.com/nats-io/nats-server/server.(*raft).run(0xc00a48d080)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/raft.go:1416 +0xed
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2839 +0xc5

Hope those logs can give you some clues.

carr123 commented 2 years ago

After server226 crashed, I restarted it and watched its threads.
The thread count grew up to 2084 on server226, then server227 crashed again. This was captured on server226: http://101.200.84.208/thread226.png. Maybe you should restrict the number of threads one process can create.

derekcollison commented 2 years ago

Your cluster and the machines you are running do not have enough resources currently to run that many MQTT connections.

We have plans to mux streams for MQTT persistence, but until then you need larger machines.

carr123 commented 2 years ago

Hi, I just downloaded the nats-server main branch and built my own binary. nats-server: Ubuntu 18.04, 3-node cluster (server225, server226, server227), 8 GB RAM and 4 cores for each node. MQTT client: Windows. I deleted the old jetstream data to make a clean cluster. I didn't kill any nodes on purpose throughout the following test.

client code like this:

    func PressureTest() {
        for i := 0; i < 10000; i++ {
            idx := rand.Intn(3)
            if idx == 0 {
                go connectNode1()
            } else if idx == 1 {
                go connectNode2()
            } else {
                go connectNode3()
            }
            time.Sleep(time.Millisecond * 100)
            if i%200 == 0 {
                log.Println(i, " connects")
                time.Sleep(time.Second)
            }
        }
        time.Sleep(time.Hour * 100)
    }

When the test started, I watched the thread count on each node; they were reasonable (12 threads, 11 threads, 12 threads). As the connection count increased, the thread count increased slowly. When there were 2861 connections on node1, the thread counts were 20, 25, 19 on each node, which still looks reasonable.

Suddenly, server225 crashed, then the thread count on server226 increased to 854 and the thread count on server227 increased to 153. So I guess some code path on server226 and server227 causes the thread count to jump too high (if no node crashes, they stay low).

server225 crash log: http://101.200.84.208/nats_nohup.txt. From the first line in the log, I see "SIGQUIT: quit"; I did not send this signal manually, I was just watching the thread count. There are 431136 goroutines shown in the log, which seems too many for <5000 client connections.

derekcollison commented 2 years ago

Currently each MQTT connection has its own stream, which we can now update and improve by using the K/V functionality of JetStream to mux 1 stream for all MQTT clients on a server. This work is being prioritized.

carr123 commented 2 years ago

Great, I won't test against MQTT until it's improved.

derekcollison commented 2 years ago

Will keep you posted. Thanks.

carr123 commented 2 years ago

@derekcollison Hello, first, congratulations on the nats-server v2.5.0 release. I tested MQTT against nats-server 2.5.0 linux-amd64 (I built my own binary in order to add pprof). I made 1000 MQTT client connections.

  1. I notice there are 1001 subdirectories under $MQTT_msgs/obs/. I wonder, if I made 100K connections, whether the folder count would exceed Linux limits.
  2. I also notice there are too many (7046) goroutines; maybe they cost too much RAM. The file writer and Raft logic take many goroutines. I'm not sure if this code could be optimized.
  3. Sometimes client connections fail ("Connecting to tcp://10.91.26.226:1883 CONNACK was not CONN_ACCEPTED, but rather Connection Error"), yet the subdirectory count under $MQTT_msgs/obs/ still increases, and the goroutine count increases accordingly. These folders are never deleted, even after a server restart.

You mentioned "mux 1 stream for all MQTT clients on a server", and the sessions are now indeed merged into a single $MQTT_sess folder. I wonder whether the obs subfolders could be merged as well, so goroutines and memory usage are saved.

thanks.

Below are logs from http/pprof: https://github.com/carr123/natsjsmdemo/blob/main/goroutine.txt

derekcollison commented 2 years ago

@carr123, @kozlovic did most of the work here so will let him chime in regarding architecture.

Go routines are designed to be very lightweight, so I'm not overly concerned about using them for dedicated I/O. IIRC the stack memory footprint is 8k per goroutine.

carr123 commented 2 years ago

Well, a connection costing N goroutines as a design philosophy is OK. What I saw is that 4000 MQTT connections (no data sent yet) make one node use 8 GB RAM; that's too expensive, I think. Even after the 4000 clients disconnected and the node restarted, NATS still takes that much memory. Another thing I saw is that there may be a goroutine leak: when one node crashes, another node creates 50000 goroutines before it panics and exits.


derekcollison commented 2 years ago

Let's see what @kozlovic has to say next week.

kozlovic commented 2 years ago

@carr123 Are those connections each creating a subscription? That is the only reason I can think of that you would see a file per connection under $MQTT_msgs/obs. If your connections are not created with the "clean session" flag, the server has to maintain the state when shutting down, so it is normal that they persist.
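
For illustration, the "clean session" flag is set on the client side. Assuming the Eclipse Paho Go client (the thread doesn't say which client library was used), a session that the server does not have to persist would be requested roughly like this:

    package main

    import (
        mqtt "github.com/eclipse/paho.mqtt.golang"
    )

    func main() {
        opts := mqtt.NewClientOptions().
            AddBroker("tcp://10.91.26.226:1883"). // broker address taken from the report above
            SetClientID("c1").
            SetCleanSession(true) // server may discard session state on disconnect

        client := mqtt.NewClient(opts)
        if token := client.Connect(); token.Wait() && token.Error() != nil {
            panic(token.Error())
        }
        defer client.Disconnect(250) // wait up to 250 ms for in-flight work
    }

With clean_session=false, as in the tests above, the server is required to keep the subscription state across disconnects, which is why those folders persist.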

The reason for so many go routines and so much memory usage is that we have 2 go routines per connection (the read loop and write loop, regardless of MQTT client or standard NATS), and then you have go routines for the RAFT-related things for consumer leaders, etc. Each consumer (that is, one per MQTT subscription) will have a raft group associated with it. Each group creates many things, including these kinds of objects:

        propc:    make(chan *Entry, 8192),
        entryc:   make(chan *appendEntry, 32768),
        respc:    make(chan *appendEntryResponse, 32768),
        applyc:   make(chan *CommittedEntry, 32768),

and maybe there are some buffers for each file store (where messages/consumers/etc. are kept), which can explain the memory usage. When running in standalone mode, I don't see a "lot" of memory usage for 4,000 connections with 1 sub each. In clustered mode, much more, but creating those 4,000 connections is very slow; I am only at 2,200 as I type and each server is already at 2.5 GB.
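
As a rough, editorial back-of-the-envelope estimate based only on the buffer sizes quoted above: a buffered channel allocates its ring buffer up front, and these are channels of pointers (8 bytes per slot on amd64), so each raft group starts with several hundred KB of channel buffers before anything else is counted:

    package main

    import "fmt"

    func main() {
        // Slot counts from the snippet above: propc + entryc + respc + applyc.
        slots := 8192 + 32768 + 32768 + 32768
        perGroup := slots * 8 // bytes, pointer-sized slots on amd64
        groups := 4000        // e.g. one consumer raft group per MQTT subscription
        fmt.Printf("per raft group: ~%d KB, for %d groups: ~%d MB\n",
            perGroup/1024, groups, perGroup*groups/(1024*1024))
        // per raft group: ~832 KB, for 4000 groups: ~3250 MB of channel slots
        // alone, before counting goroutine stacks or file store buffers.
    }

That alone is in the same ballpark as the multi-GB usage reported above for a few thousand MQTT connections.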

I will defer to @derekcollison for that.

carr123 commented 2 years ago

Thanks for your detailed description of the NATS design. I created MQTT connections in a loop; each connection has a fixed client_id (c1, c2, c3, ...), subscribes to 1 topic, and uses clean_session=false, and they connect evenly to a 3-node cluster. That is emulating a real-world scenario. When I said 1000 connections, I meant 1000 for each node. I haven't tested on a standalone server yet; next week I will have a try. I agree with you that there might be buffers that occupy a lot of memory; if they were optimized, then a single node could sustain more
connections. I expect one node could hold 100K clients with light data traffic; that would be nice. If you want to reproduce the other issues I mentioned above, just continue making more connections until one node crashes, then watch what happens on the other nodes.