panic in handleHeartbeatMessage

bigwhite commented 2 years ago

Note: for reported bugs, please fill in the following details. bug reports without detailed steps on how to reproduce will be automatically closed.

Dragonboat version

v3.3.5

Expected behavior

no panic. run well

Actual behavior

panic.

panic log is below:

panic: invalid commitTo index 511330, lastIndex() 52431
goroutine 581 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00047e240, {0x0, 0x0, 0x0})
    /home/tonybai/go/pkg/mod/go.uber.org/zap@v1.19.1/zapcore/entry.go:232 +0x44c
go.uber.org/zap.(*SugaredLogger).log(0xc000bd9de0, 0x4, {0x15e06c8?, 0x0?}, {0xc000612840?, 0x0?, 0xc000bd9df0?}, {0x0, 0x0, 0x0})
    /home/tonybai/go/pkg/mod/go.uber.org/zap@v1.19.1/sugar.go:227 +0xee
go.uber.org/zap.(*SugaredLogger).Panicf(...)
    /home/tonybai/go/pkg/mod/go.uber.org/zap@v1.19.1/sugar.go:159
reachauto.com/go/mqtt-gateway/common/log.(*Logger).Panicf(0xc0004056b0?, {0x15e06c8?, 0x40b1a5?}, {0xc000612840?, 0x1354ba0?, 0x0?})
    /home/tonybai/go/pkg/mod/reachauto.com/go/mqtt-gateway/common@v0.0.0-20220727093530-72844931ee90/log/log.go:190 +0xc5
github.com/lni/dragonboat/v3/logger.(*dragonboatLogger).Panicf(0xcccf?, {0x15e06c8, 0x29}, {0xc000612840, 0x2, 0x2})
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/logger/logger.go:132 +0x57
github.com/lni/dragonboat/v3/internal/raft.(*entryLog).commitTo(0xc000535500, 0x7cd62)
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/logentry.go:328 +0xfe
github.com/lni/dragonboat/v3/internal/raft.(*raft).handleHeartbeatMessage(_, {0x11, 0x1, 0x2, 0x66, 0x2, 0x0, 0x0, 0x7cd62, 0x0, ...})
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:1317 +0x45
github.com/lni/dragonboat/v3/internal/raft.(*raft).handleFollowerHeartbeat(_, {0x11, 0x1, 0x2, 0x66, 0x2, 0x0, 0x0, 0x7cd62, 0x0, ...})
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:1933 +0x85
github.com/lni/dragonboat/v3/internal/raft.defaultHandle(_, {0x11, 0x1, 0x2, 0x66, 0x2, 0x0, 0x0, 0x7cd62, 0x0, ...})
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:2098 +0x95
github.com/lni/dragonboat/v3/internal/raft.(*raft).Handle(_, {0x11, 0x1, 0x2, 0x66, 0x2, 0x0, 0x0, 0x7cd62, 0x0, ...})
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:1483 +0x275
github.com/lni/dragonboat/v3/internal/raft.(*Peer).Handle(_, {0x11, 0x1, 0x2, 0x66, 0x2, 0x0, 0x0, 0x7cd62, 0x0, ...})
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/peer.go:195 +0x185
github.com/lni/dragonboat/v3.(*node).handleReceivedMessages(0xc0003c0600)
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/node.go:1275 +0x358
github.com/lni/dragonboat/v3.(*node).handleEvents(0xc0003c0600)
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/node.go:1133 +0x73
github.com/lni/dragonboat/v3.(*node).stepNode(_)
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/node.go:1111 +0x115
github.com/lni/dragonboat/v3.(*engine).processSteps(0xc0000bfa40, 0xc000bdbda8?, 0xc0064ede38?, 0xc000117680, {0x22b94d8, 0x1?, 0x0}, 0xc000351ce0?)
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/engine.go:1279 +0x25d
github.com/lni/dragonboat/v3.(*engine).stepWorkerMain(0xc0000bfa40, 0x7)
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/engine.go:1215 +0x2be
github.com/lni/dragonboat/v3.newExecEngine.func1()
    /home/tonybai/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/engine.go:1017 +0x68
github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1()
    /home/tonybai/go/pkg/mod/github.com/lni/goutils@v1.3.0/syncutil/stopper.go:79 +0xc5
created by github.com/lni/goutils/syncutil.(*Stopper).runWorker
    /home/tonybai/go/pkg/mod/github.com/lni/goutils@v1.3.0/syncutil/stopper.go:74 +0xea

Steps to reproduce the behavior

the panic is not always present. we have met 2 times.

this time, we did some load test. we store msg in 2000 tps/second to the three-node raft cluster. after sometime, one of the cluster node panic.

lni commented 2 years ago

@bigwhite please provide more details on the setup of your above mentioned run that triggered the panic. for example, did you stop/restart any node in between, did you do anything else? Providing the full log will also help.

From the above limited log, which is totally not enough for any analysis, basically it shows that the heartbeat message claims that the committed index is at least 511330 while your crashed node only has log up to index 52431 only. Note that it happened after your node explicitly acked to the leader that it at least has index up to 511330. See code linked below.

https://github.com/lni/dragonboat/blob/2da3403da76f54ff5466db5c07434e1e361582ec/internal/raft/raft.go#L817

I can see that you are using a zap wrapper as your logger, did you change anything in dragonboat? What is your state machine type? In memory or on disk? You basically need to figure out why those raft log between index 52431 and 511330 disappeared.

bigwhite commented 2 years ago

Unfortunately, some of the logs have been deleted. But as far as I know, we did no operation to the cluster except store msg to cluster.

I did not change anything in dragonboat and I do use zap wrapper:). our state machine is in memory.

thanks for your reply. I am going to do some investigation. I will supply more log detail if the problem occur next time.

bigwhite commented 2 years ago

I found the log below in another node which is not panic:

{"level":"error","logtime":"2022-09-14T15:32:26+08:00","caller":"plugin/plugin.go:69","msg":"hook panic","service":"subscriber","ip":"192.168.19.161","mountPoint":"topic_created","param":{"topic":"g5/zDA/i167440926"},"panic":"runtime error: invalid memory address or nil pointer dereference"}
2022-09-14 15:32:54.377982 W | logdb: %!s(uint64=1052430) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.441198 W | logdb: %!s(uint64=1104858) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.494555 W | logdb: %!s(uint64=1157286) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.667728 W | logdb: %!s(uint64=1209714) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.866846 W | logdb: %!s(uint64=1262142) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:55.066838 W | logdb: %!s(uint64=1314570) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:55.267871 W | logdb: %!s(uint64=1366998) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:55.467373 W | logdb: %!s(uint64=1419426) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:33:29.892237 W | dragonboat: StaleRead called, linearizability not guaranteed for stale read
2022-09-14 15:33:33.964811 I | dragonboat: LogDB info received, shard 5, busy false
2022-09-14 15:33:33.995854 I | dragonboat: LogDB info received, shard 6, busy false
2022-09-14 15:33:34.000783 I | dragonboat: LogDB info received, shard 7, busy false

Could this log be related to the problem?

lni commented 2 years ago

There is something very strange - basically you are saying that you just have 3 nodes running without interruption, somehow one of the node will have some of its persisted & acknowledged log entries lost without restarting. Given such most basic feature has been battle tested for years deployed in dozens of projects, I just couldn't understand how this is possible.

for the above new log you provided, it does include info showing an invalid memory address or nil pointer dereference error, why the node didn't crash immediately after that?

lni commented 2 years ago

Please re-run your program with logging verbosity level set to DEBUG, if you manage to get the panic again, please provide the full log of your program so we can look into it.

lni / dragonboat