Closed: bigwhite closed this issue 2 years ago
@bigwhite please provide more details on the setup of the run that triggered the panic. For example, did you stop or restart any node in between, or do anything else? Providing the full log would also help.
From the above limited log, which is not enough for a full analysis, we can see that the heartbeat message claims the committed index is at least 511330, while your crashed node only has log entries up to index 52431. Note that this happened after your node explicitly acked to the leader that it has entries at least up to index 511330. See the code linked below.
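To make the violated invariant concrete: a follower must never be told to commit beyond its own last persisted log index, because it previously acknowledged those entries as durable. The sketch below is hypothetical and does not match dragonboat's internal names; it only illustrates why a commit index of 511330 against a last index of 52431 can only mean acknowledged entries went missing.

```go
package main

import "fmt"

// checkCommit mimics the kind of sanity check that panicked: the
// leader's reported commit index must not exceed this node's last
// persisted log index. Names here are illustrative only.
func checkCommit(commit, lastIndex uint64) error {
	if commit > lastIndex {
		return fmt.Errorf("commit index %d out of range, last index is %d",
			commit, lastIndex)
	}
	return nil
}

func main() {
	// Values from the reported panic: entries 52431..511330 are missing.
	if err := checkCommit(511330, 52431); err != nil {
		fmt.Println(err)
	}
}
```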
I can see that you are using a zap wrapper as your logger. Did you change anything in dragonboat? What is your state machine type, in-memory or on-disk? You basically need to figure out why the raft log entries between index 52431 and 511330 disappeared.
Unfortunately, some of the logs have been deleted. But as far as I know, we performed no operations on the cluster other than storing messages to it.
I did not change anything in dragonboat, and I do use the zap wrapper :). Our state machine is in-memory.
Thanks for your reply. I am going to do some investigation and will supply more detailed logs if the problem occurs next time.
I found the log below on another node, one that did not panic:
{"level":"error","logtime":"2022-09-14T15:32:26+08:00","caller":"plugin/plugin.go:69","msg":"hook panic","service":"subscriber","ip":"192.168.19.161","mountPoint":"topic_created","param":{"topic":"g5/zDA/i167440926"},"panic":"runtime error: invalid memory address or nil pointer dereference"}
2022-09-14 15:32:54.377982 W | logdb: %!s(uint64=1052430) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.441198 W | logdb: %!s(uint64=1104858) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.494555 W | logdb: %!s(uint64=1157286) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.667728 W | logdb: %!s(uint64=1209714) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:54.866846 W | logdb: %!s(uint64=1262142) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:55.066838 W | logdb: %!s(uint64=1314570) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:55.267871 W | logdb: %!s(uint64=1366998) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:32:55.467373 W | logdb: %!s(uint64=1419426) limited high to %!d(MISSING) in logReader.entriesLocked
2022-09-14 15:33:29.892237 W | dragonboat: StaleRead called, linearizability not guaranteed for stale read
2022-09-14 15:33:33.964811 I | dragonboat: LogDB info received, shard 5, busy false
2022-09-14 15:33:33.995854 I | dragonboat: LogDB info received, shard 6, busy false
2022-09-14 15:33:34.000783 I | dragonboat: LogDB info received, shard 7, busy false
Could this log be related to the problem?
There is something very strange here: basically you are saying that you have 3 nodes running without interruption, yet somehow one of the nodes lost some of its persisted & acknowledged log entries without ever restarting. Given that such a basic feature has been battle tested for years and deployed in dozens of projects, I just can't understand how this is possible.
Regarding the new log you provided, it does show an invalid memory address or nil pointer dereference error. Why didn't the node crash immediately after that?
Please re-run your program with the logging verbosity level set to DEBUG. If you manage to trigger the panic again, please provide the full log of your program so we can look into it.
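For reference, verbosity can be raised through dragonboat's `logger` package before the NodeHost is created. This is a sketch; the package names passed to `GetLogger` below are the ones dragonboat registers internally, so adjust them if your version differs.

```go
package main

import (
	"github.com/lni/dragonboat/v3/logger"
)

func main() {
	// Raise log verbosity for the relevant subsystems before
	// starting the NodeHost, so raft and logdb activity around
	// the panic is captured in full.
	logger.GetLogger("dragonboat").SetLevel(logger.DEBUG)
	logger.GetLogger("raft").SetLevel(logger.DEBUG)
	logger.GetLogger("rsm").SetLevel(logger.DEBUG)
	logger.GetLogger("transport").SetLevel(logger.DEBUG)
	logger.GetLogger("logdb").SetLevel(logger.DEBUG)

	// ... create the NodeHost and start clusters as usual ...
}
```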
Dragonboat version
v3.3.5
Expected behavior
No panic; the cluster runs well.
Actual behavior
Panic.
The panic log is below:
Steps to reproduce the behavior
The panic does not always occur; we have hit it twice so far.
This time, we were running a load test, storing messages at 2000 TPS into the three-node raft cluster. After some time, one of the cluster nodes panicked.