harmony-one / harmony

The core protocol of harmony
https://harmony.one
GNU Lesser General Public License v3.0

Node(v4.3.1) crashed with "panic: runtime error: invalid memory address or nil pointer dereference" #3955

Closed. fish2plain closed this issue 1 year ago.

fish2plain commented 2 years ago

Describe the bug

A Harmony node on shard 0 crashed a few hours after being upgraded to v4.3.1.

ubuntu@harmony-s0-lax:~$ ./harmony version
Harmony (C) 2020. harmony, version v7211-v4.3.1-0-g65614950 (runner@ 2021-11-27T05:27:53+0000)

To Reproduce

Not reproducible so far.

Expected behavior

The node remains stable.

Screenshots

Stack trace:

Staking mode; node key 
...

 -> shard 0
Started RPC server at: 127.0.0.1:9500
Started Auth-RPC server at: 127.0.0.1:9501
Started WS server at: 127.0.0.1:9800
Started Auth-WS server at: 127.0.0.1:9801

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11c1625]

goroutine 6210888 [running]:
github.com/harmony-one/harmony/core/types.(*Block).Epoch(...)
        /home/runner/work/harmony/harmony/harmony/core/types/block.go:491
github.com/harmony-one/harmony/consensus.(*Consensus).sendCommitMessages(0xc000280780, 0x0)
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:168 +0x55
github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared(0xc000280780, 0xc02ebb0380)
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:263 +0x534
github.com/harmony-one/harmony/consensus.(*Consensus).HandleMessageUpdate(0xc000280780, 0x20e4220, 0xc0280bafc0, 0xc0280baf60, 0xc0cae48f30, 0xc0001962d0, 0xc0001962d0)
        /home/runner/work/harmony/harmony/harmony/consensus/consensus_v2.go:112 +0x3a0
github.com/harmony-one/harmony/node.(*Node).StartPubSub.func2.1(0xc055a9c780, 0xc016b1c8c0, 0xc0280baf00, 0xc00021ac00, 0x20e4220, 0xc0280bafc0, 0x1, 0xc055a9c760, 0xc0280baf60, 0x0, ...)
        /home/runner/work/harmony/harmony/harmony/node/node.go:816 +0x4ae
created by github.com/harmony-one/harmony/node.(*Node).StartPubSub.func2
        /home/runner/work/harmony/harmony/harmony/node/node.go:803 +0x1d8
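
For readers less familiar with this class of panic: the `0x0` argument to `sendCommitMessages` in the trace suggests that a nil `*Block` was passed in and then dereferenced by `Epoch()`. The sketch below is not Harmony's code. `Block`, `header`, `errNilBlock`, and `epochOrError` are hypothetical stand-ins that show how a method call on a nil pointer panics and how a guard turns that into an ordinary error.

```go
package main

import (
	"errors"
	"fmt"
)

// header and Block are hypothetical stand-ins, not Harmony's real types.
type header struct {
	epoch uint64
}

type Block struct {
	header *header
}

// Epoch reads a field through the receiver, so calling it on a nil *Block
// dereferences a nil pointer and panics.
func (b *Block) Epoch() uint64 {
	return b.header.epoch
}

var errNilBlock = errors.New("block is nil")

// epochOrError checks the pointer before using it and returns an error
// instead of panicking.
func epochOrError(b *Block) (uint64, error) {
	if b == nil || b.header == nil {
		return 0, errNilBlock
	}
	return b.Epoch(), nil
}

func main() {
	var missing *Block // nil, e.g. a block that has not been received yet
	if _, err := epochOrError(missing); err != nil {
		fmt.Println("skipping commit:", err)
	}

	present := &Block{header: &header{epoch: 7}}
	epoch, _ := epochOrError(present)
	fmt.Println("epoch:", epoch)
}
```

Whether the right response to a nil block is to skip, retry, or log and return depends on the consensus logic; the sketch only shows the guard itself.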

Environment (please complete the following information):

Additional context

fish2plain commented 2 years ago

On a different shard 0 node, I got a similar error, but the stack trace points to a different line.

I won't be trying the test binary on this node, though. I ran the test binary on another node, and it fell ~10K blocks behind after a restart.

Started WS server at: 127.0.0.1:9800
Started Auth-WS server at: 127.0.0.1:9801
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd1ad02]

goroutine 3724389 [running]:
github.com/harmony-one/harmony/core/types.(*Block).NumberU64(0x0, 0xc081bba000)
        /home/runner/work/harmony/harmony/harmony/core/types/block.go:482 +0x22
github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared.func1(0xc000fae000, 0x0)
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:273 +0x54
created by github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:270 +0x591
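
This second trace panics inside a goroutine started by `onPrepared` (`onPrepared.func1`). An unrecovered panic in any goroutine terminates the whole Go process, which is consistent with the node going down. Below is a minimal sketch, assuming hypothetical names (`Block`, `processAsync`), of two defensive options: check the pointer before starting the goroutine, and recover inside it. It illustrates the failure mode only and is not the fix applied upstream.

```go
package main

import (
	"fmt"
	"sync"
)

// Block is a hypothetical stand-in for a block type.
type Block struct {
	number uint64
}

// NumberU64 dereferences the receiver and panics if b is nil.
func (b *Block) NumberU64() uint64 {
	return b.number
}

// processAsync runs work on blk in a goroutine and reports the outcome.
// It returns false if blk was nil or the goroutine recovered from a panic.
func processAsync(blk *Block) (ok bool) {
	if blk == nil {
		// Option 1: refuse to start the goroutine with a nil block.
		return false
	}
	var wg sync.WaitGroup
	wg.Add(1)
	ok = true
	go func() {
		defer wg.Done()
		// Option 2: a recover in the goroutine keeps an unexpected panic
		// from taking down the whole process.
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered:", r)
				ok = false
			}
		}()
		fmt.Println("processing block", blk.NumberU64())
	}()
	wg.Wait()
	return ok
}

func main() {
	fmt.Println(processAsync(nil))
	fmt.Println(processAsync(&Block{number: 42}))
}
```

A recover keeps the process alive but can hide a real logic error, so the up-front nil check is usually the more important of the two.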

gsampathkumar commented 2 years ago

Same as @fish2plain. I ran the test binary for a few hours, and it slowed to a crawl with OUT OF SYNC messages and fell significantly behind.

gsampathkumar commented 2 years ago

Reverting to version 4.3.0 seems to have fixed the issue for now.

sophoah commented 2 years ago

@gsampathkumar could you confirm which testnet binary version you tried? The latest binary now has another commit that helps with sync speed. And just to confirm, were you still experiencing the panic while using the testnet binary?

gsampathkumar commented 2 years ago

@sophoah We did not encounter the panic issue using the testnet binary. Only the slow sync.

I will use the latest testnet binary on one of our nodes and test if the slow sync issue gets solved. Will keep this thread posted.

gsampathkumar commented 2 years ago

Running one node with:

root@HarmonySecondary:/mnt/volume_sfo3_03# ./harmony -V
Harmony (C) 2020. harmony, version v7214-v4.3.1-3-g4c9546a4 (jenkins@ 2021-12-19T13:56:15+0000)

It's currently caught up, though, so I'm not sure it will exercise the sync path to test whether the slow sync issue has been fixed. Let me know if I should let it fall behind for 1-2 hours and then have it try to catch up.

sophoah commented 2 years ago

@gsampathkumar no need to force it out of sync. I've installed the same code on most of our internal nodes today. This may eventually become a new release in January.

lcgogo commented 2 years ago

Same issue here.

Started RPC server at: 0.0.0.0:62075
Started Auth-RPC server at: 0.0.0.0:9501
Started WS server at: 127.0.0.1:9800
Started Auth-WS server at: 127.0.0.1:9801
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd1ad02]

goroutine 1901533 [running]:
github.com/harmony-one/harmony/core/types.(*Block).NumberU64(0x0, 0xc01934b750)
        /home/runner/work/harmony/harmony/harmony/core/types/block.go:482 +0x22
github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared.func1(0xc000dd2000, 0x0)
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:273 +0x54
created by github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:270 +0x591

./harmony --version
Harmony (C) 2020. harmony, version v7211-v4.3.1-0-g65614950 (runner@ 2021-11-27T05:27:53+0000)

lcgogo commented 2 years ago

@sophoah I hit the issue again. Downgraded to 4.3.0 for now.

sophoah commented 2 years ago

@rlan35 any idea? It seems this still happens on some nodes, and on validator nodes, not only explorer nodes.

staking4all commented 2 years ago

Hi

This issue happens at epoch change. All my nodes went down yesterday at epoch change. It happened while I was sleeping, and I woke up to a whole bunch of monitoring alerts. I have been unelected due to this bug.

It happened again today, on all nodes, at the epoch changeover. I'm now configuring the service to restart automatically so it doesn't get me unelected again.

Thanks

OleFass commented 2 years ago

I have had a node running on Ubuntu 20.04 with default configs for 3 weeks. The hardware exceeds the requirements several times over, and nothing else runs on the server. Still, the same issue hits my node roughly once a day:

Started RPC server at: 127.0.0.1:9500
Started Auth-RPC server at: 127.0.0.1:9501
Started WS server at: 127.0.0.1:9800
Started Auth-WS server at: 127.0.0.1:9801
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd1ad02]

goroutine 4070621 [running]:
github.com/harmony-one/harmony/core/types.(*Block).NumberU64(0x0, 0xc00e804f60)
        /home/runner/work/harmony/harmony/harmony/core/types/block.go:482 +0x22
github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared.func1(0xc00013e500, 0x0)
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:273 +0x54
created by github.com/harmony-one/harmony/consensus.(*Consensus).onPrepared
        /home/runner/work/harmony/harmony/harmony/consensus/validator.go:270 +0x591

Downgraded to 4.3.0 for now.

staking4all commented 2 years ago

Hi

Just an update

The node kept giving the same error at the epoch changeover.

So I switched all my nodes to the testnet version. Since then, no more crashes.

Thanks.

zmyya commented 1 year ago

I hit the same problem again on version v8126-v2023.2.7-0-g1b9614ba.