ton-blockchain / TIPs

Improvement Proposal repository

Spontaneous irreversible node crash with Signal 6 #10

Closed EmelyanenkoK closed 3 years ago

EmelyanenkoK commented 4 years ago

From time to time, nodes on both the TCF testnet (earlier) and the Newton testnet crash with signal 6. After that, neither a reboot nor deleting recent files helps. So far, the only way to rejoin the network is to sync from scratch. @sonofmom please share your observations and ideas related to this bug here.

sonofmom commented 4 years ago

This issue affected two of my validators, with two days between the incidents. In both cases the error was raised in the logs and validator-engine crashed. On restart it would initially spin up and begin opening the database, but then drop the Signal: 6 error into the main log file again and exit within a couple of seconds. The node was therefore completely offline.

In both cases the nodes were actively validating at the time of the incident.

In both cases the hardware was not overloaded (at least 128 GB of memory, 48 or more CPU cores, and NVMe-based file systems with hundreds of gigabytes free).

In both cases the only way to resolve the issue was to:

The engine will then start to resync and will return online to its validation duties.

Speculation on reasons

Signal: 6 is not the cause of the crash but a symptom: something happens before it and corrupts the database.
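For context, signal 6 on Linux is SIGABRT, which a process normally sends to itself through abort(), typically after a failed assertion or an uncaught exception. That fits the reading that it is a symptom rather than a cause. The following is a minimal Python sketch of how such a self-inflicted abort looks from the outside. The child process here is a hypothetical stand-in and not validator-engine, and `run_and_report` is a name made up for this illustration.

```python
import signal
import subprocess
import sys


def run_and_report(argv):
    """Run a child process and describe how it ended (POSIX only)."""
    proc = subprocess.run(argv)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was
        # terminated by a signal; the signal number is its absolute value.
        signum = -proc.returncode
        return f"killed by signal {signum} ({signal.Signals(signum).name})"
    return f"exited with status {proc.returncode}"


# Hypothetical stand-in for a program that detects an unrecoverable
# internal error and calls abort() on itself.
child = [sys.executable, "-c", "import os; os.abort()"]
print(run_and_report(child))
```

On Linux the child is reported as killed by signal 6. The abort only records that the program gave up; whatever made it give up has to be found in the log lines written before it.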

I will include log files from both machines covering the moment of the first crash as well as the 2 seconds before it.

EmelyanenkoK commented 4 years ago

I will add that I have seen these problems on non-validating full nodes as well.

sonofmom commented 4 years ago

node2_crashlogs_signal6.log node1_crashlogs_signal6.log