Closed EmelyanenkoK closed 3 years ago
This issue affected two of my Validators with two days between the incidents. In both cases the error was raised in logs, validator-engine
has crashed and on restart it would initially spin up and proceed to open database but -again- drop Signal: 6
error in main log file and exit/crash within couple of seconds. Thus the node was completely offline.
In both cases nodes where actively validating during the accident.
In both cases hardware was not overloaded (at least 128gb of memory, 48 or more CPU cores and NVMe based file systems with hundreds of free gigabytes).
In both cases the only way to resolve the issue was to:
db/config.json
as well as db/keyring
into new directory (this is important since your machine probably has new keys in place.validator-engine
Engine will start to resync and will return online to validation duties.
Signal: 6
is not the reason for the crash, it is the symptom, something happens before that and corrupts the database.
I will include log files from both machines for the moment of the first crash as well as 2 seconds before that.
I will add that I got those problems on non-validating full nodes as well.
From time to time nodes both on TCF testnet (earlier) and on Newton testnet crash with signal 6. After that nor reboot, nor deleting recent files helps. The only way to join network again is sync from scratch, so far. @sonofmom please share your observation and ideas related to this bug here.