Open rodenvk opened 2 months ago
Manual actions performed on validators: at 20:33 UTC the mainnet identity was switched over to the backup node GREEN-HOUSE-2. Our validator was delinquent for 3 min.
Unfortunately, no out-of-band access was available to turn off the identity on GREEN-HOUSE-1.
At 20:51 UTC both the main and backup nodes were advertising the same identity, which led to a crash of both nodes.
On GREEN-HOUSE-1:
[2024-08-15T20:51:22.949179461Z ERROR solana_gossip::cluster_info] duplicate running instances of the same validator node: DtY5Bzxd75iWQRvKwM2xLUxqwLT1RRoeNwmVvgS2JANA
[2024-08-15T20:51:22.950669791Z WARN solana_core::proxy::fetch_stage_manager] packet intercept receiver disconnected, shutting down
On GREEN-HOUSE-2:
[2024-08-15T20:57:58.022046649Z ERROR solana_gossip::cluster_info] duplicate running instances of the same validator node: DtY5Bzxd75iWQRvKwM2xLUxqwLT1RRoeNwmVvgS2JANA
[2024-08-15T20:57:58.023648199Z WARN solana_core::proxy::fetch_stage_manager] packet intercept receiver disconnected, shutting down
Both nodes had to be restarted. The restart took longer than expected because a fresh snapshot had to be downloaded from Solana directly and the download was interrupted repeatedly. Several attempts were required, leading to over 50 min of additional outage.
An improvement will be proposed: perform cluster health checks locally on each node, so that the main and backup nodes cannot both end up serving our mainnet identity.
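As a rough illustration of the local check proposed above, here is a minimal sketch. It assumes the node list has already been fetched with the `getClusterNodes` JSON-RPC method, whose entries carry `pubkey` and `gossip` fields. The function names and the `local_ips` parameter are hypothetical and not part of any Solana tooling. Gossip entries can linger for a while after a node stops, so this check alone is probably not sufficient and would need to be combined with other signals.

```python
from typing import Iterable, List, Mapping, Optional, Set


def identity_advertised_elsewhere(
    cluster_nodes: Iterable[Mapping[str, Optional[str]]],
    identity: str,
    local_ips: Set[str],
) -> List[str]:
    """Return gossip addresses that advertise `identity` from a non-local IP.

    `cluster_nodes` is assumed to be the `result` array of a
    `getClusterNodes` JSON-RPC response (hypothetical caller fetches it).
    """
    others = []
    for node in cluster_nodes:
        if node.get("pubkey") != identity:
            continue
        gossip = node.get("gossip")
        if not gossip:
            # No gossip address published for this entry; nothing to compare.
            continue
        host = gossip.rsplit(":", 1)[0]  # strip the port from "ip:port"
        if host not in local_ips:
            others.append(gossip)
    return others


def safe_to_activate(
    cluster_nodes: Iterable[Mapping[str, Optional[str]]],
    identity: str,
    local_ips: Set[str],
) -> bool:
    """True when no other host is seen advertising the identity."""
    return not identity_advertised_elsewhere(cluster_nodes, identity, local_ips)
```

A failover script on the backup node could call `safe_to_activate(...)` before switching identity and refuse to proceed when it returns `False`, leaving the decision to an operator.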
Our provider Dyjix experienced a network issue that reset all BGP sessions. Their router had to be restarted to restore connectivity.