rodenvk / greenhouse-crypto

Validator on Solana
GNU General Public License v3.0

Network outage on 2024-08-15 20:30-20:51 UTC -> 21:52 UTC #18

Open rodenvk opened 2 months ago

rodenvk commented 2 months ago

Our provider, Dyjix, experienced a network issue that reset all BGP sessions. Their router had to be restarted to restore connectivity.

rodenvk commented 2 months ago

Manual actions taken on the validators: at 20:33 UTC, the mainnet identity was switched to the backup node GREEN-HOUSE-2. Our validator was delinquent for 3 minutes.

Unfortunately, no out-of-band access was available to turn off the identity on GREEN-HOUSE-1.

At 20:51 UTC, both the main and backup nodes were advertising the same identity, which caused both nodes to crash.

On GREEN-HOUSE-1:

[2024-08-15T20:51:22.949179461Z ERROR solana_gossip::cluster_info] duplicate running instances of the same validator node: DtY5Bzxd75iWQRvKwM2xLUxqwLT1RRoeNwmVvgS2JANA
[2024-08-15T20:51:22.950669791Z WARN  solana_core::proxy::fetch_stage_manager] packet intercept receiver disconnected, shutting down

On GREEN-HOUSE-2:

[2024-08-15T20:57:58.022046649Z ERROR solana_gossip::cluster_info] duplicate running instances of the same validator node: DtY5Bzxd75iWQRvKwM2xLUxqwLT1RRoeNwmVvgS2JANA
[2024-08-15T20:57:58.023648199Z WARN  solana_core::proxy::fetch_stage_manager] packet intercept receiver disconnected, shutting down

rodenvk commented 2 months ago

Both nodes had to be restarted. The restart took longer than expected because a fresh snapshot had to be downloaded directly from Solana, and the download was interrupted repeatedly. Several attempts were required, adding more than 50 minutes of outage.

rodenvk commented 2 months ago

An improvement will be proposed: perform cluster health checks locally on each node, so that the main and backup nodes can no longer both serve our mainnet identity at the same time.
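
As a starting point for discussion, here is a minimal sketch of what such a local decision could look like. It is not the implemented solution. The names `Role`, `LocalHealth` and `decide_role` are hypothetical, and so are the two inputs: how a node actually probes cluster reachability or detects that its peer holds the identity is left out on purpose. The sketch also assumes that promotion to the active role stays a manual action.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"    # node runs with the staked mainnet identity
    STANDBY = "standby"  # node runs with an unstaked identity


@dataclass(frozen=True)
class LocalHealth:
    # Hypothetical inputs gathered by a probe running on the node itself.
    cluster_reachable: bool    # can this node currently reach the cluster?
    peer_holds_identity: bool  # is the other node seen using the staked identity?


def decide_role(current: Role, health: LocalHealth) -> Role:
    """Return the role this node should hold, using only local observations."""
    # An isolated node cannot know whether its peer took over, so it steps down.
    if not health.cluster_reachable:
        return Role.STANDBY
    # If the peer already holds the identity, never contend for it.
    if health.peer_holds_identity:
        return Role.STANDBY
    # Connected and no conflict: keep the current role.
    # Promotion from standby to active is assumed to remain a manual action.
    return current
```

Under these assumptions, GREEN-HOUSE-1 would have dropped to standby on its own once it lost connectivity, so it would not have come back at 20:51 UTC with the same identity as GREEN-HOUSE-2.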