sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0
4.62k stars, 443 forks

Node restart leads to data loss of the entire cluster #917

Closed jackin853 closed 10 months ago

jackin853 commented 10 months ago

We run a three-node stolon cluster with three keepers, three sentinels, and three proxies, all managed by a DaemonSet. The store backend is etcd. One of our nodes was intentionally powered off, and the entire cluster lost its data. Our investigation so far suggests that the powered-off node was hosting the keeper master. After it went down, a new master was elected from the other two slaves. Once the new master was chosen, the remaining slave started a full data synchronization from it. However, before this slave could complete the full synchronization, the new master ran into problems, and this slave was in turn elected as master. As a result, the data on this newest master is empty.
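The sequence of events in this report can be replayed with a small toy model. This is only a sketch of the failure mode as described above, not stolon's actual election logic: the `Keeper` class, the `resync_done` flag, and both `elect_*` functions are hypothetical names introduced for illustration. It assumes, as the report implies, that a full resync leaves the slave's data empty until it completes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Keeper:
    """Hypothetical model of a keeper; not stolon's real data structure."""
    name: str
    healthy: bool = True
    rows: int = 0              # stand-in for the amount of data on disk
    resync_done: bool = True   # False while a full resync is still running


def elect_naive(keepers: List[Keeper]) -> Optional[Keeper]:
    """Pick the first healthy keeper, ignoring its resync state."""
    for keeper in keepers:
        if keeper.healthy:
            return keeper
    return None


def elect_safe(keepers: List[Keeper]) -> Optional[Keeper]:
    """Pick a healthy keeper only if it has finished resyncing."""
    for keeper in keepers:
        if keeper.healthy and keeper.resync_done:
            return keeper
    return None


def replay_incident():
    """Replay the three steps from the report and return each election result."""
    a = Keeper("keeper-a", rows=1000)  # original master
    b = Keeper("keeper-b", rows=1000)
    c = Keeper("keeper-c", rows=1000)
    keepers = [a, b, c]

    # Step 1: the node hosting the master is powered off.
    a.healthy = False
    first = elect_safe(keepers)

    # Step 2: the remaining slave starts a full resync (assumed to empty its data).
    c.rows = 0
    c.resync_done = False

    # Step 3: the new master fails before the resync completes.
    b.healthy = False
    return first, elect_naive(keepers), elect_safe(keepers)


if __name__ == "__main__":
    first, naive, safe = replay_incident()
    print("first election:", first.name)
    print("naive second election:", naive.name, "rows =", naive.rows)
    print("safe second election:", safe)
```

In this model the naive election promotes the half-synced keeper with no data, which matches the reported outcome, while the guarded election refuses to promote anyone and leaves the cluster without a master instead of with an empty one.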

sgotti commented 10 months ago

@jackin853 There's no information on how it was deployed, how to reproduce it, relevant logs, etc. Given that stolon was born to handle such scenarios and can easily manage them if correctly deployed (persistent volumes etc.), I'm going to close this until you provide a way to reproduce it.