lwolf / stolon-chart

Kubernetes Helm chart to deploy an HA PostgreSQL cluster based on Stolon

Etcd cluster in CrashLoopBackOff loop #2

Closed by timfpark 7 years ago

timfpark commented 7 years ago

I've been running a Stolon cluster for about a week (very successfully), but today I noticed that I have lost an etcd pod completely and another is stuck in a CrashLoopBackOff cycle:

postgresql               postgresql-etcd-0                                  0/1       CrashLoopBackOff   159        13h
postgresql               postgresql-etcd-2                                  1/1       Running            0          1d
postgresql               postgresql-stolon-keeper-0                         1/1       Running            0          13h
postgresql               postgresql-stolon-keeper-1                         1/1       Running            0          1d
postgresql               postgresql-stolon-keeper-2                         1/1       Running            0          1d
postgresql               postgresql-stolon-proxy-3377369672-4vq28           0/1       Running            0          13h
postgresql               postgresql-stolon-proxy-3377369672-5jsd5           0/1       Running            0          13h
postgresql               postgresql-stolon-proxy-3377369672-qrxm6           0/1       Running            0          1d
postgresql               postgresql-stolon-sentinel-2884560845-fwc9w        1/1       Running            0          1d
postgresql               postgresql-stolon-sentinel-2884560845-r34nv        1/1       Running            0          13h
postgresql               postgresql-stolon-sentinel-2884560845-wgp4q        1/1       Running            0          1d
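
(For reference: a listing with the namespace in the first column like this would come from kubectl get pods --all-namespaces; kubectl get pods -n postgresql would scope it to just this namespace.)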

The logs for postgresql-etcd-0 are the following:

Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

Have you seen this before? Is there any way to easily restart the etcd portion of the cluster manually?

lwolf commented 7 years ago

Hi Tim. Unfortunately, yes. I mentioned it in the README for this chart and created an issue about it, kubernetes/charts#685.

I haven't tried it myself, but in theory it should be possible to manually delete the lost etcd members from the cluster and then scale the cluster back up.
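
A sketch of that recovery path, assuming etcdctl is available inside the surviving etcd pods and that the data volumes follow the usual StatefulSet PVC naming (the datadir-... PVC name below is an assumption; confirm with kubectl get pvc -n postgresql):

# From a healthy member, list the cluster and note the hex ID of the dead member
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member list

# Remove the dead member so the cluster stops expecting it back
kubectl exec -n postgresql postgresql-etcd-2 -- etcdctl member remove <member-id>

# Drop the broken pod's stale data volume so it can rejoin as a fresh member
kubectl delete pvc -n postgresql datadir-postgresql-etcd-0
kubectl delete pod -n postgresql postgresql-etcd-0

# Make sure the StatefulSet is back at its full size
kubectl scale statefulset -n postgresql postgresql-etcd --replicas=3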

timfpark commented 7 years ago

Thanks for your answer, and sorry for missing it in the README.

bharthur commented 6 years ago

For anyone coming here in the future: you can use this config to create the etcd cluster instead (tested only on GKE). The pull request in kubernetes/charts#685 didn't work for me on GKE.

rajeshneo commented 4 years ago

A workaround for this without losing your data or recreating your whole cluster: use Helm to scale the cluster down by one node:

helm upgrade etcd incubator/etcd --set replicas=2

Wait a few minutes and all nodes will do a rolling restart. Scale it back up, and voila :)
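
The scale-up is the same command with the original replica count (3 here is an assumption; use whatever your cluster normally runs):

# After the rolling restart settles, restore the original size
helm upgrade etcd incubator/etcd --set replicas=3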