lwolf / stolon-chart

Kubernetes Helm chart to deploy HA Postgresql cluster based on Stolon
MIT License

stolon chart deployment not working #22

Closed: solidiris closed this issue 6 years ago

solidiris commented 6 years ago

Hi all! I set up a VM with minikube on one of the hosts I manage and tested it with simple pods. Then I tried the stolon chart: the installation runs smoothly (no errors reported, running with the debug option enabled), but the service pods (proxy, sentinel, keeper) and the psql pods all enter an error state, and psql pods are created repeatedly (I counted over 300 pods when left running for some time). I tried reducing the overall resource consumption by lowering the number of replicas to 1, the memory request to 256, and the CPU to 50m, but with no success. I didn't change anything except the numbers mentioned above in the chart files. Logs are uploaded on gist here: https://gist.github.com/solidiris/362b6e5b29559e0a13680a5ded025d41 Have you ever encountered the same problem? Is there something wrong?
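For reference, the overrides I describe were along these lines; the value keys below are assumptions for illustration, the real names are in the chart's values.yaml:

```sh
# hypothetical key names -- check this chart's values.yaml for the real ones
helm install --name mine stolon/ --debug \
  --set keeper.replicaCount=1 \
  --set resources.requests.memory=256Mi \
  --set resources.requests.cpu=50m
```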

Thank you

lwolf commented 6 years ago

Hi, apologies for the late reply. Did you figure it out? From your description (you did not change anything) and the result of `get pods`, you should be using etcd as the backend, but it looks like you did not deploy it.
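A minimal sketch of that setup, assuming the chart's default endpoints (shown later in this thread) target an incubator/etcd release named `etcd`:

```sh
# deploy etcd first so the chart's default store endpoints resolve
helm install --name etcd incubator/etcd

# then install the stolon chart
helm install --name mine stolon/ --debug
```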

solidiris commented 6 years ago

Hello Sergey! Thank you for pointing at a potential issue. As soon as I'm working on this again, I will try it and post the results here.

nseyvet commented 6 years ago

Same behavior.

`cannot get cluster data: context deadline exceeded`

I did:

```
git clone ...
helm install --name mine stolon/ --debug
```

```
$ kubectl logs -f mine-stolon-keeper-0
2018-06-08T13:14:15.894Z    WARN    cmd/keeper.go:158   password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...   {"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z    WARN    cmd/keeper.go:158   password file permissions are too open. This file should only be readable to the user executing stolon! Continuing...   {"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z    INFO    cmd/keeper.go:1914  exclusive lock on data dir taken
2018-06-08T13:14:15.896Z    INFO    cmd/keeper.go:486   keeper uid  {"uid": "keeper0"}
2018-06-08T13:14:20.896Z    ERROR   cmd/keeper.go:693   error retrieving cluster data   {"error": "context deadline exceeded"}
2018-06-08T13:14:25.902Z    ERROR   cmd/keeper.go:932   error retrieving cluster data   {"error": "context deadline exceeded"}
[... the same ERROR line repeats every 10 seconds through 2018-06-08T13:19:25 ...]
```
```
$ kubectl logs -f mine-stolon-sentinel-5d7688f76d-79g6h
2018-06-08T13:14:17.252Z    INFO    cmd/sentinel.go:1873    sentinel uid    {"uid": "a218c47b"}
2018-06-08T13:14:17.252Z    INFO    cmd/sentinel.go:94  Trying to acquire sentinels leadership
2018-06-08T13:14:22.252Z    ERROR   cmd/sentinel.go:1727    error retrieving cluster data   {"error": "context deadline exceeded"}
2018-06-08T13:14:32.253Z    ERROR   cmd/sentinel.go:1727    error retrieving cluster data   {"error": "context deadline exceeded"}
```

My guess is that this is a problem with the following defaults:

```yaml
- --store-backend=etcdv3
- --store-endpoints=http://etcd-etcd-0.etcd-etcd:2379,http://etcd-etcd-1.etcd-etcd:2379,http://etcd-etcd-2.etcd-etcd:2379
```
nseyvet commented 6 years ago

I used the etcd helm chart from incubator: `helm install --name red incubator/etcd`. Then `kubectl get endpoints` returned endpoints: `http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379`. Using those endpoints, it seems to work (i.e. no constant restarts), but I still see similar errors in the logs:


```
2018-06-08T14:56:03.182Z    INFO    cmd/proxy.go:383    proxy uid   {"uid": "742aa2fb"}
2018-06-08T14:56:08.189Z    INFO    cmd/proxy.go:319    check function error    {"error": "cannot get cluster data: context deadline exceeded"}
2018-06-08T14:56:18.189Z    INFO    cmd/proxy.go:279    check timeout timer fired
2018-06-08T14:56:18.196Z    INFO    cmd/proxy.go:319    check function error    {"error": "cannot get cluster data: context deadline exceeded"}
```
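Pointing the chart at the discovered endpoints at install time might look like the sketch below; the `store.backend`/`store.endpoints` keys are assumptions about this chart's values.yaml, and the endpoints object name follows the `<release>-etcd` pattern seen above:

```sh
# discover the etcd endpoints created by the incubator/etcd release named "red"
kubectl get endpoints red-etcd

# pass them to the chart; literal commas inside a --set value must be escaped
helm install --name mine stolon/ --debug \
  --set store.backend=etcdv3 \
  --set store.endpoints="http://10.233.106.62:2379\,http://10.233.66.234:2379\,http://10.233.74.135:2379"
```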
lwolf commented 6 years ago

I updated values.yaml so that kubernetes is now the default store backend, to avoid future problems from not having etcd deployed. Could you try it and confirm that it works for you?
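A minimal sketch of trying the new default, assuming the chart exposes the backend under a `store.backend` key:

```sh
# after pulling the updated chart, a plain install should now use the kubernetes backend
helm install --name mine stolon/ --debug

# or request it explicitly (key name is an assumption -- check values.yaml)
helm install --name mine stolon/ --set store.backend=kubernetes
```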

nseyvet commented 6 years ago

Thanks. I will try that tomorrow.

Any ideas about the problem with etcd?


lwolf commented 6 years ago

Most likely your cluster-create job failed for some reason during the initial start. I've seen a lot of similar problems caused by some misconfigured value. That's why I'm changing the default, so people can start playing with a working version without dealing with an external dependency.
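A quick way to check that theory, as a sketch (the job name below is illustrative; use whatever `kubectl get jobs` shows for your release):

```sh
# list jobs created by the release and inspect the cluster-create one
kubectl get jobs
kubectl describe job mine-stolon-create-cluster   # illustrative name
kubectl logs job/mine-stolon-create-cluster
```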

nseyvet commented 6 years ago

Great!

There is a limitation with the Kubernetes store backend: the pod annotations keep being updated, so it looks as if the pods are not running well with e.g. `kubectl get pods -w`.

Could it be simpler to add etcd within this chart instead? It might be better from an HA perspective too.
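To see that churn directly, one can inspect the annotations on a single pod; a sketch with an example pod name:

```sh
# the stolon components keep rewriting pod annotations, which makes
# `kubectl get pods -w` print a new line on every update
kubectl get pod mine-stolon-keeper-0 -o jsonpath='{.metadata.annotations}'; echo
```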


lwolf commented 6 years ago

Yes, that problem exists. It's a bit annoying if you're used to watching pods (as I am), but most people don't care. I used to have a dependency on incubator/etcd as part of this chart, until I found out that it is broken and cannot recover from a restart of any pod: https://github.com/kubernetes/charts/issues/685. Since the bug is still there, the dependency was removed.

In terms of simplicity, the kubernetes backend is as simple as a bundled etcd dependency, or even easier.

In terms of HA, though, if you really care about HA, you should have a separately managed multi-node deployment of etcd, with backups, rolling updates, etc., and it should not depend on this chart. Hope this makes sense.

nseyvet commented 6 years ago

Will test it tomorrow!

Thanks


nseyvet commented 6 years ago

Works!

lwolf commented 6 years ago

great, thanks for letting me know