Hi, apologies for the late reply. Did you figure it out? From the description (you did not change anything, and from the result of "get pods"), you should be using etcd as a backend, but it looks like you did not deploy it.
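If it helps, a quick way to confirm whether an etcd store is actually running in the namespace could look like this; the "app=etcd" label selector is an assumption, adjust it to however etcd was deployed:

# List etcd pods and services, assuming they carry the label app=etcd:
kubectl get pods,svc -l app=etcd
# The store endpoints configured for stolon should resolve to one of these services.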
Hello Sergey! Thank you for pointing out a potential issue. As soon as I'm working on this again, I will try it and post the results here.
Same behavior.
cannot get cluster data: context deadline exceeded
I did: git clone ... helm install --name mine stolon/ --debug
kubectl logs -f mine-stolon-keeper-0
2018-06-08T13:14:15.894Z WARN cmd/keeper.go:158 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_repl_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z WARN cmd/keeper.go:158 password file permissions are too open. This file should only be readable to the user executing stolon! Continuing... {"file": "/etc/secrets/stolon/pg_su_password", "mode": "01000000777"}
2018-06-08T13:14:15.894Z INFO cmd/keeper.go:1914 exclusive lock on data dir taken
2018-06-08T13:14:15.896Z INFO cmd/keeper.go:486 keeper uid {"uid": "keeper0"}
2018-06-08T13:14:20.896Z ERROR cmd/keeper.go:693 error retrieving cluster data {"error": "context deadline exceeded"}
2018-06-08T13:14:25.902Z ERROR cmd/keeper.go:932 error retrieving cluster data {"error": "context deadline exceeded"}
(the keeper.go:932 "error retrieving cluster data: context deadline exceeded" line above repeats every ~10 seconds; identical entries trimmed)
nseyvet@xu-nseyvet-01:/local/git/gerrit/monasca-common$ kubectl logs -f mine-stolon-sentinel-5d7688f76d-79g6h
2018-06-08T13:14:17.252Z INFO cmd/sentinel.go:1873 sentinel uid {"uid": "a218c47b"}
2018-06-08T13:14:17.252Z INFO cmd/sentinel.go:94 Trying to acquire sentinels leadership
2018-06-08T13:14:22.252Z ERROR cmd/sentinel.go:1727 error retrieving cluster data {"error": "context deadline exceeded"}
2018-06-08T13:14:32.253Z ERROR cmd/sentinel.go:1727 error retrieving cluster data {"error": "context deadline exceeded"}
My guess is that this is a problem with the following defaults:
- --store-backend=etcdv3
- --store-endpoints=http://etcd-etcd-0.etcd-etcd:2379,http://etcd-etcd-1.etcd-etcd:2379,http://etcd-etcd-2.etcd-etcd:2379
I used the etcd Helm chart from incubator:
helm install --name red incubator/etcd
Then
kubectl get endpoints
-> endpoints: "http://10.233.106.62:2379,http://10.233.66.234:2379,http://10.233.74.135:2379"
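For reference, this is roughly how those endpoints could be plugged into the chart at install time; the store.backend/store.endpoints value names are assumptions, so check values.yaml for the actual keys:

# Hypothetical override using the endpoints discovered above
# (commas inside --set values must be escaped with a backslash):
helm install --name mine stolon/ --debug \
  --set store.backend=etcdv3 \
  --set store.endpoints="http://10.233.106.62:2379\,http://10.233.66.234:2379\,http://10.233.74.135:2379"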
Then, using those endpoints, it seems to work (i.e. no constant restarts), but I still see similar errors in the logs:
2018-06-08T14:56:03.182Z INFO cmd/proxy.go:383 proxy uid {"uid": "742aa2fb"}
2018-06-08T14:56:08.189Z INFO cmd/proxy.go:319 check function error {"error": "cannot get cluster data: context deadline exceeded"}
2018-06-08T14:56:18.189Z INFO cmd/proxy.go:279 check timeout timer fired
2018-06-08T14:56:18.196Z INFO cmd/proxy.go:319 check function error {"error": "cannot get cluster data: context deadline exceeded"}
I updated values.yaml so that kubernetes is now the default store backend, to avoid future problems with not having etcd. Could you try it and confirm that it works for you?
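In case it helps, this is roughly how the new default would be used or overridden at install time; the store.backend key name is an assumption, so check values.yaml:

# With the new default, a plain install should use the kubernetes backend:
helm install --name mine stolon/ --debug
# Or force it explicitly (key name assumed; see values.yaml):
helm install --name mine stolon/ --debug --set store.backend=kubernetes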
Thanks. I will try that tomorrow.
Any ideas about the problem with etcd?
Most likely your cluster-create-job failed for some reason during the initial start. I've seen a lot of similar problems caused by a misconfigured value. That's why I'm changing the default, so people can start playing with a working version without dealing with an external dependency.
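If you want to confirm that, checking the init job should show it; the job name below is a guess derived from the release name used earlier, so adjust it to whatever "kubectl get jobs" reports:

kubectl get jobs
# Hypothetical job name for the "mine" release:
kubectl logs job/mine-stolon-create-cluster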
Great!
There is a limitation with the Kubernetes store backend: the pod annotations keep being updated, so it looks as if the pods are not running well with e.g. "kubectl get pods -w".
Would it be simpler to add etcd within this chart instead? It may be better from an HA perspective too.
Yes, that problem exists. It's a bit annoying if you're used to watching pods (as I am), but most people don't care.
I used to have a dependency on incubator/etcd, which was part of this chart, until I found out that it is broken and cannot recover from a restart of any of its pods: https://github.com/kubernetes/charts/issues/685. The bug is still there, so the dependency was removed.
In terms of simplicity, the kubernetes backend is as simple as a dependent etcd, or even simpler.
In terms of HA, if you really care about it, you should have a separately managed multi-node etcd deployment, with backups, rolling updates, etc., and it should not depend on this chart. Hope this makes sense.
Will test it tomorrow!
Thanks
Works!
great, thanks for letting me know
Hi all! I set up a VM with minikube on one of the hosts I manage and tested it with simple pods. I tried the stolon chart: the installation runs smoothly (no errors reported, running with the debug option enabled), but the service pods (proxy, sentinel, keeper) and the psql pods all enter an error state, and psql pods are created repeatedly (I counted over 300 pods when left running for some time). I tried reducing the overall resource consumption by lowering the replica count to 1, the memory request to 256, and the CPU to 50m, but with no success. I didn't change anything except the above-mentioned numbers in the chart files. Logs are uploaded as a gist here: https://gist.github.com/solidiris/362b6e5b29559e0a13680a5ded025d41 Have you ever encountered the same problem? Is there something wrong?
Thank you!
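For anyone debugging a similar crash loop, a few generic checks that might help narrow it down; the pod and release names below are examples taken from earlier in the thread, not from this minikube setup:

# Recent events often show why pods keep failing (crashes, OOM, image pulls):
kubectl get events --sort-by=.metadata.creationTimestamp
# Describe and read the logs of one failing pod, e.g. a keeper:
kubectl describe pod mine-stolon-keeper-0
kubectl logs mine-stolon-keeper-0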