mmisztal1980 closed 5 years ago
Does it never stabilize? It looks like it's starting to work; it just hasn't joined all the servers yet.
Also, PVCs might be a real issue for sure. I didn't realize actually that StatefulSets with PVCs can be started before the PVC is available (spoiled in the environment we run in I guess). Do you know what that looks like? Is the directory just not available yet? That might be something we have to build into an init container or something (to wait for it to be ready).
@mitchellh no, it never does. Funnily enough, it used to work with the previous release of rook (v0.7).
In my understanding, the StatefulSet starts and attempts to bind the PVCs. Until that is done, the pod should report an unbound PVC issue, which should be easily accessible via kubectl describe or kubectl logs.
But during that time, the containers are started?
Sorry, easiest way to figure this out would be if you did more digging or I can get a reproduction. For the latter, is there an easy way for me to get a similar environment up and running?
They appear to be - the logs are there.
I'm happy to do more digging, however in order for you to get a repro, I'd have to provide you with terraform files, helm value files & k8s manifests to get a copy of my env going OR I'll simply share a kubeconfig file so that you can poke around and leave it running for the night
@mitchellh I've cloned the repo and made some alterations:
In values.yaml, server.storageclass: rook-ceph-block
In server-statefulset.yaml, readinessProbe.initialDelaySeconds: 60
The pods were in the ContainerCreating state for approx. 50s when the 1st pod reported Running. None of the pods were Ready.
helm status consul
LAST DEPLOYED: Wed Sep 26 23:34:25 2018
NAMESPACE: service-discovery
STATUS: DEPLOYED
RESOURCES:
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
consul-kktl5 1/1 Running 0 2m
consul-tbjmt 1/1 Running 0 2m
consul-x5cqq 1/1 Running 0 2m
consul-server-0 1/1 Running 0 2m
consul-server-1 1/1 Running 0 2m
consul-server-2 1/1 Running 0 2m
==> v1/ConfigMap
NAME DATA AGE
consul-client-config 1 2m
consul-server-config 1 2m
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
consul-dns ClusterIP 10.3.228.30 <none> 53/TCP,53/UDP 2m
consul-server ClusterIP None <none> 8500/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 2m
consul-ui ClusterIP 10.3.98.102 <none> 80/TCP 2m
==> v1/DaemonSet
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
consul 3 3 3 3 3 <none> 2m
==> v1/StatefulSet
NAME DESIRED CURRENT AGE
consul-server 3 3 2m
==> v1beta1/PodDisruptionBudget
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
consul-server N/A 0 0 2m
If you compare this with the previous results, the 0/1 entries mean that the pod is running but its readiness probe has failed.
NAME READY STATUS RESTARTS AGE
consul-56lvs 0/1 Running 0 36s
consul-jttwp 0/1 Running 0 36s
consul-qpgdn 0/1 Running 0 36s
consul-server-0 0/1 ContainerCreating 0 36s
consul-server-1 0/1 ContainerCreating 0 36s
consul-server-2 0/1 Running 0 36s
This is what the docs say:
failureThreshold: When a Pod starts and the probe fails, Kubernetes will try failureThreshold times
before giving up. Giving up in case of liveness probe means restarting the Pod. In case of readiness
probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
The failure threshold in server-statefulset.yaml is failureThreshold: 2, and the initial delay is initialDelaySeconds: 5.
Meaning that kubernetes started to fail the checks and gave up before the pvc(s) were bound to the pods?
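For reference, here is a minimal sketch of how these readiness probe fields sit together in a container spec. The failureThreshold and initialDelaySeconds values are the ones quoted above. The exec command and periodSeconds are assumptions for illustration only and are not copied from the chart.

```yaml
# Sketch of a readiness probe on the server container
# (illustrative only, not the chart's actual template).
readinessProbe:
  exec:
    command:               # assumption: placeholder check, the chart's real command may differ
      - /bin/sh
      - -ec
      - |
        curl -s http://127.0.0.1:8500/v1/status/leader | grep -E '".+"'
  initialDelaySeconds: 5   # value quoted above; raised to 60 in the test described earlier
  failureThreshold: 2      # value quoted above
  periodSeconds: 10        # assumption: the Kubernetes default, not taken from the chart
```

As far as I know, a readiness probe keeps running for the lifetime of the container, so a pod that has been marked Unready can still become Ready once the check starts passing.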
@mmisztal1980 I made the adjustments that you mentioned, but I am still seeing pods failing health checks and servers sitting in a Pending state.
consul consul-dzjnx 0/1 Running 0 4m
consul consul-g8lmf 0/1 Running 0 4m
consul consul-kx8l6 0/1 Running 0 4m
consul consul-server-0 0/1 Pending 0 4m
consul consul-server-1 0/1 Pending 0 4m
consul consul-server-2 0/1 Pending 0 4m
Consul logs seem to indicate that the stateful set isn't binding the PVC. I am also using Rook/Ceph for storage.
I checked the claims (kubectl get pvc) but don't see any claims made by Consul showing up in there.
Ugh, it looks like I was on an outdated version of the repo. After updating the repo and reinstalling, it is working now.
Added a PR where the probe settings are configurable via the chart values. That should help in our case.
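A rough sketch of what such values could look like. The key names under server.readinessProbe are hypothetical and may not match what the PR actually exposes. Only server.storageclass and the numbers come from this thread.

```yaml
# values.yaml (hypothetical layout for configurable probe settings)
server:
  storageclass: rook-ceph-block   # from the alterations described earlier in this thread
  readinessProbe:                 # hypothetical key, not confirmed by the PR
    initialDelaySeconds: 60       # the default mentioned in this thread is 5
    failureThreshold: 2           # value mentioned in this thread
```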
@mitchellh would the above PR be satisfactory? Tweaking the probe settings seems to have fixed the issue for myself and @jmreicha
In my cluster I have rook running. When provisioning the consul cluster, my helm values file looks like this:
In order to start the chart, I use the following cmdline:
A quick check of the PVCs indicates that they have bound successfully. Please note that, in my observation, it takes up to 20s to bind the PVCs running under rook.
However the consul-server pods have failed to start:
An examination of the pod indicates that the readiness probe has failed:
An examination of the server logs indicates that it has failed to form the cluster:
Any hints as to what may be wrong? I've noticed that the probe's initialDelaySeconds default value is 5, so I'm guessing it may have failed before the PVCs were bound. Perhaps it'd make sense to have this value configurable?