hongkailiu opened this issue 6 years ago
Forgot to mention: I needed to override the path of `kubectl`:

```
# ansible-playbook -i ~/aaa/gcs.yml deploy-gcs.yml --extra-vars "kubectl=/usr/bin/kubectl" -v
```
Nice. Thanks for trying it out. :+1:

Since you've changed the StatefulSet (SS) name, I think you'll also need to change `GD2_CLIENTADDRESS` and `GD2_PEERADDRESS`. They should end up with the name of the pod that gets spawned by the SS that you renamed, i.e., `<the_ss_name>-0`. I think this will fix the problem of gd2 being unhealthy.
@JohnStrunk I changed the name using a new var; the old var `kube_hostname` is kept intact:

```diff
- name: gluster-{{ kube_hostname }}
+ name: gluster-{{ gcs_node_index }}
```
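For reference, a sketch of how `gcs_node_index` can be supplied via the loop index when rendering one SS per node (the task and file names here are illustrative, not the actual playbook; only `gcs_node_index` and the `gcs-node` group come from this thread):

```yaml
# Sketch: render one StatefulSet per gcs-node host, using the loop
# index so the SS names become gluster-0, gluster-1, gluster-2, ...
- name: Render one glusterd2 StatefulSet per node   # hypothetical task name
  template:
    src: gluster-statefulset.yml.j2                 # hypothetical file name
    dest: "/tmp/gluster-{{ gcs_node_index }}.yml"
  loop: "{{ groups['gcs-node'] }}"
  loop_control:
    index_var: gcs_node_index
```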
```
# oc get sts gluster-ip-172-31-47-15-us-west-2-compute-internal -o yaml | grep ESS -A1 | grep -v Set
  creationTimestamp: 2018-10-25T19:31:43Z
--
        - name: GD2_CLIENTADDRESS
          value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24007
        - name: GD2_PEERADDRESS
          value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24008
```
So that should not be the cause. Did I miss something?
From above:

```
    Environment:
      GD2_ETCDENDPOINTS:  http://etcd-client.gcs:2379
      GD2_CLUSTER_ID:     dd68cd6b-b828-4c13-86a4-35c492b5d4c2
      GD2_CLIENTADDRESS:  gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
      GD2_PEERADDRESS:    gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
      GD2_RESTAUTH:       false
```
Client and peer addresses need to point at the pod's address. I think you need to update those env vars in the template in addition to changing the `name` field. The pod's name is now `gluster-1-0`, but the client and peer addresses still point to the old name.
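Roughly, after the rename the env section of the template should end up like this (a sketch; the service name `glusterd2`, namespace `gcs`, and ports are taken from the output above):

```yaml
# Sketch: the addresses follow the pod name <the_ss_name>-0.
env:
- name: GD2_CLIENTADDRESS
  value: "gluster-{{ gcs_node_index }}-0.glusterd2.gcs:24007"
- name: GD2_PEERADDRESS
  value: "gluster-{{ gcs_node_index }}-0.glusterd2.gcs:24008"
```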
Yes. Now I understand.
New issue:

```
TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
Thursday 25 October 2018  19:51:00 +0000 (0:00:00.097)       0:00:54.215 ******
FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).
fatal: [master]: FAILED! => {"msg": "The conditional check 'result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)' failed. The error was: error while evaluating conditional (result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)): 'dict object' has no attribute 'kube-node'"}
```
My inventory file:

```
# cat ~/aaa/gcs.yml
master ansible_host=ip-172-31-43-164.us-west-2.compute.internal
ip-172-31-47-15.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-59-125.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-60-208.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'

[kube-master]
master

[gcs-node]
ip-172-31-47-15.us-west-2.compute.internal
ip-172-31-59-125.us-west-2.compute.internal
ip-172-31-60-208.us-west-2.compute.internal
```
I think I know the problem: it should probably be `groups['gcs-node']` instead of `groups['kube-node']`, since my inventory has no `kube-node` group. Will know soon.
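If so, the wait task would become something like this (a sketch; the `uri` URL variable is hypothetical, the rest mirrors the failing conditional above):

```yaml
# Sketch: count peers against the gcs-node group, not kube-node.
- name: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready
  uri:
    url: "http://{{ gd2_client_endpoint }}/v1/peers"   # hypothetical var
    return_content: yes                                # so result.json is populated
  register: result
  until: result.status is defined and (result.status == 200 and result.json|length == groups['gcs-node']|length)
  retries: 50
  delay: 10
```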
Now a new issue:

```
TASK [GCS | GD2 Cluster | Add devices | Set facts] *****************************************************************************
Thursday 25 October 2018  20:15:04 +0000 (0:00:00.041)       0:01:30.000 ******
ok: [master] => {"ansible_facts": {"kube_hostname": "ip"}, "changed": false}

TASK [GCS | GD2 Cluster | Add devices | Add devices for ip] ********************************************************************
Thursday 25 October 2018  20:15:05 +0000 (0:00:00.115)       0:01:30.115 ******
fatal: [master]: FAILED! => {"msg": "u\"hostvars['ip']\" is undefined"}
```
Will continue tomorrow.
Found the problem again. The playbook has strong restrictions on the node name:

https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L5

This line assumes that, after splitting the peer name on `-`, index 1 is the node name. Will think about a solution tomorrow; an illustration of the failure is below.
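My reconstruction of the failing `set_fact` (hypothetical, but consistent with the `kube_hostname: "ip"` fact in the log above):

```yaml
# Reconstruction (hypothetical): take index 1 of the '-'-split peer name.
# "gluster-node1-0".split('-')[1]                        -> "node1"  (works)
# "gluster-ip-172-31-47-15.us-west-2.compute.internal-0"
#                      .split('-')[1]                    -> "ip"     (broken on AWS names)
- name: GCS | GD2 Cluster | Add devices | Set facts
  set_fact:
    kube_hostname: "{{ item.name.split('-')[1] }}"
```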
The next line, https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L13, takes the JSON result from the endpoint and then uses that hostname as the key to look up the host's devices in the inventory. Maybe we need to wait for the fix/response on bz1643191.
The workaround above fixes the playbook's problem only partially; the run still cannot go through.
Need SCC setup for the hostpath plugin: followed this doc to work around it (a sketch of the command is below).

The OCP cluster is on AWS; the node names are like `ip-172-31-10-185.us-west-2.compute.internal`, which causes problems for the StatefulSet. Created bz1643191. To work around this issue: use a simpler name for the StatefulSet.
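For the SCC part, the setup is along these lines (assuming the pods run as the `default` service account in the `gcs` namespace, which matches the `default-token` mount and the `openshift.io/scc=hostpath` annotation in the output below):

```
# Grant the hostpath SCC to the default service account in the gcs namespace
# (assumption: SA "default", namespace "gcs").
oc adm policy add-scc-to-user hostpath -z default -n gcs
```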
The deploy is now stuck at "Wait for glusterd2-cluster to become ready":
```
# oc get pod
NAME                             READY     STATUS    RESTARTS   AGE
etcd-6jmbmv6sw7                  1/1       Running   0          23m
etcd-mvwq6c2w6f                  1/1       Running   0          23m
etcd-n92rtb9wfr                  1/1       Running   0          23m
etcd-operator-54bbdfc55d-mdvd9   1/1       Running   0          24m
gluster-0-0                      1/1       Running   7          23m
gluster-1-0                      1/1       Running   7          23m
gluster-2-0                      1/1       Running   7          23m
```
```
# oc describe pod gluster-1-0
Name:               gluster-1-0
Namespace:          gcs
Priority:           0
PriorityClassName:  <none>
Node:               ip-172-31-59-125.us-west-2.compute.internal/172.31.59.125
Start Time:         Thu, 25 Oct 2018 18:39:49 +0000
Labels:             app.kubernetes.io/component=glusterfs
                    app.kubernetes.io/name=glusterd2
                    app.kubernetes.io/part-of=gcs
                    controller-revision-hash=gluster-1-598d756667
                    statefulset.kubernetes.io/pod-name=gluster-1-0
Annotations:        openshift.io/scc=hostpath
Status:             Running
IP:                 172.21.0.15
Controlled By:      StatefulSet/gluster-1
Containers:
  glusterd2:
    Container ID:   docker://0433446ecbd7a25d5aa9f51f0bd5c3226090850b18d0e63d58d07e47c6fdd039
    Image:          docker.io/gluster/glusterd2-nightly:20180920
    Image ID:       docker-pullable://docker.io/gluster/glusterd2-nightly@sha256:7013c3de3ed2c8b9c380c58b7c331dfc70df39fe13faea653b25034545971072
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Oct 2018 19:00:48 +0000
      Finished:     Thu, 25 Oct 2018 19:03:48 +0000
    Ready:          False
    Restart Count:  7
    Liveness:       http-get http://:24007/ping delay=10s timeout=1s period=60s #success=1 #failure=3
    Environment:
      GD2_ETCDENDPOINTS:  http://etcd-client.gcs:2379
      GD2_CLUSTER_ID:     dd68cd6b-b828-4c13-86a4-35c492b5d4c2
      GD2_CLIENTADDRESS:  gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
      GD2_PEERADDRESS:    gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
      GD2_RESTAUTH:       false
    Mounts:
      /dev from gluster-dev (rw)
      /run/lvm from gluster-lvm (rw)
      /sys/fs/cgroup from gluster-cgroup (ro)
      /usr/lib/modules from gluster-kmods (ro)
      /var/lib/glusterd2 from glusterd2-statedir (rw)
      /var/log/glusterd2 from glusterd2-logdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-hvj7w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  gluster-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  gluster-cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  gluster-lvm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/lvm
    HostPathType:
  gluster-kmods:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib/modules
    HostPathType:
  glusterd2-statedir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/glusterd2
    HostPathType:  DirectoryOrCreate
  glusterd2-logdir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/glusterd2
    HostPathType:  DirectoryOrCreate
  default-token-hvj7w:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-hvj7w
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     <none>
Events:
  Type     Reason     Age                From                                                   Message
  ----     ------     ----               ----                                                   -------
  Normal   Scheduled  25m                default-scheduler                                      Successfully assigned gcs/gluster-1-0 to ip-172-31-59-125.us-west-2.compute.internal
  Normal   Pulling    25m                kubelet, ip-172-31-59-125.us-west-2.compute.internal   pulling image "docker.io/gluster/glusterd2-nightly:20180920"
  Normal   Pulled     24m                kubelet, ip-172-31-59-125.us-west-2.compute.internal   Successfully pulled image "docker.io/gluster/glusterd2-nightly:20180920"
  Normal   Created    16m (x4 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Created container
  Normal   Started    16m (x4 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Started container
  Normal   Killing    16m (x3 over 22m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Killing container with id docker://glusterd2:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     16m (x3 over 22m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Container image "docker.io/gluster/glusterd2-nightly:20180920" already present on machine
  Warning  Unhealthy  4m (x21 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Liveness probe failed: Get http://172.21.0.15:24007/ping: dial tcp 172.21.0.15:24007: connect: connection refused
```