gluster / gcs

Check github.com/heketi, github.com/gluster/gluster-containers, or github.com/kadalu/kadalu as active alternatives
https://gluster.org
Apache License 2.0

Failed to deploy on OCP 3.11 #46

Open hongkailiu opened 6 years ago

hongkailiu commented 6 years ago
  1. Need SCC setup for the hostpath plugin: followed this doc to work around it.

  2. The OCP cluster is on AWS: the node names look like ip-172-31-10-185.us-west-2.compute.internal, which causes problems for the StatefulSet name. Created bz1643191.

To work around this issue, use a simpler name for the StatefulSet:

# git diff deploy-gcs.yml tasks/create-gd2-manifests.yml templates/gcs-manifests/gcs-gd2.yml.j2
diff --git a/deploy/deploy-gcs.yml b/deploy/deploy-gcs.yml
index 5efd085..fc0bac1 100644
--- a/deploy/deploy-gcs.yml
+++ b/deploy/deploy-gcs.yml
@@ -42,8 +42,10 @@

         - name: GCS Pre | Manifests | Create GD2 manifests
           include_tasks: tasks/create-gd2-manifests.yml
           loop: "{{ groups['gcs-node'] }}"
           loop_control:
+            index_var: index
             loop_var: gcs_node

   post_tasks:
diff --git a/deploy/tasks/create-gd2-manifests.yml b/deploy/tasks/create-gd2-manifests.yml
index d9a2d2d..4c015ef 100644
--- a/deploy/tasks/create-gd2-manifests.yml
+++ b/deploy/tasks/create-gd2-manifests.yml
@@ -3,6 +3,7 @@
 - name: GCS Pre | Manifests | Create GD2 manifests for {{ gcs_node }} | Set fact kube_hostname
   set_fact:
     kube_hostname: "{{ gcs_node }}"
+    gcs_node_index: "{{ index }}"

 - name: GCS Pre | Manifests | Create GD2 manifests for {{ gcs_node }} | Create gcs-gd2-{{ gcs_node }}.yml
   template:
diff --git a/deploy/templates/gcs-manifests/gcs-gd2.yml.j2 b/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
index fe48b35..3376b11 100644
--- a/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
+++ b/deploy/templates/gcs-manifests/gcs-gd2.yml.j2
@@ -2,7 +2,7 @@
 kind: StatefulSet
 apiVersion: apps/v1
 metadata:
-  name: gluster-{{ kube_hostname }}
+  name: gluster-{{ gcs_node_index }}
   namespace: {{ gcs_namespace }}
   labels:
     app.kubernetes.io/part-of: gcs
  3. Then failed on "Wait for glusterd2-cluster to become ready":
    
    TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
    Thursday 25 October 2018  18:39:50 +0000 (0:00:00.083)       0:00:54.274 ****** 
    FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).

oc get pod

NAME                             READY   STATUS    RESTARTS   AGE
etcd-6jmbmv6sw7                  1/1     Running   0          23m
etcd-mvwq6c2w6f                  1/1     Running   0          23m
etcd-n92rtb9wfr                  1/1     Running   0          23m
etcd-operator-54bbdfc55d-mdvd9   1/1     Running   0          24m
gluster-0-0                      1/1     Running   7          23m
gluster-1-0                      1/1     Running   7          23m
gluster-2-0                      1/1     Running   7          23m

oc describe pod gluster-1-0

Name:               gluster-1-0
Namespace:          gcs
Priority:           0
PriorityClassName:  <none>
Node:               ip-172-31-59-125.us-west-2.compute.internal/172.31.59.125
Start Time:         Thu, 25 Oct 2018 18:39:49 +0000
Labels:             app.kubernetes.io/component=glusterfs
                    app.kubernetes.io/name=glusterd2
                    app.kubernetes.io/part-of=gcs
                    controller-revision-hash=gluster-1-598d756667
                    statefulset.kubernetes.io/pod-name=gluster-1-0
Annotations:        openshift.io/scc=hostpath
Status:             Running
IP:                 172.21.0.15
Controlled By:      StatefulSet/gluster-1
Containers:
  glusterd2:
    Container ID:   docker://0433446ecbd7a25d5aa9f51f0bd5c3226090850b18d0e63d58d07e47c6fdd039
    Image:          docker.io/gluster/glusterd2-nightly:20180920
    Image ID:       docker-pullable://docker.io/gluster/glusterd2-nightly@sha256:7013c3de3ed2c8b9c380c58b7c331dfc70df39fe13faea653b25034545971072
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Oct 2018 19:00:48 +0000
      Finished:     Thu, 25 Oct 2018 19:03:48 +0000
    Ready:          False
    Restart Count:  7
    Liveness:       http-get http://:24007/ping delay=10s timeout=1s period=60s #success=1 #failure=3
    Environment:
      GD2_ETCDENDPOINTS:  http://etcd-client.gcs:2379
      GD2_CLUSTER_ID:     dd68cd6b-b828-4c13-86a4-35c492b5d4c2
      GD2_CLIENTADDRESS:  gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
      GD2_PEERADDRESS:    gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
      GD2_RESTAUTH:       false
    Mounts:
      /dev from gluster-dev (rw)
      /run/lvm from gluster-lvm (rw)
      /sys/fs/cgroup from gluster-cgroup (ro)
      /usr/lib/modules from gluster-kmods (ro)
      /var/lib/glusterd2 from glusterd2-statedir (rw)
      /var/log/glusterd2 from glusterd2-logdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-hvj7w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  gluster-dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:
  gluster-cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  gluster-lvm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/lvm
    HostPathType:
  gluster-kmods:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib/modules
    HostPathType:
  glusterd2-statedir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/glusterd2
    HostPathType:  DirectoryOrCreate
  glusterd2-logdir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/glusterd2
    HostPathType:  DirectoryOrCreate
  default-token-hvj7w:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-hvj7w
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     <none>
Events:
  Type     Reason     Age                From                                                   Message
  Normal   Scheduled  25m                default-scheduler                                      Successfully assigned gcs/gluster-1-0 to ip-172-31-59-125.us-west-2.compute.internal
  Normal   Pulling    25m                kubelet, ip-172-31-59-125.us-west-2.compute.internal   pulling image "docker.io/gluster/glusterd2-nightly:20180920"
  Normal   Pulled     24m                kubelet, ip-172-31-59-125.us-west-2.compute.internal   Successfully pulled image "docker.io/gluster/glusterd2-nightly:20180920"
  Normal   Created    16m (x4 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Created container
  Normal   Started    16m (x4 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Started container
  Normal   Killing    16m (x3 over 22m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Killing container with id docker://glusterd2:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     16m (x3 over 22m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Container image "docker.io/gluster/glusterd2-nightly:20180920" already present on machine
  Warning  Unhealthy  4m (x21 over 24m)  kubelet, ip-172-31-59-125.us-west-2.compute.internal   Liveness probe failed: Get http://172.21.0.15:24007/ping: dial tcp 172.21.0.15:24007: connect: connection refused

hongkailiu commented 6 years ago

Forgot to mention: I also needed to override the path of kubectl:

# ansible-playbook -i ~/aaa/gcs.yml deploy-gcs.yml --extra-vars "kubectl=/usr/bin/kubectl" -v
JohnStrunk commented 6 years ago

Nice. Thanks for trying it out. :+1: Since you've changed the SS name, I think you'll also need to change GD2_CLIENTADDRESS and GD2_PEERADDRESS. They should end up w/ the name of the pod that gets spawned by the SS that you renamed... i.e., <the_ss_name>-0. I think this will fix the problem of gd2 being unhealthy.
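
For example, with the renamed SS gluster-1 (and the glusterd2 service in the gcs namespace), I'd expect them to end up looking something like:

    GD2_CLIENTADDRESS: gluster-1-0.glusterd2.gcs:24007
    GD2_PEERADDRESS:   gluster-1-0.glusterd2.gcs:24008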

hongkailiu commented 6 years ago

@JohnStrunk I changed the name by using the new var; the old var kube_hostname is kept intact.

-  name: gluster-{{ kube_hostname }}
+  name: gluster-{{ gcs_node_index }}

# oc get sts gluster-ip-172-31-47-15-us-west-2-compute-internal -o yaml | grep ESS -A1 | grep -v Set
  creationTimestamp: 2018-10-25T19:31:43Z
--
        - name: GD2_CLIENTADDRESS
          value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24007
        - name: GD2_PEERADDRESS
          value: gluster-ip-172-31-47-15.us-west-2.compute.internal-0.glusterd2.gcs:24008

So that should not be the cause. Did I miss something?

JohnStrunk commented 6 years ago

From above:

   Environment:
      GD2_ETCDENDPOINTS:  http://etcd-client.gcs:2379
      GD2_CLUSTER_ID:     dd68cd6b-b828-4c13-86a4-35c492b5d4c2
      GD2_CLIENTADDRESS:  gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24007
      GD2_PEERADDRESS:    gluster-ip-172-31-59-125.us-west-2.compute.internal-0.glusterd2.gcs:24008
      GD2_RESTAUTH:       false

Client and peer addresses need to aim at the pod's address. I think you need to update those ENV vars in the template in addition to changing the name field. The pod's name is now gluster-1-0, but client & peer still point to the old name.
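
Roughly, the template change would look something like this (just a sketch; I haven't checked the exact value lines in gcs-gd2.yml.j2, so they may differ a bit):

         - name: GD2_CLIENTADDRESS
-          value: "gluster-{{ kube_hostname }}-0.glusterd2.{{ gcs_namespace }}:24007"
+          value: "gluster-{{ gcs_node_index }}-0.glusterd2.{{ gcs_namespace }}:24007"
         - name: GD2_PEERADDRESS
-          value: "gluster-{{ kube_hostname }}-0.glusterd2.{{ gcs_namespace }}:24008"
+          value: "gluster-{{ gcs_node_index }}-0.glusterd2.{{ gcs_namespace }}:24008"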

hongkailiu commented 6 years ago

Yes. Now I understand.

New issue:

TASK [GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready] **********************************************************
Thursday 25 October 2018  19:51:00 +0000 (0:00:00.097)       0:00:54.215 ****** 
FAILED - RETRYING: GCS | GD2 Cluster | Wait for glusterd2-cluster to become ready (50 retries left).
fatal: [master]: FAILED! => {"msg": "The conditional check 'result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)' failed. The error was: error while evaluating conditional (result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)): 'dict object' has no attribute 'kube-node'"}

My inv file:

# cat ~/aaa/gcs.yml 
master ansible_host=ip-172-31-43-164.us-west-2.compute.internal

ip-172-31-47-15.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-59-125.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'
ip-172-31-60-208.us-west-2.compute.internal gcs_disks='["/dev/nvme2n1"]'

[kube-master]
master

[gcs-node]
ip-172-31-47-15.us-west-2.compute.internal
ip-172-31-59-125.us-west-2.compute.internal
ip-172-31-60-208.us-west-2.compute.internal

I think I know the problem. It might be groups['gcs-node'] instead of groups['kube-node']. Will know soon.
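
If so, the fix would presumably be a one-word change to the wait task's until condition in deploy-gcs.yml (a sketch based on the expression in the error above; adding a [kube-node] group to my inventory would probably also satisfy it):

-    until: result.status is defined and (result.status == 200 and result.json|length == groups['kube-node']|length)
+    until: result.status is defined and (result.status == 200 and result.json|length == groups['gcs-node']|length)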

hongkailiu commented 6 years ago

Now, a new issue:

TASK [GCS | GD2 Cluster | Add devices | Set facts] *****************************************************************************
Thursday 25 October 2018  20:15:04 +0000 (0:00:00.041)       0:01:30.000 ****** 
ok: [master] => {"ansible_facts": {"kube_hostname": "ip"}, "changed": false}

TASK [GCS | GD2 Cluster | Add devices | Add devices for ip] ********************************************************************
Thursday 25 October 2018  20:15:05 +0000 (0:00:00.115)       0:01:30.115 ****** 
fatal: [master]: FAILED! => {"msg": "u\"hostvars['ip']\" is undefined"}

Will continue tomorrow.

hongkailiu commented 6 years ago

Found the cause of the new problem. The playbook has strong restrictions on the node name.

https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L5

That line assumes the node name sits at index 1 after splitting the peer name on '-', which is why kube_hostname came back as just "ip" above. Will think about a solution tomorrow.
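
To illustrate (a sketch of the pattern only, not the exact source line, and peer is a made-up variable name here):

    # Illustrative sketch only: peer names look like gluster-kube1-0, so index 1
    # after splitting on '-' is the node name. With an AWS-style peer name like
    # gluster-ip-172-31-47-15.us-west-2.compute.internal-0 it yields just "ip".
    - set_fact:
        kube_hostname: "{{ peer.name.split('-')[1] }}"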

hongkailiu commented 6 years ago

https://github.com/gluster/gcs/blob/master/deploy/tasks/add-devices-to-peer.yml#L13 This line takes the JSON result from the endpoint and then uses the hostname as the key to look up the devices in the inventory. Maybe we need to wait for the fix/response on bz1643191.
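
Illustratively again (not the exact task, variable names made up), the lookup is along these lines, so with kube_hostname reduced to "ip" it asks for hostvars['ip'], which does not exist in my inventory:

    # Illustrative sketch only
    loop: "{{ hostvars[kube_hostname]['gcs_disks'] }}"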

The workaround above only partially fixes the playbook; the deployment still cannot go through.