kahkhang / kube-linode

:whale: Provision a Kubernetes/CoreOS cluster on Linode
MIT License

Fixes issue where Rook operator fails to start #51

Closed: JamesMura closed this 6 years ago

JamesMura commented 6 years ago

Adds the POD_NAMESPACE variable to the Rook operator manifest. Fixes #50.
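
For context, the variable is populated via the Kubernetes downward API. The env entry added to the operator container looks roughly like this (metadata.namespace is the standard downward API field for a pod's namespace):

env:
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace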

kahkhang commented 6 years ago

Thanks for the fix 👍

displague commented 6 years ago

I cloned after this change was merged and encountered the same issue.

failed to run operator. Error starting agent daemonset: Error starting agent daemonset: cannot detect the pod name. Please provide it using the downward API in the manifest file

I'm new to k8s - is there a way to recreate this Rook image? I tried to bring up other pods and they all failed with "PersistentVolumeClaim is not bound" because of this. Should I just recreate those pods (cockroachdb, wordpress, nginx ingress), or would they automatically reconnect to a new Rook?

displague commented 6 years ago

FWIW -- here is a diff of the rook-operator.yaml included in this repo against the upstream example from https://github.com/rook/rook/blob/master/cluster/examples/kubernetes/rook-operator.yaml:

--- rook-operator.yaml  2017-10-25 22:12:46.000000000 -0400
+++ rook-operator.example.yaml  2017-11-01 05:29:49.000000000 -0400
@@ -1,3 +1,8 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: rook-system
+---
 kind: ClusterRole
 apiVersion: rbac.authorization.k8s.io/v1beta1
 metadata:
@@ -12,6 +17,7 @@
   - pods
   - services
   - nodes
+  - nodes/proxy
   - configmaps
   - events
   - persistentvolumes
@@ -52,6 +58,8 @@
   resources:
   - clusterroles
   - clusterrolebindings
+  - roles
+  - rolebindings
   verbs:
   - get
   - list
@@ -79,13 +87,13 @@
 kind: ServiceAccount
 metadata:
   name: rook-operator
-  namespace: default
+  namespace: rook-system
 ---
 kind: ClusterRoleBinding
 apiVersion: rbac.authorization.k8s.io/v1beta1
 metadata:
   name: rook-operator
-  namespace: default
+  namespace: rook-system
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
@@ -93,13 +101,13 @@
 subjects:
 - kind: ServiceAccount
   name: rook-operator
-  namespace: default
+  namespace: rook-system
 ---
 apiVersion: apps/v1beta1
 kind: Deployment
 metadata:
   name: rook-operator
-  namespace: default
+  namespace: rook-system
 spec:
   replicas: 1
   template:
@@ -122,6 +130,18 @@
         # current mon with a new mon (useful for compensating flapping network).
         - name: ROOK_MON_OUT_TIMEOUT
           value: "300s"
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: spec.nodeName
+        - name: ROOK_OPERATOR_SERVICE_ACCOUNT
+          valueFrom:
+            fieldRef:
+              fieldPath: spec.serviceAccountName
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
         - name: POD_NAMESPACE
           valueFrom:
             fieldRef:
displague commented 6 years ago

From that diff, I removed the rook-system changes and the front matter defining the rook-system namespace (the top 5 lines), and saved the result as manifests/rook-operator.example.yaml.

I then ran:

kubectl replace -f rook-operator.example.yaml

Now the rook-operator is running and I have PersistentVolumes. Some of the other pods I created after the initial install have recovered. However, I now have failing rook-agent pods, one per node:

Error: failed to start container "rook-agent": Error response from daemon:
{"message":"mkdir /usr/libexec/kubernetes: read-only file system"}
Error syncing pod
Back-off restarting failed container

Sounds like https://github.com/rook/rook/issues/1120

kahkhang commented 6 years ago

Hi @displague, thanks so much for investigating this! It sounds like that is indeed the issue: either the Kubernetes or Rook API has changed a bit, so this now needs some fixing. Unfortunately, I'm swamped with school right now, so I'm afraid I won't be able to look deeper into this until mid-December. If you wish, feel free to submit a PR before then :)

My hunch is that a flag needs to be added to the kubelet.service systemd unit (defined in https://github.com/kahkhang/kube-linode/blob/master/manifests/container-linux-config.yaml and https://github.com/kahkhang/kube-linode/blob/master/manifests/container-linux-config-worker.yaml) to include --volume-plugin-dir=/etc/kubernetes/volumeplugins, and that the Kubernetes version probably also needs to be bumped to a recent release that supports FlexVolume plugins. A rough sketch of the kind of change I mean is below.
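
Something along these lines in the Container Linux Config, purely as a sketch (the kubelet-wrapper invocation and the rest of the unit are placeholders here, not the actual contents of those files):

systemd:
  units:
    - name: kubelet.service
      enabled: true
      contents: |
        [Service]
        # Sketch only: the existing kubelet flags are omitted. The point is to
        # move the FlexVolume plugin directory off /usr/libexec (read-only on
        # Container Linux, hence the mkdir error above) to a writable path.
        ExecStart=/usr/lib/coreos/kubelet-wrapper \
          --volume-plugin-dir=/etc/kubernetes/volumeplugins
        Restart=always
        RestartSec=10

        [Install]
        WantedBy=multi-user.target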

Thanks so much once again for highlighting the issue!