kvaps / kube-linstor

Containerized LINSTOR SDS for Kubernetes, ready for production use.
Apache License 2.0

Stork not scheduling pods with volumes #13

Closed: fondemen closed this issue 4 years ago

fondemen commented 4 years ago

Hello,

I have a small cluster of 3 VMs configured with stork enabled. I have deployments with only one replica each for linstor-linstor-stork and linstor-linstor-stork-scheduler, both running on the master node. I took care to align the stork-scheduler version with my K8s version (1.18.3; it would be nice to add a comment about this in kube-linstor/examples/linstor.yaml). However, when running a pod mounting a linstor PVC, the pod is scheduled on a node that doesn't hold a replica. In the stork-scheduler logs, I have plenty of:

E0612 15:15:38.963174 1 leaderelection.go:320] error retrieving resource lock kube-system/stork-scheduler: leases.coordination.k8s.io "stork-scheduler" is forbidden: User "system:serviceaccount:linstor:linstor-linstor-stork-scheduler" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"

I guess an authorization is missing here...
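
One way to confirm the missing permission from outside the scheduler is plain kubectl impersonation (the service account name is taken from the error above):

kubectl auth can-i get leases.coordination.k8s.io -n kube-system \
  --as=system:serviceaccount:linstor:linstor-linstor-stork-scheduler
# requires impersonation rights (e.g. run as cluster-admin); prints "no" while the permission is missing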

Cheers

kvaps commented 4 years ago

Hey, thanks for the report!

We're forced to use an older version of kube-scheduler for linstor due to the upstream bug https://github.com/kubernetes/kubernetes/issues/86281 (and, before that, https://github.com/kubernetes/kubernetes/issues/84169).

v1.16.9 works fine on a v1.18.x cluster, so I set it as the default.

But I agree that we need to support newer versions too, so I'm going to add the leases.coordination.k8s.io resource to the stork-scheduler role.
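
Roughly, the rule to add looks like this (verbs inferred from what the scheduler's leader election needs, not copied from the final commit):

  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]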

kvaps commented 4 years ago

Fixed in https://github.com/kvaps/kube-linstor/commit/1bf3eca0a971db06925b68946da4bb5427bcf548. @fondemen, please check the version from master and see if it solves your problem. Thanks!

fondemen commented 4 years ago

Thanks for your quick response! Unfortunately, no: the same message is thrown at me. Even if I give full access (cluster-admin) to linstor-linstor-stork-scheduler, the erroneous access logs stop, but my pod is still scheduled on the wrong node:

vagrant@l01:~$ linstor v l
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Node | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    |  Allocated | InUse  |    State |
|===========================================================================================================================================|
| l01  | pvc-7734f1e1-c8e6-436d-b3da-6d60344706da | default              |     0 |    1000 | /dev/drbd1000 | 148.60 MiB | Unused | UpToDate |
| l02  | pvc-7734f1e1-c8e6-436d-b3da-6d60344706da | default              |     0 |    1000 | /dev/drbd1000 | 148.60 MiB | Unused | UpToDate |
| l03  | pvc-7734f1e1-c8e6-436d-b3da-6d60344706da | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |            | InUse  | Diskless |
+-------------------------------------------------------------------------------------------------------------------------------------------+
vagrant@l01:~$ k get pod nginx-deploy-bf9dcc9c9-zw2nq -o wide
NAME                           READY   STATUS    RESTARTS   AGE    IP               NODE   NOMINATED NODE   READINESS GATES
nginx-deploy-bf9dcc9c9-zw2nq   1/1     Running   0          152m   192.168.126.66   l03    <none>           <none>

This is a sandbox 3-node cluster with the master on l01.

kvaps commented 4 years ago

That's an interesting issue. Just to be sure:

fondemen commented 4 years ago

Linstor-related services are all located on the master (via nodeSelector), except those belonging to DaemonSets:

vagrant@l01:~/kube-linstor$ k -n linstor get pods -o=custom-columns=NAME:.metadata.name,NODE:.spec.nodeName        
NAME                                               NODE
linstor-db-7dbdd66fc5-qmhz8                        l01
linstor-linstor-controller-0                       l01
linstor-linstor-csi-controller-0                   l01
linstor-linstor-csi-node-2fzj7                     l01
linstor-linstor-csi-node-6g7jm                     l03
linstor-linstor-csi-node-xg7w5                     l02
linstor-linstor-satellite-9fphx                    l01
linstor-linstor-satellite-flvtt                    l03
linstor-linstor-satellite-pwxh4                    l02
linstor-linstor-stork-fcc868d4b-scj8z              l01
linstor-linstor-stork-scheduler-546dd9bbcf-dm28x   l01

We can see the scheduler is indeed invoked, with some ACL errors regarding event creation.

When applying full rights (https://gist.github.com/fondemen/0c5400ba48a1ac2100db7b040b849c03), there are no more event-generation problems, but the pod still lands on the wrong node.
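
By "full rights" I mean something along the lines of a temporary cluster-admin binding for the scheduler's service account (a sketch only; the binding name is made up, the actual one is in the gist):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: stork-scheduler-full-access   # test-only binding, remove it afterwards
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: linstor-linstor-stork-scheduler
  namespace: linstor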

It's weird how it consistently schedules on the wrong node (I've never seen it go to l02, always l03)...

Just to be sure, here is my storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor"
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "2"
  storagePool: "default"

I also tried localStoragePolicy: preferred, but linstor-csi complains as if this parameter didn't exist.

fondemen commented 4 years ago

Same behavior on K8s 1.17.6...

The stork-scheduler still logs errors like the following, which disappear when giving full access:

E0617 15:11:06.358725 1 reflector.go:153] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:linstor:linstor-linstor-stork-scheduler" cannot list resource "configmaps" in API group "" in the namespace "kube-system"

When scheduling my pod (with schedulerName: stork), the stork-scheduler says:

Trace[47690483]: [67.512657ms] [67.500912ms] Computing predicates done
Trace[47690483]: [102.43624ms] [34.922473ms] Prioritizing done

no matter whether I use scheduler image 1.16.9 or 1.17.6...

kvaps commented 4 years ago

Hey, could you check whether the following resources were created in your cluster:

kubectl get clusterrole/linstor-linstor-stork-scheduler -o yaml
kubectl get clusterrolebinding/linstor-linstor-stork-scheduler -o yaml

They are templated from this file: https://github.com/kvaps/kube-linstor/blob/master/helm/kube-linstor/templates/stork-scheduler-rbac.yaml

fondemen commented 4 years ago

K8s 1.17.6, deployed in the linstor namespace. Output of:

kubectl get clusterrole/linstor-linstor-stork-scheduler clusterrolebinding/linstor-linstor-stork-scheduler -o yaml

apiVersion: v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    annotations:
      meta.helm.sh/release-name: linstor
      meta.helm.sh/release-namespace: linstor
    creationTimestamp: "2020-06-17T17:36:09Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: linstor-linstor-stork-scheduler
    resourceVersion: "34192"
    selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/linstor-linstor-stork-scheduler
    uid: cbd78aaf-dfa1-4d7e-a725-b950e1294cc1
  rules:
  - apiGroups:
    - ""
    resources:
    - endpoints
    verbs:
    - get
    - update
  - apiGroups:
    - ""
    resources:
    - configmaps
    verbs:
    - get
  - apiGroups:
    - ""
    resources:
    - events
    verbs:
    - create
    - patch
    - update
  - apiGroups:
    - ""
    resources:
    - endpoints
    verbs:
    - create
  - apiGroups:
    - ""
    resourceNames:
    - kube-scheduler
    resources:
    - endpoints
    verbs:
    - delete
    - get
    - patch
    - update
  - apiGroups:
    - ""
    resources:
    - nodes
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - ""
    resources:
    - pods
    verbs:
    - delete
    - get
    - list
    - watch
  - apiGroups:
    - ""
    resources:
    - bindings
    - pods/binding
    verbs:
    - create
  - apiGroups:
    - ""
    resources:
    - pods/status
    verbs:
    - patch
    - update
  - apiGroups:
    - ""
    resources:
    - replicationcontrollers
    - services
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - '*'
    resources:
    - replicasets
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - apps
    resources:
    - statefulsets
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - policy
    resources:
    - poddisruptionbudgets
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - ""
    resources:
    - persistentvolumeclaims
    - persistentvolumes
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - storage.k8s.io
    resources:
    - storageclasses
    - csinodes
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - coordination.k8s.io
    resources:
    - leases
    verbs:
    - get
    - create
    - update
- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    annotations:
      meta.helm.sh/release-name: linstor
      meta.helm.sh/release-namespace: linstor
    creationTimestamp: "2020-06-17T17:36:09Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: linstor-linstor-stork-scheduler
    resourceVersion: "34195"
    selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/linstor-linstor-stork-scheduler
    uid: 6ce3d636-cd89-42c2-86b4-51753954c12d
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: linstor-linstor-stork-scheduler
  subjects:
  - kind: ServiceAccount
    name: linstor-linstor-stork-scheduler
    namespace: linstor
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
kvaps commented 4 years ago

I just found that the list verb was indeed missing for the stork-scheduler role; I fixed it in https://github.com/kvaps/kube-linstor/commit/d25cbba7b5c873d09589a8ee820c0e22f1248d0d
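
Presumably the missing piece was the list verb on configmaps (that is what the reflector error earlier complains about); roughly:

  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]   # watch added as well, since reflectors typically list and watch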

kvaps commented 4 years ago

Related PR to the upstream project: https://github.com/libopenstorage/stork/pull/629

fondemen commented 4 years ago

New error here:

'events.events.k8s.io is forbidden: User "system:serviceaccount:linstor:linstor-linstor-stork-scheduler" cannot create resource "events" in API group "events.k8s.io" in the namespace "default"' (will not retry!)

and later

User "system:serviceaccount:linstor:linstor-linstor-stork-scheduler" cannot patch resource "events" in API group "events.k8s.io" in the namespace "default"' (will not retry!)

though the "create" verb is there for events (in the core "" API group; the error is about the events.k8s.io group)

fondemen commented 4 years ago

Merely adding

  - apiGroups: ["events.k8s.io"]
    resources: ["events"]
    verbs: ["create", "patch", "update"]

on helm/kube-linstor/templates/stork-scheduler-rbac.yaml seems to solve the issue

but still, my stupid pod is on the wrong node

kvaps commented 4 years ago

Merely adding

  - apiGroups: ["events.k8s.io"]
    resources: ["events"]
    verbs: ["create", "patch", "update"]

Thanks, added events.k8s.io in https://github.com/kvaps/kube-linstor/commit/89f91fa

but still, my stupid pod is on the wrong node

Check your taints:

kubectl get node -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

and your linstor nodes:

linstor n l
fondemen commented 4 years ago

Thanks for your help! Now OK on 1.17 and 1.18!

My taints:

NAME   TAINTS
l01    [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
l02    <none>
l03    <none>

My nodes:

+--------------------------------------------------------+
| Node | NodeType  | Addresses                  | State  |
|========================================================|
| l01  | SATELLITE | 192.168.2.100:3366 (PLAIN) | Online |
| l02  | SATELLITE | 192.168.2.101:3366 (PLAIN) | Online |
| l03  | SATELLITE | 192.168.2.102:3366 (PLAIN) | Online |
+--------------------------------------------------------+
kvaps commented 4 years ago

maybe storage pools?

linstor sp l
fondemen commented 4 years ago
+------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node | Driver   | PoolName    | FreeCapacity | TotalCapacity | CanSnapshots | State |
|============================================================================================================|
| DfltDisklessStorPool | l01  | DISKLESS |             |              |               | False        | Ok    |
| DfltDisklessStorPool | l02  | DISKLESS |             |              |               | False        | Ok    |
| DfltDisklessStorPool | l03  | DISKLESS |             |              |               | False        | Ok    |
| default              | l01  | LVM_THIN | linvg/linlv |    59.75 GiB |     59.75 GiB | True         | Ok    |
| default              | l02  | LVM_THIN | linvg/linlv |    59.75 GiB |     59.75 GiB | True         | Ok    |
| default              | l03  | LVM_THIN | linvg/linlv |    59.75 GiB |     59.75 GiB | True         | Ok    |
+------------------------------------------------------------------------------------------------------------+
kvaps commented 4 years ago

What if you cordon the l03 node: will the pod be scheduled to l02, or will it get stuck in the Pending state?
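
A rough way to run that test (deployment name taken from the earlier pod output; rollout restart used here instead of delete/recreate):

kubectl cordon l03                                # make l03 unschedulable
kubectl rollout restart deployment/nginx-deploy   # reschedule the pod
kubectl get pod -o wide                           # see which node it landed on
kubectl uncordon l03                              # restore l03 afterwards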

fondemen commented 4 years ago

Good point. No, it doesn't get stuck: it's scheduled on l02. Worse: if I uncordon l03, then delete and recreate my deployment, the pod goes to l03 again.

After running more tests with additional deployment + PVC pairs, it seems that pods are scheduled on the right node most of the time, but only most of the time. I also ran some tests installing stork myself and got similar results (though I'm not 100% sure I did it properly).

I set --debug in stork and got these logs:

     1  time="2020-06-18T08:22:16Z" level=debug msg="Nodes in filter request:" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
     2  time="2020-06-18T08:22:16Z" level=debug msg="l02 [{Type:InternalIP Address:192.168.2.101} {Type:Hostname Address:l02}]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
     3  time="2020-06-18T08:22:16Z" level=debug msg="l03 [{Type:InternalIP Address:192.168.2.102} {Type:Hostname Address:l03}]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
     4  time="2020-06-18T08:22:16Z" level=info msg="called: GetPodVolumes(nginx, default)"
     5  time="2020-06-18T08:22:16Z" level=info msg="called: OwnsPVC(test-pvc)"
     6  time="2020-06-18T08:22:16Z" level=info msg="-> yes"
     7  time="2020-06-18T08:22:16Z" level=info msg="called: InspectVolume(pvc-151e8c5c-7e48-462d-90db-ded27f1d5377)"
     8  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377'
     9  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377/resources'
    10  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377/volume-definitions/0'
    11  time="2020-06-18T08:22:16Z" level=info msg="called: GetNodes()"
    12  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/nodes'
    13  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l01  l01 [192.168.2.100]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    14  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l02  l02 [192.168.2.101]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    15  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l01  l01 [192.168.2.100]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    16  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l02  l02 [192.168.2.101]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    17  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l03  l03 [192.168.2.102]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    18  time="2020-06-18T08:22:16Z" level=debug msg="Nodes in filter response:" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    19  time="2020-06-18T08:22:16Z" level=debug msg="l02 [{Type:InternalIP Address:192.168.2.101} {Type:Hostname Address:l02}]"
    20  time="2020-06-18T08:22:16Z" level=debug msg="l03 [{Type:InternalIP Address:192.168.2.102} {Type:Hostname Address:l03}]"
    21  time="2020-06-18T08:22:16Z" level=debug msg="Nodes in prioritize request:" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    22  time="2020-06-18T08:22:16Z" level=debug msg="[{Type:InternalIP Address:192.168.2.101} {Type:Hostname Address:l02}]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    23  time="2020-06-18T08:22:16Z" level=debug msg="[{Type:InternalIP Address:192.168.2.102} {Type:Hostname Address:l03}]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    24  time="2020-06-18T08:22:16Z" level=info msg="called: GetPodVolumes(nginx, default)"
    25  time="2020-06-18T08:22:16Z" level=info msg="called: OwnsPVC(test-pvc)"
    26  time="2020-06-18T08:22:16Z" level=info msg="-> yes"
    27  time="2020-06-18T08:22:16Z" level=info msg="called: InspectVolume(pvc-151e8c5c-7e48-462d-90db-ded27f1d5377)"
    28  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377'
    29  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377/resources'
    30  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377/volume-definitions/0'
    31  time="2020-06-18T08:22:16Z" level=debug msg="Got driverVolumes: [0xc0000dec40]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    32  time="2020-06-18T08:22:16Z" level=info msg="called: GetNodes()"
    33  [DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://linstor-linstor-controller:3371/v1/nodes'
    34  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l01  l01 [192.168.2.100]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    35  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l02  l02 [192.168.2.101]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    36  time="2020-06-18T08:22:16Z" level=debug msg="nodeInfo: &{l03  l03 [192.168.2.102]    Online}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    37  time="2020-06-18T08:22:16Z" level=debug msg="rackMap: map[l01: l02: l03:]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    38  time="2020-06-18T08:22:16Z" level=debug msg="zoneMap: map[l01: l02: l03:]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    39  time="2020-06-18T08:22:16Z" level=debug msg="regionMap: map[l01: l02: l03:]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    40  time="2020-06-18T08:22:16Z" level=debug msg="Volume pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 allocated on nodes:" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    41  time="2020-06-18T08:22:16Z" level=debug msg="ID: l01 Hostname: l01"
    42  time="2020-06-18T08:22:16Z" level=debug msg="ID: l02 Hostname: l02"
    43  time="2020-06-18T08:22:16Z" level=debug msg="ID: l03 Hostname: l03"
    44  time="2020-06-18T08:22:16Z" level=debug msg="Volume pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 allocated on racks: [  ]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    45  time="2020-06-18T08:22:16Z" level=debug msg="Volume pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 allocated in zones: [  ]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    46  time="2020-06-18T08:22:16Z" level=debug msg="Volume pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 allocated in regions: [  ]" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    47  time="2020-06-18T08:22:16Z" level=debug msg="getNodeScore, let's go" node=l02
    48  time="2020-06-18T08:22:16Z" level=debug msg="rack info: &{HostnameMap:map[l01: l02: l03:] PreferredLocality:[  ]}" node=l02
    49  time="2020-06-18T08:22:16Z" level=debug msg="zone info: &{HostnameMap:map[l01: l02: l03:] PreferredLocality:[  ]}" node=l02
    50  time="2020-06-18T08:22:16Z" level=debug msg="region info: &{HostnameMap:map[l01: l02: l03:] PreferredLocality:[  ]}" node=l02
    51  time="2020-06-18T08:22:16Z" level=debug msg="nodeRack: " node=l02
    52  time="2020-06-18T08:22:16Z" level=debug msg="nodeZone: " node=l02
    53  time="2020-06-18T08:22:16Z" level=debug msg="nodeRegion: " node=l02
    54  time="2020-06-18T08:22:16Z" level=debug msg="node match, returning node priority score (100)" node=l02
    55  time="2020-06-18T08:22:16Z" level=debug msg="getNodeScore, let's go" node=l03
    56  time="2020-06-18T08:22:16Z" level=debug msg="rack info: &{HostnameMap:map[l01: l02: l03:] PreferredLocality:[  ]}" node=l03
    57  time="2020-06-18T08:22:16Z" level=debug msg="zone info: &{HostnameMap:map[l01: l02: l03:] PreferredLocality:[  ]}" node=l03
    58  time="2020-06-18T08:22:16Z" level=debug msg="region info: &{HostnameMap:map[l01: l02: l03:] PreferredLocality:[  ]}" node=l03
    59  time="2020-06-18T08:22:16Z" level=debug msg="nodeRack: " node=l03
    60  time="2020-06-18T08:22:16Z" level=debug msg="nodeZone: " node=l03
    61  time="2020-06-18T08:22:16Z" level=debug msg="nodeRegion: " node=l03
    62  time="2020-06-18T08:22:16Z" level=debug msg="node match, returning node priority score (100)" node=l03
    63  time="2020-06-18T08:22:16Z" level=debug msg="Nodes in response:" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    64  time="2020-06-18T08:22:16Z" level=debug msg="{Host:l02 Score:100}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76
    65  time="2020-06-18T08:22:16Z" level=debug msg="{Host:l03 Score:100}" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-wbp76

To me, line 43 is the problem: why is the volume believed to be present on l03?

If I curl the linstor-linstor-controller service (curl -k -X 'GET' -H 'Accept: application/json' 'https://10.109.43.182:3371/v1/resource-definitions/pvc-151e8c5c-7e48-462d-90db-ded27f1d5377/volume-definitions/0'), I get an empty response.

Maybe I should add another node and disable linstor on l01...

fondemen commented 4 years ago

Could it be that DRBD uses master/slave replication, and that as long as the pod cannot be assigned to the master node, it gets scheduled anywhere else? Is there a way to check which node is the master for a volume? Is there a way to prevent the master from being placed on certain nodes?

kvaps commented 4 years ago

It seems there have been a lot of changes in the upstream stork driver since the last update: https://github.com/kvaps/stork/compare/linstor-configurable-endpoint...LINBIT:linstor-driver

I prepared new images with the latest changes:

kvaps/linstor-csi:v1.7.1-3
kvaps/linstor-stork:v1.7.1-3

Please try them, just to check whether it was already fixed there.

kvaps commented 4 years ago

Could it be that DRBD uses master/slave replication, and that as long as the pod cannot be assigned to the master node, it gets scheduled anywhere else?

Stork just looks for diskful resources and schedules your pod onto the same nodes, if possible. You can list them:

linstor r l -r pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 | grep -b Diskless

Is there a way to check which node is the master for a volume? Is there a way to prevent the master from being placed on certain nodes?

If I remember correctly, all diskful resources are somehow "master"; the current primary writes and reads the data on all of them.
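
If you still want to see which node is currently Primary, one option (assuming drbd-utils is available on the node, e.g. inside the satellite container) is:

drbdadm status pvc-151e8c5c-7e48-462d-90db-ded27f1d5377   # shows role:Primary/Secondary per node

The InUse column in linstor r l / linstor v l reflects the same information.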

fondemen commented 4 years ago

No change with 1.7.1-3.

I tried linstor v l -a before running my pod and got:

+---------------------------------------------------------------------------------------------------------------------------------------------+
| Node | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    |  Allocated | InUse  |      State |
|=============================================================================================================================================|
| l01  | pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 | default              |     0 |    1000 | /dev/drbd1000 | 148.60 MiB | Unused |   UpToDate |
| l02  | pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 | default              |     0 |    1000 | /dev/drbd1000 | 148.60 MiB | Unused |   UpToDate |
| l03  | pvc-151e8c5c-7e48-462d-90db-ded27f1d5377 | DfltDisklessStorPool |     0 |    1000 | /dev/drbd1000 |            | Unused | TieBreaker |
+---------------------------------------------------------------------------------------------------------------------------------------------+

This means that l03 plays a role for my volume: TieBreaker, which would explain why it's considered schedulable. This might be an issue in the linstor driver for stork.

I guess I need to perform more tests with more nodes.

kvaps commented 4 years ago

No change with 1.7.1-3.

But is it working fine?

This means that l03 plays a role for my volume: TieBreaker, which would explain why it's considered schedulable. This might be an issue in the linstor driver for stork.

Yep, try temporarily disabling the tiebreaker:

linstor c sp DrbdOptions/auto-add-quorum-tiebreaker False

and delete this resource:

linstor r d l03 pvc-151e8c5c-7e48-462d-90db-ded27f1d5377
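
To re-enable the automatic tiebreaker later, presumably the same property can be set back:

linstor c sp DrbdOptions/auto-add-quorum-tiebreaker True
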
fondemen commented 4 years ago

I've added 2 more nodes and disabled the tiebreaker and... the pod is scheduled on l02!!! I'll run more tests, but it looks good! Yes, with 1.7.1-3.

fondemen commented 4 years ago

I confirm. I've started 3 more deployments and all are scheduled on a proper node. I'm using your new images and running K8s 1.18.4. Now it might be useful to report an issue to the linstor stork driver project.

kvaps commented 4 years ago

Their issues board is closed, but I think you can try reporting it to the golinstor project or directly to the drbd-user@lists.linbit.com mailing list.

kvaps commented 4 years ago

It seems the upstream bug is fixed. I just rebuilt the images and updated the Helm chart; the changes are already in master, FYI.

fondemen commented 4 years ago

Thanks. But stork is no longer working. I get plenty of:

2020/06/22 19:48:59 failed to create cluster domains status object for driver linstor: failed to query linstor controller properties: Get "https://localhost:3371/v1/controller/properties": dial tcp 127.0.0.1:3371: connect: connection refused Next retry in: 10s
time="2020-06-22T19:49:09Z" level=info msg="called: GetClusterID()"
[DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://localhost:3371/v1/controller/properties'
time="2020-06-22T19:49:09Z" level=info msg="called: String()"

Of course, scheduling fails:

[DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://localhost:3371/v1/resource-definitions/pvc-9877167d-dea9-4333-955c-e4c5b30e73f4'
time="2020-06-22T20:06:24Z" level=info msg="called: GetPodVolumes(nginx, default)"
time="2020-06-22T20:06:24Z" level=info msg="called: OwnsPVC(test-pvc)"
time="2020-06-22T20:06:24Z" level=info msg="-> yes"
time="2020-06-22T20:06:24Z" level=info msg="called: InspectVolume(pvc-9877167d-dea9-4333-955c-e4c5b30e73f4)"
[DEBUG] curl -X 'GET' -H 'Accept: application/json' 'https://localhost:3371/v1/nodes'
time="2020-06-22T20:06:24Z" level=info msg="called: GetNodes()"
time="2020-06-22T20:06:24Z" level=error msg="Error getting list of driver nodes, returning all nodes: failed to get linstor nodes: Get \"https://localhost:3371/v1/nodes\": dial tcp 127.0.0.1:3371: connect: connection refused" Namespace=default Owner=ReplicaSet/nginx-deploy-6ffc789457 PodName=nginx-deploy-6ffc789457-79sk8
time="2020-06-22T20:06:24Z" level=info msg="called: GetPodVolumes(nginx, default)"
time="2020-06-22T20:06:24Z" level=info msg="called: OwnsPVC(test-pvc)"
time="2020-06-22T20:06:24Z" level=info msg="-> yes"

localhost is clearly the problem here, even though LS_ENDPOINT is properly set to 'https://linstor-linstor-controller:3371'.

I guess there is a regression here...

kvaps commented 4 years ago

You're right: in https://github.com/LINBIT/stork/commit/854a531a893939ded589ac2da825791854980463, LS_ENDPOINT was changed to the built-in LS_CONTROLLERS. Fixed in https://github.com/kvaps/kube-linstor/commit/3df6c062d29d33ca2f2fb3ac891ef9bd5c07379b and tested; stork is now working fine for me.
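
For anyone patching a deployed release by hand, the change boils down to setting the new variable on the stork container; a rough sketch of the env block only (the endpoint value is the one quoted above):

        env:
        - name: LS_CONTROLLERS                              # the newer driver reads LS_CONTROLLERS instead of LS_ENDPOINT
          value: "https://linstor-linstor-controller:3371"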

Test instance:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: linstor-volume-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
  storageClassName: linstor-1
---
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  namespace: default
spec:
  schedulerName: stork
  containers:
  - name: fedora
    image: fedora
    command: [/bin/bash]
    args: ["-c", "while true; do sleep 10; done"]
    volumeMounts:
    - name: linstor-volume-pvc
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: linstor-volume-pvc
    persistentVolumeClaim:
      claimName: "linstor-volume-pvc"
fondemen commented 4 years ago

Yessss ! Thanks a lot !