kadalu / kadalu

A lightweight persistent storage solution for Kubernetes / OpenShift / Nomad using GlusterFS in the background. More information at https://kadalu.tech
https://docs.kadalu.tech/k8s-storage/devel/quick-start/

pvc's stuck in pending state #167

Closed boconnell2210 closed 3 years ago

boconnell2210 commented 4 years ago

OS Version:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

When installing kadalu, I am able to see some PVCs created, but others are stuck in the Pending state.

(screenshot)

All of the kadalu pods seem to be in a good state:

(screenshot)

I tried deleting the two rsyslog PVCs to see if that would trigger re-creation, but no luck.

Describing a pending PVC shows it is waiting for a volume to be created:

[root@node1 ~]# kubectl describe pvc -n turbonomic api
Name:          api
Namespace:     turbonomic
StorageClass:  kadalu.replica1
Status:        Pending
Volume:
Labels:        app.kubernetes.io/instance=xl-example
               app.kubernetes.io/managed-by=Helm
               app.kubernetes.io/name=api
               zone=dmz
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kadalu
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Events:
  Type       Reason                Age                       From                         Message
  ----       ------                ----                      ----                         -------
  Normal     ExternalProvisioning  3m48s (x1761 over 7h23m)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "kadalu" or manually created by system administrator
Mounted By:  api-7bd9c7fc8f-5n8kz
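Pending PVCs like this one can also be enumerated programmatically rather than eyeballing `kubectl get pvc` output. A minimal sketch, assuming only the standard shape of `kubectl get pvc -o json` (`items[].status.phase`); the stubbed sample below stands in for a live cluster:

```python
import json
import subprocess

def pending_pvcs(namespace="turbonomic", raw_json=None):
    """Return the names of PVCs stuck in Pending.

    If raw_json is None, shell out to kubectl (requires cluster access);
    otherwise parse the supplied JSON, which lets the logic run offline.
    """
    if raw_json is None:
        raw_json = subprocess.check_output(
            ["kubectl", "get", "pvc", "-n", namespace, "-o", "json"])
    data = json.loads(raw_json)
    return [item["metadata"]["name"]
            for item in data["items"]
            if item["status"]["phase"] == "Pending"]

# Stubbed kubectl response mirroring the two PVCs described above:
sample = json.dumps({"items": [
    {"metadata": {"name": "api"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "zookeeper-data"}, "status": {"phase": "Bound"}},
]})
print(pending_pvcs(raw_json=sample))  # → ['api']
```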

On a pvc that is bound:

[root@node1 ~]# kubectl describe pvc -n turbonomic zookeeper-data
Name:          zookeeper-data
Namespace:     turbonomic
StorageClass:  kadalu.replica1
Status:        Bound
Volume:        pvc-0d50d0c1-4471-11ea-8cad-005056b8c671
Labels:        app.kubernetes.io/instance=xl-example
               app.kubernetes.io/managed-by=Helm
               app.kubernetes.io/name=zookeeper-data
               zone=internal
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kadalu
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      3Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Events:        <none>
Mounted By:    zookeeper-fdb9586cc-wm69p

I am not sure where to look next. If any additional information is needed, please let me know.

I see in one of the logs:

[root@node1 ~]# kubectl logs  -f -n kadalu csi-provisioner-0 kadalu-logging
+------------------------------------------------------------------------------+
[2020-01-31 21:31:34.353524] I [MSGID: 101190] [event-epoll.c:679:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2020-01-31 21:31:34.353956] I [MSGID: 101190] [event-epoll.c:679:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2020-01-31 21:31:34.354327] I [MSGID: 114057] [client-handshake.c:1373:select_server_supported_programs] 0-storage-pool-1-client: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2020-01-31 21:31:34.357493] I [MSGID: 114046] [client-handshake.c:1104:client_setvolume_cbk] 0-storage-pool-1-client: Connected to storage-pool-1-client, attached to remote volume '/bricks/storage-pool-1/data/brick'.
[2020-01-31 21:31:34.360932] I [fuse-bridge.c:5162:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.23
[2020-01-31 21:31:34.360967] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0
[2020-02-02 23:34:01.484808] I [MSGID: 114018] [client.c:2341:client_rpc_notify] 0-storage-pool-1-client: disconnected from storage-pool-1-client. Client process will keep trying to connect to glusterd until brick's port is available
[2020-02-02 23:34:01.487895] I [MSGID: 114018] [client.c:2341:client_rpc_notify] 0-storage-pool-1-client: disconnected from storage-pool-1-client. Client process will keep trying to connect to glusterd until brick's port is available
[2020-02-02 23:34:01.488306] I [MSGID: 114018] [client.c:2341:client_rpc_notify] 0-storage-pool-1-client: disconnected from storage-pool-1-client. Client process will keep trying to connect to glusterd until brick's port is available
[2020-02-03 20:25:45.966898] I [MSGID: 100030] [glusterfsd.c:2865:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 7.2 (args: /usr/sbin/glusterfs --process-name fuse -l /var/log/gluster/gluster.log --volfile-id storage-pool-1 -f /kadalu/volfiles/storage-pool-1.client.vol /mnt/storage-pool-1)
[2020-02-03 20:25:47.780084] I [glusterfsd.c:2593:daemonize] 0-glusterfs: Pid of current running process is 33
[2020-02-03 20:25:56.464906] I [rpc-clnt.c:1013:rpc_clnt_connection_init] 0-storage-pool-1-client: setting frame-timeout to 1800
[2020-02-03 20:25:58.818872] W [socket.c:4566:socket_init] 0-storage-pool-1-client: disabling non-blocking IO
[2020-02-03 20:25:58.819469] I [MSGID: 114020] [client.c:2434:notify] 0-storage-pool-1-client: parent translators are ready, attempting connect on transport
[2020-02-03 20:26:05.823593] W [MSGID: 101012] [common-utils.c:3214:gf_get_reserved_ports] 0-glusterfs: could not open the file /proc/sys/net/ipv4/ip_local_reserved_ports for getting reserved ports info [No such file or directory]
[2020-02-03 20:26:06.656941] W [MSGID: 101081] [common-utils.c:3254:gf_process_reserved_ports] 0-glusterfs: Not able to get reserved ports, hence there is a possibility that glusterfs may consume reserved port

If there are any other logs that are needed, please let me know.

amarts commented 4 years ago

kubectl logs -n kadalu csi-provisioner-0 --all-containers would give some more info.

boconnell2210 commented 4 years ago

logs.zip

amarts commented 4 years ago

Thanks for the logs @boconnell2210. I suspect a few things; I will check them and get back. Meanwhile, the best option right now is to try RWX. It is a different volume config (i.e., the perf xlators are off), so it would work even for workloads that need RWO.

aravindavk commented 4 years ago

I noticed some errors related to storage being full and some related to provisioning. I will look into this in detail.

E0203 20:31:17.244710       1 controller.go:700] error syncing claim "turbonomic/arangodb": failed to provision volume with StorageClass "kadalu.replica1": rpc error: code = ResourceExhausted desc = No Hosting Volumes available, add more storage
W0203 20:31:22.123989       1 controller.go:685] Retrying syncing claim "turbonomic/api" because failures 1 < threshold 15
E0203 20:31:22.124092       1 controller.go:700] error syncing claim "turbonomic/api": failed to provision volume with StorageClass "kadalu.replica1": rpc error: code = Unknown desc = Exception calling application: [1] b'' b'mkfs.xfs: /mnt/storage-pool-1/virtblock/58/d0/pvc-0d529361-4471-11ea-8cad-005056b8c671 appears to contain an existing filesystem (xfs).\nmkfs.xfs: Use the -f option to force overwrite.'
I0203 20:31:22.276812       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"turbonomic", Name:"api", UID:"0d529361-4471-11ea-8cad-005056b8c671", APIVersion:"v1", ResourceVersion:"1561", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "kadalu.replica1": rpc error: code = Unknown desc = Exception calling application: [1] b'' b'mkfs.xfs: /mnt/storage-pool-1/virtblock/58/d0/pvc-0d529361-4471-11ea-8cad-005056b8c671 appears to contain an existing filesystem (xfs).\nmkfs.xfs: Use the -f option to force overwrite.'
W0203 20:31:22.319041       1 controller.go:685] Retrying syncing claim "turbonomic/arangodb-apps" because failures 1 < threshold 15
E0203 20:31:22.319118       1 controller.go:700] error syncing claim "turbonomic/arangodb-apps": failed to provision volume with StorageClass "kadalu.replica1": rpc error: code = Unknown desc = Exception calling application: [1] b'' b'mkfs.xfs: /mnt/storage-pool-1/virtblock/46/4f/pvc-0d529cdc-4471-11ea-8cad-005056b8c671 appears to contain an existing filesystem (xfs).\nmkfs.xfs: Use the -f option to force overwrite.'
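The `mkfs.xfs` failures above indicate the provisioner is re-running `mkfs.xfs` on a backing file left over from an earlier, partially completed provisioning attempt. A hypothetical sketch of how a retry could detect that (illustration only, not kadalu's actual code): the XFS superblock begins with the magic bytes `XFSB` at offset 0, so a cheap pre-check can decide whether `mkfs` is still needed:

```python
import tempfile

XFS_MAGIC = b"XFSB"  # magic bytes at offset 0 of an XFS superblock

def needs_mkfs(path):
    """Return False if the backing file already carries an XFS filesystem,
    so a provisioning retry could skip mkfs.xfs instead of failing.
    Hypothetical helper for illustration only."""
    try:
        with open(path, "rb") as f:
            return f.read(4) != XFS_MAGIC
    except FileNotFoundError:
        return True  # nothing there yet: mkfs is required

# Simulate a retry hitting a file from a half-finished earlier attempt:
with tempfile.NamedTemporaryFile(suffix=".img") as img:
    img.write(XFS_MAGIC + b"\0" * 508)
    img.flush()
    print(needs_mkfs(img.name))  # → False: skip mkfs, reuse the filesystem
```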
boconnell2210 commented 4 years ago

Thanks. Storage should not be full, as this was a fresh install. Anything you guys need from me, please let me know. I am not sure RWX is an option for us at this point.

aravindavk commented 4 years ago

Exec into the server pod using:

$ kubectl exec -it server-storage-pool-1-0-node1-0 /bin/bash -c glusterfsd -n kadalu

and provide us the output of

$ cat /bricks/storage-pool-1/data/brick/.stat
boconnell2210 commented 4 years ago

[root@server-storage-pool-1-0-node1-0 /]# cat /bricks/storage-pool-1/data/brick/.stat
{"size": 265587433472, "free_size": 7889395712}```
aravindavk commented 4 years ago

Thanks for the update. I was suspecting a wrong value for the available size due to a duplicate update of the stat file. That is ruled out now. Will look into other areas where it can fail.

amarts commented 4 years ago

@aravindavk isn't the content saying almost all of the available size is used up?

{"size": 265587433472, "free_size": 7889395712} (i.e., only 7,889,395,712 bytes ~= 7.8 GB out of 265 GB is available).

boconnell2210 commented 4 years ago

I am doing a fresh install now.

fdisk on /dev/sdb

Disk /dev/sdb: 268.4 GB, 268435456000 bytes, 524288000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Now, I follow the install guide, set up the kadalu namespace, and assign the storage:

kubectl get pod -n kadalu
NAME                                   READY   STATUS    RESTARTS   AGE
csi-nodeplugin-pvpzm                   3/3     Running   0          46m
csi-provisioner-0                      4/4     Running   0          46m
operator-5c8d499847-486xw              1/1     Running   0          48m
server-turbo-storage-pool1-0-node1-0   2/2     Running   0          3m31s
[turbo@node1 bin]$ kubectl get sc
NAME                        PROVISIONER   AGE
kadalu                      kadalu        61m
kadalu.replica1 (default)   kadalu        61m
kadalu.replica3             kadalu        61m
kubectl describe sc -n kadalu kadalu.replica1
Name:                  kadalu.replica1
IsDefaultClass:        Yes
Annotations:           storageclass.kubernetes.io/is-default-class=true
Provisioner:           kadalu
Parameters:            hostvol_type=Replica1
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

Going to bring up my application now and bind to the storage. It is taking a while to bind the PVCs:

kubectl get pvc
NAME                   STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
api                    Bound     pvc-5cc956a1-4d14-11ea-879c-005056b84b91   1Gi        RWO            kadalu.replica1   10m
api-certs              Pending                                                                        kadalu.replica1   10m
arangodb               Pending                                                                        kadalu.replica1   10m
arangodb-apps          Bound     pvc-5cc7b7bd-4d14-11ea-879c-005056b84b91   2Gi        RWO            kadalu.replica1   10m
arangodb-dump          Pending                                                                        kadalu.replica1   10m
auth                   Bound     pvc-5cc847ad-4d14-11ea-879c-005056b84b91   1Gi        RWO            kadalu.replica1   10m
consul-data            Pending                                                                        kadalu.replica1   10m
kafka-log              Pending                                                                        kadalu.replica1   10m
rsyslog-auditlogdata   Bound     pvc-5cc8557e-4d14-11ea-879c-005056b84b91   30Gi       RWO            kadalu.replica1   10m
rsyslog-syslogdata     Bound     pvc-5cc72442-4d14-11ea-879c-005056b84b91   30Gi       RWO            kadalu.replica1   10m
topology-processor     Pending                                                                        kadalu.replica1   10m
zookeeper-data         Bound     pvc-5cc72cb3-4d14-11ea-879c-005056b84b91   3Gi        RWO            kadalu.replica1   10m

Currently, looking at the storage pod:

cat /bricks/turbo-storage-pool1/data/brick/.stat
{"size": 265587433472, "free_size": 135664672768}

Here is a description on one of the pending volumes:
VolumeMode:    Filesystem
Events:
  Type       Reason                Age                  From                                                           Message
  ----       ------                ----                 ----                                                           -------
  Normal     Provisioning          4m32s (x7 over 22m)  kadalu_csi-provisioner-0_f9bebcea-4d07-11ea-ad4f-b2540224bdc6  External provisioner is provisioning volume for claim "turbonomic/kafka-log"
  Warning    ProvisioningFailed    4m31s (x7 over 20m)  kadalu_csi-provisioner-0_f9bebcea-4d07-11ea-ad4f-b2540224bdc6  failed to provision volume with StorageClass "kadalu.replica1": rpc error: code = Unknown desc = Exception calling application: [1] b'' b'mkfs.xfs: /mnt/turbo-storage-pool1/virtblock/3b/81/pvc-5cc803b2-4d14-11ea-879c-005056b84b91 appears to contain an existing filesystem (xfs).\nmkfs.xfs: Use the -f option to force overwrite.'
  Normal     ExternalProvisioning  3m9s (x84 over 23m)  persistentvolume-controller                                    waiting for a volume to be created, either by external provisioner "kadalu" or manually created by system administrator
Mounted By:  kafka-766cff9f5-85v4l

Output of kubectl logs -n kadalu csi-provisioner-0 --all-containers is attached:
csi-provision-logs.zip (https://github.com/kadalu/kadalu/files/4188614/csi-provision-logs.zip)

Is there anything else I can provide?
stale[bot] commented 4 years ago

Thank you for your contributions. Noticed that this issue has been idle for 180 days! There is a possibility that this issue is already fixed in later releases. Please upgrade and check! If I don't hear any update on this issue in the next 2 weeks, I will be closing it. That doesn't mean one can't re-open the issue! Just comment on the issue and click 'Reopen' if you still have the issue.

aravindavk commented 4 years ago

PV size accounting was rewritten with the PR https://github.com/kadalu/kadalu/pull/268

This can be closed after 0.8.0 release.

amarts commented 3 years ago

With the 0.7.6 release, most of these issues are fixed. Please reopen (or create another issue) if any bugs are seen.