shuo-wu opened 3 years ago
I have checked the support bundle (the user linked it to issue #2829 in the email). The attachment looks fine on the Longhorn side. Here is the analysis:
For the RWX volume alertmanager-data:
2021-08-10T09:58:44.518418375Z I0810 09:58:44.517594 1 connection.go:183] GRPC request: {"node_id":"sky-210015-ds","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":5}},"volume_context":{"diskSelector":"","nodeSelector":"","numberOfReplicas":"3","staleReplicaTimeout":"20"},"volume_id":"alertmanager-data"}
2021-08-10T09:58:44.519989195Z time="2021-08-10T09:58:44Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"sky-210015-ds\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"diskSelector\":\"\",\"nodeSelector\":\"\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"20\"},\"volume_id\":\"alertmanager-data\"}"
2021-08-10T09:58:49.404877319Z time="2021-08-10T09:58:49Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"alertmanager-data\", UID:\"6de827c9-4482-49d6-b308-2ae7b919d6d2\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"21360477\", FieldPath:\"\"}): type: 'Normal' reason: 'Attached' volume alertmanager-data has been attached to sky-210004-ds"
2021-08-10T09:58:44.657043679Z time="2021-08-10T09:58:44Z" level=info msg="Created pod for share manager" controller=longhorn-share-manager node=sky-210004-ds owner=sky-210004-ds pod=share-manager-alertmanager-data shareManager=alertmanager-data state=starting volume=alertmanager-data
The problem is that your cluster somehow spent about 4 minutes starting this share manager pod (from 2021-08-10T09:58:44 to 2021-08-10T10:02:55). There is no error log for this share manager pod on the Longhorn side, which means the share manager workflow ran as expected.
Here is the pod state. No failure or error during initialization:
status:
  phase: Running
  conditions:
  - type: Initialized
    status: "True"
    lastprobetime: "0001-01-01T00:00:00Z"
    lasttransitiontime: "2021-08-10T10:02:05Z"
    reason: ""
    message: ""
  - type: Ready
    status: "True"
    lastprobetime: "0001-01-01T00:00:00Z"
    lasttransitiontime: "2021-08-10T10:02:55Z"
    reason: ""
    message: ""
  - type: ContainersReady
    status: "True"
    lastprobetime: "0001-01-01T00:00:00Z"
    lasttransitiontime: "2021-08-10T10:02:55Z"
    reason: ""
    message: ""
  - type: PodScheduled
    status: "True"
    lastprobetime: "0001-01-01T00:00:00Z"
    lasttransitiontime: "2021-08-10T10:02:05Z"
    reason: ""
    message: ""
  message: ""
  reason: ""
  nominatednodename: ""
  hostip: 172.20.0.2
  podip: 10.42.3.220
  podips:
  - ip: 10.42.3.220
  starttime: "2021-08-10T10:02:05Z"
  initcontainerstatuses: []
  containerstatuses:
  - name: share-manager
    state:
      waiting: null
      running:
        startedat: "2021-08-10T10:02:06Z"
      terminated: null
    lastterminationstate:
      waiting: null
      running: null
      terminated: null
    ready: true
    restartcount: 0
    image: longhornio/longhorn-share-manager:v1_20210416
    imageid: docker-pullable://longhornio/longhorn-share-manager@sha256:c0528766fa0fd8c21632e76b0a79844fa484b3f58c53c855ac4575f4076abad1
    containerid: docker://c40ef6a3b5a25665e5a3a19af96bcdb2241027324cdd3d69cc587ebcfbf63208
    started: true
  qosclass: BestEffort
  ephemeralcontainerstatuses: []
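For reference, a sketch of how to pull this pod state and the related events directly from the cluster (the namespace and pod name are taken from the logs above):

# Show the share manager pod status and any scheduling or image-pull delays
kubectl -n longhorn-system get pod share-manager-alertmanager-data -o yaml
kubectl -n longhorn-system describe pod share-manager-alertmanager-data
# The pod events usually reveal where the 4 minutes went (scheduling, image pull, CNI, mounts)
kubectl -n longhorn-system get events --field-selector involvedObject.name=share-manager-alertmanager-data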
The next thing to check is what happened on the node between 2021-08-10T09:58 and 2021-08-10T10:03.
cc @qdupuy
Environment:
Longhorn version: v1.1.2
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE v1.20.9-rancher1-1
Number of management nodes in the cluster: 3
Number of worker nodes in the cluster: 1
Node config
OS type and version: Ubuntu 21.04
CPU per node: 12 CPUs
Memory per node: 32 GB
Disk type (e.g. SSD/NVMe): NVMe
Network bandwidth between the nodes: 1000 Mbps
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Bare-metal
Number of Longhorn volumes in the cluster: 20
I was able to reproduce the bug with the RWO mode.
I set up 2 replicas and migrated my pods to another node by setting the "nodeName: server" field (a sketch follows).
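A minimal sketch of that migration step, assuming the workload is a Deployment (the deployment, namespace, and node names here are illustrative): pinning spec.template.spec.nodeName forces the replacement pods onto the target node while the old pod may still hold the RWO volume.

# Pin the pod template to another node (names are hypothetical)
kubectl -n default patch deployment my-workload --type merge \
  -p '{"spec":{"template":{"spec":{"nodeName":"sky-210015-ds"}}}}'
# Watch the rollout; the Multi-Attach warning shows up while the old pod still holds the volume
kubectl -n default get pods -o wide -w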
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedAttachVolume 14m attachdetach-controller Multi-Attach error for volume "pvc-a11eda9a-12ac-401e-96ec-1664690ce302" Volume is already used by pod(s) X-8658649d77-76jpb
Warning FailedMount 11m kubelet Unable to attach or mount volumes: unmounted volumes=[X], unattached volumes=[default-token-n56xz X]: timed out waiting for the condition
Warning FailedMount 71s (x3 over 9m12s) kubelet Unable to attach or mount volumes: unmounted volumes=[X], unattached volumes=[X default-token-n56xz]: timed out waiting for the condition
What more do you need so that we can correct the problem?
The case you reproduced (pod migration) is different from the original issue... We can check the pod migration case later. For now, can you provide the kubelet log on node sky-210004-ds? The log should contain the messages from around 2021-08-10T09:58Z to 2021-08-10T10:03Z.
Some logs:
Aug 10 06:01:05 eu-west-1 kernel: [1525043.747111] EXT4-fs warning (device sdg): htree_dirblock_to_tree:993: inode #12: lblock 0: comm kubelet: error -5 reading directory block
Aug 10 06:01:05 eu-west-1 kernel: [1525043.747245] EXT4-fs warning (device sde): htree_dirblock_to_tree:993: inode #12: lblock 0: comm kubelet: error -5 reading directory block
Aug 10 06:01:05 eu-west-1 kernel: [1525043.751945] EXT4-fs warning (device sde): htree_dirblock_to_tree:993: inode #12: lblock 0: comm kubelet: error -5 reading directory block
Aug 10 06:01:05 eu-west-1 kernel: [1525043.752429] EXT4-fs warning (device sdg): htree_dirblock_to_tree:993: inode #12: lblock 0: comm kubelet: error -5 reading directory block
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-e07c8dbc\x2d665f\x2d4e25\x2d9da4\x2da5af096b7674-volume\x2dsubpaths-pvc\x2d2074432e\x2d05d2\x2d4c6e\x2db30c\x2de3c5c28bce67-volume\x2dpermissions-0.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-8aba8446\x2d5bc8\x2d4381\x2d99a2\x2d5bc1aa8c1ee6-volume\x2dsubpaths-pvc\x2dd62d4874\x2d0769\x2d4603\x2d9b29\x2dce89fa4c57fe-volume\x2dpermissions-0.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-8aba8446\x2d5bc8\x2d4381\x2d99a2\x2d5bc1aa8c1ee6-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dsj4cz.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-e07c8dbc\x2d665f\x2d4e25\x2d9da4\x2da5af096b7674-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dw6xkl.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-e07c8dbc\x2d665f\x2d4e25\x2d9da4\x2da5af096b7674-volume\x2dsubpaths-pvc\x2d2074432e\x2d05d2\x2d4c6e\x2db30c\x2de3c5c28bce67-wordpress-0.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-8aba8446\x2d5bc8\x2d4381\x2d99a2\x2d5bc1aa8c1ee6-volume\x2dsubpaths-pvc\x2dd62d4874\x2d0769\x2d4603\x2d9b29\x2dce89fa4c57fe-wordpress-0.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-e07c8dbc\x2d665f\x2d4e25\x2d9da4\x2da5af096b7674-volumes-kubernetes.io\x7ecsi-pvc\x2d2074432e\x2d05d2\x2d4c6e\x2db30c\x2de3c5c28bce67-mount.mount: Succeeded.
Aug 10 06:01:05 eu-west-1 systemd[1]: var-lib-kubelet-pods-8aba8446\x2d5bc8\x2d4381\x2d99a2\x2d5bc1aa8c1ee6-volumes-kubernetes.io\x7ecsi-pvc\x2dd62d4874\x2d0769\x2d4603\x2d9b29\x2dce89fa4c57fe-mount.mount: Succeeded.
Aug 10 06:02:41 eu-west-1 systemd[1]: var-lib-kubelet-pods-8aba8446\x2d5bc8\x2d4381\x2d99a2\x2d5bc1aa8c1ee6-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dsj4cz.mount: Succeeded.
Aug 10 06:03:30 eu-west-1 systemd[1]: var-lib-kubelet-pods-e07c8dbc\x2d665f\x2d4e25\x2d9da4\x2da5af096b7674-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dw6xkl.mount: Succeeded.
Aug 10 06:18:45 eu-west-1 systemd[1]: var-lib-kubelet-pods-22ad2309\x2dfaf1\x2d4651\x2d84af\x2de20132fcffc6-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dsj4cz.mount: Succeeded.
Aug 10 06:19:33 eu-west-1 systemd[1]: var-lib-kubelet-pods-d975b0c3\x2de4e8\x2d4090\x2d99c5\x2dcd6f03f8bf70-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dw6xkl.mount: Succeeded.
Aug 10 06:25:10 eu-west-1 systemd[1]: var-lib-kubelet-pods-c086c029\x2d81ab\x2d4f79\x2dbd98\x2db9456ded9468-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dw6xkl.mount: Succeeded.
Aug 10 06:25:10 eu-west-1 systemd[1]: var-lib-kubelet-pods-e3cd18cc\x2db300\x2d41d5\x2da7d5\x2d8d909498dc07-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dsj4cz.mount: Succeeded.
Aug 10 06:34:00 eu-west-1 systemd[1]: var-lib-kubelet-pods-11b4bc63\x2d3486\x2d47ab\x2db1f9\x2d0bb4935aad90-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dsj4cz.mount: Succeeded.
Aug 10 06:35:36 eu-west-1 systemd[1]: var-lib-kubelet-pods-48c17ee7\x2dd214\x2d495d\x2db252\x2d260ec4614d17-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dw6xkl.mount: Succeeded.
Aug 10 06:42:01 eu-west-1 systemd[1]: var-lib-kubelet-pods-1a2d8a07\x2d8eea\x2d404a\x2db755\x2d937cdedec814-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dsj4cz.mount: Succeeded.
Aug 10 06:43:38 eu-west-1 systemd[1]: var-lib-kubelet-pods-24d48432\x2d7379\x2d4de9\x2d9614\x2d5688127cea6d-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2dw6xkl.mount: Succeeded.
Aug 10 10:37:51 eu-west-1 systemd[1]: var-lib-kubelet-pods-88aefca1\x2d25d8\x2d403b\x2d94f7\x2d9a102fd2e4e6-volumes-kubernetes.io\x7esecret-grafana\x2dtoken\x2dmhprw.mount: Succeeded.
Aug 10 10:37:51 eu-west-1 systemd[1]: var-lib-kubelet-pods-88aefca1\x2d25d8\x2d403b\x2d94f7\x2d9a102fd2e4e6-volumes-kubernetes.io\x7esecret-grafana\x2dauth\x2dgoogle\x2dsecret\x2dmount.mount: Succeeded.
Aug 10 10:38:44 eu-west-1 systemd[1]: var-lib-kubelet-pods-88aefca1\x2d25d8\x2d403b\x2d94f7\x2d9a102fd2e4e6-volumes-kubernetes.io\x7ecsi-grafana\x2ddata-mount.mount: Succeeded.
Aug 10 10:58:43 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volumes-kubernetes.io\x7esecret-grafana\x2dtoken\x2dmhprw.mount: Succeeded.
Aug 10 10:58:43 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volumes-kubernetes.io\x7esecret-grafana\x2dauth\x2dgoogle\x2dsecret\x2dmount.mount: Succeeded.
Aug 10 10:58:43 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volume\x2dsubpaths-config-grafana-0.mount: Succeeded.
Aug 10 10:58:43 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volumes-kubernetes.io\x7ecsi-grafana\x2ddata-mount.mount: Succeeded.
Aug 10 10:58:43 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volume\x2dsubpaths-config-grafana-2.mount: Succeeded.
Aug 10 11:00:20 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volumes-kubernetes.io\x7esecret-grafana\x2dauth\x2dgoogle\x2dsecret\x2dmount.mount: Succeeded.
Aug 10 11:00:20 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volumes-kubernetes.io\x7esecret-grafana\x2dtoken\x2dmhprw.mount: Succeeded.
Aug 10 11:00:20 eu-west-1 systemd[1]: var-lib-kubelet-pods-2a6bcf49\x2d7d44\x2d427d\x2dbce6\x2dd9d810557429-volumes-kubernetes.io\x7ecsi-grafana\x2ddata-mount.mount: Succeeded.
Aug 10 11:43:06 eu-west-1 systemd[1]: Started Kubernetes transient mount for /var/lib/kubelet/pods/5c2e193a-a608-441e-bf93-50525a35c509/volumes/kubernetes.io~csi/pvc-82910441-daa5-4179-8da7-e1283a8030b6/mount.
Aug 10 11:44:27 eu-west-1 systemd[1]: var-lib-kubelet-pods-2bc35293\x2d588d\x2d43c9\x2d8f7d\x2da360ba0700c2-volumes-kubernetes.io\x7esecret-longhorn\x2dservice\x2daccount\x2dtoken\x2dvg5sn.mount: Succeeded.
sky-210004-ds is a worker node.
The command docker logs kubelet --since 2021-08-10T09:58Z --until 2021-08-10T18:03Z returns nothing on sky-210004-ds.
However, on the other nodes I do get output. Do you want it?
The timestamps I provided may not match the expected format. There should be kubelet logs during pod creation.
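A sketch of the command with the intended five-minute window, assuming RKE runs the kubelet as a Docker container named kubelet and the node clock is on UTC (if not, widen the window or use local time):

# Kubelet logs go to stderr, hence the 2>&1; timestamps are RFC3339
docker logs kubelet --since 2021-08-10T09:58:00Z --until 2021-08-10T10:03:00Z 2>&1 | less
# If that still returns nothing, try a wider range in local time
docker logs kubelet --since 2021-08-10T09:00:00 --until 2021-08-10T13:00:00 2>&1 | less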
In the logs on sky-210004-ds, I see this very often:
E0816 07:35:21.574295 1811601 reconciler.go:193] operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume "storage" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^grafana-data") pod "63382382-76a5-4165-a514-b0628fddb153" (UID: "63382382-76a5-4165-a514-b0628fddb153") : UnmountVolume.NewUnmounter failed for volume "storage" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^grafana-data") pod "63382382-76a5-4165-a514-b0628fddb153" (UID: "63382382-76a5-4165-a514-b0628fddb153") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/63382382-76a5-4165-a514-b0628fddb153/volumes/kubernetes.io~csi/grafana-data/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/63382382-76a5-4165-a514-b0628fddb153/volumes/kubernetes.io~csi/grafana-data/vol_data.json]: open /var/lib/kubelet/pods/63382382-76a5-4165-a514-b0628fddb153/volumes/kubernetes.io~csi/grafana-data/vol_data.json: no such file or directory
Based on the error message, the issue triggered by volume storage is related to https://github.com/kubernetes/kubernetes/issues/101911
I will check what happens and whether it gets fixed later.
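The error above says vol_data.json is gone while the pod volume directory still exists. A quick diagnostic sketch for sky-210004-ds (the pod UID is taken from the log line above):

# Check whether the CSI volume directory is an orphaned leftover missing vol_data.json
ls -la /var/lib/kubelet/pods/63382382-76a5-4165-a514-b0628fddb153/volumes/kubernetes.io~csi/grafana-data/
# The kubelet also logs its cleanup attempts for such directories
docker logs kubelet 2>&1 | grep -i "orphaned pod"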
BTW, volume storage does not exist in the support bundle, so I am not sure what happened to that volume. Hence I prefer to check the volume I mentioned above.
Hello, do you want me to generate a new support bundle?
Does your current cluster still contain the info/logs about volume storage? If yes, please re-generate it.
I have rebuilt a node, and I still see several errors in Longhorn that I have already reported on this issue, on the other topic, and on Slack.
I will send you a new email.
etcd logs:
2021-08-16 13:47:30.632868 W | etcdserver: read-only range request "key:\"/registry/leases/longhorn-system/external-attacher-leader-driver-longhorn-io\" " with result "range_response_count:1 size:538" took too long (196.88205ms) to execute
2021-08-16 13:47:31.153097 W | etcdserver: read-only range request "key:\"/registry/leases/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:522" took too long (222.49227ms) to execute
2021-08-16 13:47:31.381512 W | etcdserver: read-only range request "key:\"/registry/podsecuritypolicy/\" range_end:\"/registry/podsecuritypolicy0\" count_only:true " with result "range_response_count:0 size:9" took too long (260.703295ms) to execute
2021-08-16 13:47:31.381646 W | etcdserver: read-only range request "key:\"/registry/leases/kube-system/kube-scheduler\" " with result "range_response_count:1 size:494" took too long (390.475513ms) to execute
2021-08-16 13:47:31.907976 W | etcdserver: read-only range request "key:\"/registry/events/longhorn-system/pvc-a11eda9a-12ac-401e-96ec-1664690ce302-s1h-c-1629121440-t7xqd.169bcd717a313867\" " with result "range_response_count:1 size:977" took too long (222.982466ms) to execute
2021-08-16 13:47:31.914962 W | etcdserver: read-only range request "key:\"/registry/crd.projectcalico.org/globalnetworksets/\" range_end:\"/registry/crd.projectcalico.org/globalnetworksets0\" count_only:true " with result "range_response_count:0 size:7" took too long (143.578683ms) to execute
2021-08-16 13:47:31.914979 W | etcdserver: read-only range request "key:\"/registry/leases/kube-system/kube-scheduler\" " with result "range_response_count:1 size:494" took too long (227.134138ms) to execute
2021-08-16 13:49:40.912039 I | mvcc: store.index: compact 24058185
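Those warnings mean etcd range reads are taking hundreds of milliseconds, which usually points at slow disk or an overloaded control-plane node rather than Longhorn itself. A sketch for checking etcd health from one of the etcd nodes, assuming the RKE etcd container is named etcd and etcdctl inside it already has the client certificates configured (otherwise pass --cacert/--cert/--key explicitly):

# Leader/DB-size status per member
docker exec etcd etcdctl endpoint status --write-out=table
# Built-in performance check (writes a small test workload)
docker exec etcd etcdctl check perf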
up
In the support bundle, there is neither volume storage nor the attachment/mount error. Please:
Describe the bug
Launching a workload that requires a new RWX volume takes a long time (4~5 minutes). The direct cause is the unexpectedly slow RWX volume attachment. Here is the original issue report: https://rancher-users.slack.com/archives/CC2UQM49Y/p1628589991044200?thread_ts=1628252452.025500&cid=CC2UQM49Y
Expected behavior
The RWX volume is typically attached within 5~30 seconds.