Logged into the worker-02 node via ssh.
cat /etc/iscsi/initiatorname.iscsi
## DO NOT EDIT OR REMOVE THIS FILE!
## If you remove this file, the iSCSI daemon will not start.
## If you change the InitiatorName, existing access control lists
## may reject this initiator. The InitiatorName must be unique
## for each iSCSI initiator. Do NOT duplicate iSCSI InitiatorNames.
InitiatorName=iqn.1993-08.org.debian:01:b93aa358deea
However, checking the IQN alone doesn't always help, since most machines ship with the same IQN by default.
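If duplicate IQNs are suspected, the initiator name can be regenerated per node. A minimal sketch, assuming the open-iscsi tools and systemd are in use (iscsi-iname simply generates a random IQN):
# Generate a fresh, random initiator name and install it
sudo sh -c 'echo "InitiatorName=$(iscsi-iname)" > /etc/iscsi/initiatorname.iscsi'
cat /etc/iscsi/initiatorname.iscsi
# Restart the iSCSI daemon so the new name is used for future sessions
sudo systemctl restart iscsid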
Looked at the kubelet and system (kernel/iscsid) logs; in this case they were all in syslog, since kubelet was running as a service on the host. The following entries show the connection being established and the subsequent failure to mount.
Feb 23 14:06:50 worker-02 kubelet[6072]: I0223 14:06:50.637206 6072 operation_generator.go:1111] Controller attach succeeded for volume "pvc-93202023-1896-11e8-b8a8-96000007f375" (UniqueName: "kubernetes.io/iscsi/10.102.156.74:3260:iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375:0") pod "wordpress-55cbcdd99b-nrh5v" (UID: "8b9c9987-1899-11e8-b8a8-96000007f375") device path: ""
Feb 23 14:06:50 worker-02 kubelet[6072]: I0223 14:06:50.736320 6072 operation_generator.go:446] MountVolume.WaitForAttach entering for volume "pvc-93202023-1896-11e8-b8a8-96000007f375" (UniqueName: "kubernetes.io/iscsi/10.102.156.74:3260:iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375:0") pod "wordpress-55cbcdd99b-nrh5v" (UID: "8b9c9987-1899-11e8-b8a8-96000007f375") DevicePath ""
Feb 23 14:06:50 worker-02 kubelet[6072]: E0223 14:06:50.744831 6072 iscsi_util.go:235] iscsi: failed to rescan session with error: iscsiadm: No session found.
Feb 23 14:06:50 worker-02 kubelet[6072]: (exit status 21)
Feb 23 14:06:51 worker-02 kernel: [349862.236890] scsi host4: iSCSI Initiator over TCP/IP
Feb 23 14:06:51 worker-02 kernel: [349862.755929] scsi 4:0:0:0: Direct-Access CLOUDBYT OPENEBS 0.2 PQ: 0 ANSI: 5
Feb 23 14:06:51 worker-02 kernel: [349862.758792] sd 4:0:0:0: Attached scsi generic sg2 type 0
Feb 23 14:06:51 worker-02 kernel: [349862.759072] sd 4:0:0:0: [sdb] 4194304 512-byte logical blocks: (2.15 GB/2.00 GiB)
Feb 23 14:06:51 worker-02 kernel: [349862.759075] sd 4:0:0:0: [sdb] 4096-byte physical blocks
Feb 23 14:06:51 worker-02 kernel: [349862.759834] sd 4:0:0:0: [sdb] Write Protect is off
Feb 23 14:06:51 worker-02 kernel: [349862.759836] sd 4:0:0:0: [sdb] Mode Sense: 03 00 10 08
Feb 23 14:06:51 worker-02 kernel: [349862.760109] sd 4:0:0:0: [sdb] No Caching mode page found
Feb 23 14:06:51 worker-02 kernel: [349862.763321] sd 4:0:0:0: [sdb] Assuming drive cache: write through
Feb 23 14:06:51 worker-02 kernel: [349862.883977] sd 4:0:0:0: [sdb] Attached SCSI disk
Feb 23 14:06:51 worker-02 iscsid: Connection2:0 to [target: iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375, portal: 10.102.156.74,3260] through [iface: default] is operational now
Feb 23 14:06:54 worker-02 kubelet[6072]: E0223 14:06:54.962815 6072 iscsi_util.go:338] iscsi: failed to mount iscsi volume /dev/disk/by-path/ip-10.102.156.74:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0 [ext4] to /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.102.156.74:3260-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0, error 'fsck' found errors on device /dev/disk/by-path/ip-10.102.156.74:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0 but could not correct them: fsck from util-linux 2.27.1
Feb 23 14:06:54 worker-02 kubelet[6072]: /dev/sdb: Superblock has an invalid journal (inode 8).
Feb 23 14:06:54 worker-02 kubelet[6072]: CLEARED.
Feb 23 14:06:54 worker-02 kubelet[6072]: *** ext3 journal has been deleted - filesystem is now ext2 only ***
Feb 23 14:06:54 worker-02 kubelet[6072]: /dev/sdb: One or more block group descriptor checksums are invalid. FIXED.
Feb 23 14:06:54 worker-02 kubelet[6072]: /dev/sdb: Group descriptor 0 checksum is 0x0000, should be 0x9444.
Feb 23 14:06:54 worker-02 kubelet[6072]: /dev/sdb: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Feb 23 14:06:54 worker-02 kubelet[6072]: #011(i.e., without -a or -p options)
Feb 23 14:06:54 worker-02 kubelet[6072]: .
Feb 23 14:06:54 worker-02 kubelet[6072]: E0223 14:06:54.965441 6072 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/iscsi/10.102.156.74:3260:iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375:0\"" failed. No retries permitted until 2018-02-23 14:06:55.465364682 +0100 CET m=+349788.010169186 (durationBeforeRetry 500ms). Error: "MountVolume.WaitForAttach failed for volume \"pvc-93202023-1896-11e8-b8a8-96000007f375\" (UniqueName: \"kubernetes.io/iscsi/10.102.156.74:3260:iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375:0\") pod \"wordpress-55cbcdd99b-nrh5v\" (UID: \"8b9c9987-1899-11e8-b8a8-96000007f375\") : 'fsck' found errors on device /dev/disk/by-path/ip-10.102.156.74:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0 but could not correct them: fsck from util-linux 2.27.1\n/dev/sdb: Superblock has an invalid journal (inode 8).\nCLEARED.\n*** ext3 journal has been deleted - filesystem is now ext2 only ***\n\n/dev/sdb: One or more block group descriptor checksums are invalid. FIXED.\n/dev/sdb: Group descriptor 0 checksum is 0x0000, should be 0x9444. \n\n/dev/sdb: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n."
Feb 23 14:06:55 worker-02 kernel: [349866.797554] sd 4:0:0:0: lun280922523394096 has a LUN larger than allowed by the host adapter
Feb 23 14:06:55 worker-02 kubelet[6072]: E0223 14:06:55.663460 6072 iscsi_util.go:338] iscsi: failed to mount iscsi volume /dev/disk/by-path/ip-10.102.156.74:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0 [ext4] to /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.102.156.74:3260-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0, error 'fsck' found errors on device /dev/disk/by-path/ip-10.102.156.74:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0 but could not correct them: fsck from util-linux 2.27.1
Feb 23 14:06:55 worker-02 kubelet[6072]: /dev/sdb: One or more block group descriptor checksums are invalid. FIXED.
Feb 23 14:06:55 worker-02 kubelet[6072]: /dev/sdb: Group descriptor 0 checksum is 0x0000, should be 0x9444.
Feb 23 14:06:55 worker-02 kubelet[6072]: /dev/sdb: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Feb 23 14:06:55 worker-02 kubelet[6072]: #011(i.e., without -a or -p options)
Feb 23 14:06:55 worker-02 kubelet[6072]: .
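The "iscsiadm: No session found" error above (exit status 21) just means no session existed yet at that point; it can be cross-checked manually on the node. A rough sketch, assuming the standard open-iscsi CLI:
# List active sessions and their state (exit status 21 = no sessions)
sudo iscsiadm -m session -P 1
# Once the session is up, rescan it and confirm the by-path device appeared
sudo iscsiadm -m session --rescan
ls -l /dev/disk/by-path/ | grep openebs.jiva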
The OpenEBS volume controller (the iSCSI target) only showed the following messages, which were inconclusive as to whether the fsck failure was caused by target-side errors.
time="2018-02-23T13:05:13Z" level=info msg="10.244.1.9:3260"
time="2018-02-23T13:05:13Z" level=info msg="Accepting ..."
time="2018-02-23T13:05:13Z" level=info msg="connection is connected from 10.244.2.0:50324...\n"
time="2018-02-23T13:05:13Z" level=info msg="Listening ..."
time="2018-02-23T13:05:13Z" level=info msg="New Session initiator name:iqn.1993-08.org.debian:01:b93aa358deea,target name:iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375,ISID:0x23d010000"
time="2018-02-23T13:05:19Z" level=error msg=EOF
time="2018-02-23T13:06:50Z" level=info msg="10.244.1.9:3260"
time="2018-02-23T13:06:50Z" level=info msg="Accepting ..."
time="2018-02-23T13:06:50Z" level=info msg="connection is connected from 10.244.3.0:40700...\n"
time="2018-02-23T13:06:50Z" level=info msg="Listening ..."
time="2018-02-23T13:06:50Z" level=warning msg="unexpected connection state: full feature"
time="2018-02-23T13:06:50Z" level=error msg=EOF
time="2018-02-23T13:06:50Z" level=info msg="10.244.1.9:3260"
time="2018-02-23T13:06:50Z" level=info msg="Accepting ..."
time="2018-02-23T13:06:50Z" level=info msg="connection is connected from 10.244.3.0:40702...\n"
time="2018-02-23T13:06:50Z" level=info msg="Listening ..."
time="2018-02-23T13:06:51Z" level=info msg="New Session initiator name:iqn.1993-08.org.debian:01:b93aa358deea,target name:iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375,ISID:0x23d020000"
time="2018-02-23T13:06:51Z" level=error msg="non support"
time="2018-02-23T13:06:51Z" level=warning msg="check condition"
time="2018-02-23T13:06:52Z" level=warning msg="check condition"
time="2018-02-23T13:06:52Z" level=warning msg="check condition"
But since this was a new volume, the following workaround was used to bring it back online.
Workaround: To get the volume back online, ran fsck /dev/sdb on the host. After this, the volume became accessible and the application started.
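For reference, a slightly more explicit version of that workaround; a sketch only, using the by-path link from the logs above, and the filesystem must not be mounted while it is being repaired:
# Resolve the iSCSI by-path link to the underlying block device (/dev/sdb in this case)
DEV=$(readlink -f /dev/disk/by-path/ip-10.102.156.74:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-93202023-1896-11e8-b8a8-96000007f375-lun-0)
# Force a full check; add -y to accept all fixes non-interactively
sudo e2fsck -f "$DEV"
# kubelet retries the mount on its own (see the durationBeforeRetry message above)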
I've had this same issue happen to me twice; the first time I figured it was due to disk pressure and using sparse images, but this time around it happened without any disk pressure and, as far as I can tell, without any input from me at all. One of the three pods of my elasticsearch cluster decided to die, and when it came back up, the node couldn't mount the PVC filesystem.
The fact that this happens when a pod is in an otherwise healthy state is disquieting. Deleting pods is a valid balancing strategy that should not affect the filesystem's consistency, especially if the mounting is handled at the host level.
Is there a way that I can ask a replica to become the master, in case it might work where the current master doesn't?
I use CentOS 7 for my Kubernetes nodes. The fsck.ext4 provided in CentOS 7 doesn't support the same features that OpenEBS uses when it provisions the filesystem (is that OpenEBS, or Kubernetes, doing that?). As a result, I cannot fix this disk without adding third-party software to my node.
I also use Rancher to administer my Kubernetes rollout, so I suppose the mkfs for the filesystem could actually happen inside Rancher rather than at the CentOS 7 OS level.
Fortunately this is a clustered service, so I can delete the PVC, recreate the pod, and get back to work, but this would otherwise be a pretty serious problem in production.
One interesting item in the output below is that I've asked for the CAS volumes to be provisioned as xfs, but they're apparently being created as ext4. I'm not sure whether that's an issue with Rancher or OpenEBS.
pvc:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"4dd1d524-ca93-11e8-afd9-6a03d5095334","leaseDurationSeconds":15,"acquireTime":"2018-10-08T02:51:44Z","renewTime":"2018-10-08T02:51:46Z","leaderTransitions":0}'
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openebs.io/provisioner-iscsi
  creationTimestamp: null
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: eskeim
    release: eskeim
  name: eskeim-data-eskeim-0
  selfLink: /api/v1/namespaces/monitoring/persistentvolumeclaims/eskeim-data-eskeim-0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: openebs-1repl
  volumeName: monitoring-eskeim-data-eskeim-0-3882014528
status: {}
storage class:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    cas.openebs.io/config: |
      - name: ReplicaCount
        value: "1"
      - name: StoragePool
        value: default
  creationTimestamp: null
  name: openebs-1repl
  selfLink: /apis/storage.k8s.io/v1/storageclasses/openebs-1repl
parameters:
  openebs.io/fstype: xfs
provisioner: openebs.io/provisioner-iscsi
reclaimPolicy: Delete
volumeBindingMode: Immediate
storage pool:
apiVersion: openebs.io/v1alpha1
kind: StoragePool
metadata:
  generation: 1
  labels:
    openebs.io/version: 0.7.0
  name: default
  namespace: ""
  resourceVersion: ""
  selfLink: /apis/openebs.io/v1alpha1/storagepools/default
  uid: ""
spec:
  path: /var/openebs
I am having this issue occur consistently in my deployments. It seems to happen when the storage fails to honor an eviction request during a kubectl drain scenario. The problem is that whatever is provisioning the disk is using a newer version of ext4 than CentOS 7 supports.
I tried jumping into the replica pod to scan the filesystem on /openebs/volume-head-000.img, but the fsck in that container is also too old to recognize metadata_csum.
What provisions the filesystem when the pvc requests it? It has completely ignored my request to use xfs and provisions everything as ext4.
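One hedged way to confirm the feature mismatch is to look at the ext4 feature flags, either on the node or inside the replica pod (this assumes e2fsprogs/dumpe2fs is available in both places; <replica-pod> is a placeholder):
# On the node: show the feature flags of the exported device
sudo dumpe2fs -h /dev/sdb | grep -i 'features'
# Inside the jiva replica pod: inspect the backing image directly
kubectl exec -it <replica-pod> -- dumpe2fs -h /openebs/volume-head-000.img | grep -i 'features'
# If metadata_csum is listed, an e2fsck older than e2fsprogs 1.43 (CentOS 7 ships 1.42.9) cannot repair it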
I am facing the same problem. The Pod event says to run fsck manually, but the elasticsearch Pod does not start at all, and I see no other way than entering a running container to run fsck on that device. Whenever I define a new deployment (e.g., plain busybox) that uses the PV, the volume has to be mounted, and the mount fails because of the faulty filesystem.
@damlub - which OS are you using?
@PeterGrace - FSType was being ignored in 0.7.0; this is fixed in 0.7.1. The FSType needs to be specified under cas.openebs.io/config as shown below:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    cas.openebs.io/config: |
      - name: ReplicaCount
        value: "1"
      - name: StoragePool
        value: default
      - name: FSType
        value: xfs
  creationTimestamp: null
  name: openebs-1repl
  selfLink: /apis/storage.k8s.io/v1/storageclasses/openebs-1repl
provisioner: openebs.io/provisioner-iscsi
reclaimPolicy: Delete
volumeBindingMode: Immediate
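After recreating the PVC against the corrected StorageClass, the filesystem type kubelet will actually use can be verified on the PV. A sketch only; the PV name below is the one from the logs earlier in this issue:
# The iSCSI source on the PV records the fsType used for formatting/mounting
kubectl get pv pvc-93202023-1896-11e8-b8a8-96000007f375 -o jsonpath='{.spec.iscsi.fsType}{"\n"}'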
@kmova My systems are CentOS 7.5
So far I have figured out that the OpenEBS containers use a newer version of the ext4 tooling (specifically e2fsck) than CentOS does. Not sure if this is related. If it would help, I can switch to Ubuntu 16.04 or 18.04, too.
Thanks @damlub - Could you try with Ubuntu 16.04 while we check on CentOS 7.5?
@damlub I deployed OpenEBS on CentOS 7.0 successfully, with no issues so far with ext4/xfs. I'm also trying to set up CentOS 7.5 using Vagrant, but I'm having some issues with it; I will keep you posted on the progress. I used the following Vagrantfile to bring up the Kubernetes cluster.
@damlub I am having issues bringing up a CentOS 7.5 setup using Vagrant or kops. I would like to discuss this in detail; please join us at openebs-community.slack.com, or share your Slack handle if you have already joined.
Happened again today.
Warning FailedMount 34s (x11 over 6m) kubelet, k8snode01 MountVolume.MountDevice failed for volume "pvc-42b9c99b-1c36-11e9-a880-060c2990a513" : 'fsck' found errors on device /dev/disk/by-path/ip-10.43.249.129:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-42b9c99b-1c36-11e9-a880-060c2990a513-lun-0 but could not correct them: fsck from util-linux 2.29.2
/dev/sde: One or more block group descriptor checksums are invalid. FIXED.
/dev/sde: Group descriptor 64 checksum is 0x0000, should be 0xce0a.
/dev/sde: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
Warning FailedMount 25s (x3 over 4m) kubelet, k8snode01 Unable to mount volumes for pod "eskeim-1_monitoring(49697f69-1c3f-11e9-a880-060c2990a513)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"eskeim-1". list of unmounted volumes=[eskeim-data]. list of unattached volumes=[eskeim-data default-token-vgt6j]
volume:
$ kubectl describe pv pvc-42b9c99b-1c36-11e9-a880-060c2990a513
Name: pvc-42b9c99b-1c36-11e9-a880-060c2990a513
Labels: openebs.io/cas-type=jiva
openebs.io/storageclass=jiva-1rep
Annotations: openEBSProvisionerIdentity=k8snode02
openebs.io/cas-type=jiva
pv.kubernetes.io/provisioned-by=openebs.io/provisioner-iscsi
Finalizers: [kubernetes.io/pv-protection]
StorageClass: jiva-1rep
Status: Bound
Claim: monitoring/eskeim-data-eskeim-1
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 20Gi
Node Affinity: <none>
Message:
Source:
Type: ISCSI (an ISCSI Disk resource that is attached to a kubelet's host machine and then exposed to the pod)
TargetPortal: 10.43.249.129:3260
IQN: iqn.2016-09.com.openebs.jiva:pvc-42b9c99b-1c36-11e9-a880-060c2990a513
Lun: 0
ISCSIInterface default
FSType: ext4
ReadOnly: false
Portals: []
DiscoveryCHAPAuth: false
SessionCHAPAuth: false
SecretRef: <nil>
InitiatorName: <none>
Events: <none>
The StorageClass explicitly says to use xfs:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    cas.openebs.io/config: |
      - name: ReplicaCount
        value: "1"
      - name: StoragePool
        value: default
      - name: FStype
        value: xfs
      #- name: TargetResourceLimits
      #  value: |-
      #      memory: 1Gi
      #      cpu: 100m
      #- name: AuxResourceLimits
      #  value: |-
      #      memory: 0.5Gi
      #      cpu: 50m
      #- name: ReplicaResourceLimits
      #  value: |-
      #      memory: 2Gi
    openebs.io/cas-type: jiva
    openebs.io/fstype: xfs
  creationTimestamp: 2018-12-27T15:17:00Z
  name: jiva-1rep
  resourceVersion: "137502"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/jiva-1rep
  uid: 75597ad9-09ea-11e9-a880-060c2990a513
provisioner: openebs.io/provisioner-iscsi
reclaimPolicy: Delete
volumeBindingMode: Immediate
All of my openebs pods are running 0.8.0:
Image: quay.io/openebs/cstor-pool:0.8.0
Image: quay.io/openebs/cstor-pool-mgmt:0.8.0
Image: quay.io/openebs/cstor-pool:0.8.0
Image: quay.io/openebs/cstor-pool-mgmt:0.8.0
Image: quay.io/openebs/cstor-pool:0.8.0
Image: quay.io/openebs/cstor-pool-mgmt:0.8.0
Image: quay.io/openebs/cstor-pool:0.8.0
Image: quay.io/openebs/cstor-pool-mgmt:0.8.0
Image: quay.io/openebs/m-apiserver:0.8.0
Image: quay.io/openebs/node-disk-manager-amd64:v0.2.0
Image: quay.io/openebs/node-disk-manager-amd64:v0.2.0
Image: quay.io/openebs/node-disk-manager-amd64:v0.2.0
Image: quay.io/openebs/node-disk-manager-amd64:v0.2.0
Image: quay.io/openebs/openebs-k8s-provisioner:0.8.0
Image: quay.io/openebs/snapshot-controller:0.8.0
Image: quay.io/openebs/snapshot-provisioner:0.8.0
Did I mess up the annotation for FSType somehow? I'm not sure why the disk is still using ext4 if I'm explicitly telling it to use xfs.
@PeterGrace can you also help by providing the kubelet logs from the node where this happened? If you are running your own master, please grab the kube-controller-manager log as well.
@PeterGrace the issue is with your storage class: the annotation entry should be named FSType instead of FStype. That's why it was not honoring xfs.
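A quick way to double-check the key casing on the live StorageClass; a sketch, using the SC name from the output above (the annotation value is the raw cas config block):
# Print the raw cas.openebs.io/config annotation and verify the entry reads "FSType"
kubectl get sc jiva-1rep -o jsonpath='{.metadata.annotations.cas\.openebs\.io/config}'
# Note: fixing the annotation only affects newly provisioned volumes; existing PVs keep ext4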
We have also tried to simulate the UNEXPECTED INCONSISTENCY in a node-drain scenario multiple times, on a three-node cluster with 3 replicas, using the following storage class (a sketch of the drain cycle follows the PVC output below), but we were unable to reproduce it.
Name: openebs-standard
IsDefaultClass: No
Annotations: cas.openebs.io/config=- name: ReplicaCount
value: "3"
- name: FSType
value: "xfs"
Provisioner: openebs.io/provisioner-iscsi
Parameters: <none>
AllowVolumeExpansion: <unset>
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: Immediate
Events: <none>
kubectl describe pvc:
Name: mongo-jiva-claim-mongo-0
Namespace: default
StorageClass: openebs-standard
Status: Bound
Volume: pvc-0e05b7c2-237d-11e9-b26f-06f90e7ebe0a
Labels: environment=test
openebs.io/replica-anti-affinity=vehicle-db
role=mongo
Annotations: pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
volume.beta.kubernetes.io/storage-provisioner=openebs.io/provisioner-iscsi
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 2G
Access Modes: RWO
Events: <none>
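For completeness, the drain cycle used in the repro attempts looked roughly like the following. A sketch only; the node name is taken from the events above, and the exact flags are assumptions:
# Evict the workload from the node currently serving the mount
kubectl drain k8snode01 --ignore-daemonsets --delete-local-data --force
# Watch the pod get rescheduled and the volume remounted on another node
kubectl get pods -o wide -w
# Bring the node back into scheduling
kubectl uncordon k8snode01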
If you hit this issue again, please collect the following logs for us (a collection sketch follows this list):
kubectl get sc
kubectl get pv
kubectl get pvc
journalctl -u kubelet (from the node where the volume is currently being mounted, and also from the node where it was mounted earlier)
dmesg (from both of the nodes mentioned above)
kubectl logs <ctrl-pods>
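A minimal collection sketch along those lines (assumptions: kubectl access plus a shell on the affected nodes; <ctrl-pod> is a placeholder for your volume's controller pod):
# Cluster-side objects
kubectl get sc > sc.txt
kubectl get pv > pv.txt
kubectl get pvc --all-namespaces > pvc.txt
# Controller (target) pod logs
kubectl logs <ctrl-pod> > ctrl.log
# Run these on each node that mounted the volume
journalctl -u kubelet --no-pager > kubelet-$(hostname).log
dmesg > dmesg-$(hostname).txt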
We need help reproducing this issue. Keeping the issue open, as a couple of users have hit it.
Issues go stale after 90d of inactivity.
kubectl get pods
kubectl describe pod wordpress-55cbcdd99b-nrh5v
kubectl get svc
Used curl commands to query the volume status via the cluster IP 10.102.156.74, which showed that the volume controller and replicas were functional. (Refer #1275)
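For reference, the status queries were along these lines; a hedged sketch that assumes the Jiva controller exposes its REST API on port 9501 of the same cluster IP (endpoint paths may differ between versions):
# Replica connection state as seen by the controller
curl http://10.102.156.74:9501/v1/replicas
# Overall volume object
curl http://10.102.156.74:9501/v1/volumes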
After seeing that the volume was online, checked the OpenEBS volume controller logs:
kubectl logs pvc-93202023-1896-11e8-b8a8-96000007f375-ctrl-7fb796f666-tq7gj
The same can be seen in the kubectl cluster-info dump output.
The OpenEBS controller logs above show that the connection was made from the initiator IQN: "New Session initiator name:iqn.1993-08.org.debian:01:b93aa358deea". Started checking from "worker-02", where the volume should be mounted.