
OpenEBS and Kubic FailedMount #13

Closed anthr76 closed 3 years ago

anthr76 commented 3 years ago

Greetings,

I'm not sure whether anyone here has tested or used OpenEBS with Kubic, but on a fresh cluster install my pods with PVCs are unable to start. My previous cluster install had the exact same issue, and I believe rebooting the nodes made the cluster healthy. Before I do that I would like to log the issue here in case this is an OS-specific issue rather than anything else. I have also raised the issue in OpenEBS's Kubernetes Slack.

Steps to reproduce

  1. Install iSCSI with transactional-update, and reboot or install with combustion
  2. Enable/Start iSCSI daemon
  3. Provision a Kubernetes cluster with Kubeadm and Cilium helm chart
  4. Install OpenEBS
  5. Provision OpenEBS for cStor sparse files
  6. Deploy a service that uses a PVC
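For reference, the steps above look roughly like this on a Kubic node. This is a hedged sketch of my setup, not an exact transcript: the kubeadm and Cilium configuration is elided, and the chart/StorageClass names may differ in your environment.

```shell
# Rough sketch of the reproduction steps (kubeadm/Cilium values elided).
transactional-update pkg install open-iscsi        # step 1; then reboot into the new snapshot
systemctl enable --now iscsid.socket iscsid.service  # step 2

kubeadm init                                       # step 3 (control plane; config elided)
helm install cilium cilium/cilium -n kube-system   # Cilium CNI via its Helm chart

helm install openebs openebs/openebs -n openebs --create-namespace  # step 4
# steps 5-6: reference the cStor sparse StorageClass (name may differ)
# from a PVC and deploy any workload that mounts it
kubectl apply -f workload-with-pvc.yaml            # hypothetical manifest
```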

Related useful logs:

iSCSId service

worker-02.k8s.rabbito.tech | CHANGED | rc=0 >> ## Pod with PVC is scheduled here
● iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2021-01-10 17:51:47 UTC; 2h 28min ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
   Main PID: 737 (iscsid)
     Status: "Ready to process requests"
      Tasks: 1 (limit: 4915)
     CGroup: /system.slice/iscsid.service
             └─737 /sbin/iscsid -f

Jan 10 20:17:19 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.103.149.160:3260 failed (No route to host)
Jan 10 20:17:38 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.99.196.135:3260 failed (No route to host)
Jan 10 20:17:57 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.99.196.135:3260 failed (No route to host)
Jan 10 20:18:16 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.99.196.135:3260 failed (No route to host)
Jan 10 20:18:42 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.99.196.135:3260 failed (No route to host)
Jan 10 20:18:49 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.99.196.135:3260 failed (No route to host)
Jan 10 20:19:00 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.103.149.160:3260 failed (No route to host)
Jan 10 20:19:37 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.103.43.174:3260 failed (No route to host)
Jan 10 20:19:41 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.103.43.174:3260 failed (No route to host)
Jan 10 20:19:46 worker-02.k8s.rabbito.tech iscsid[737]: iscsid: connect to 10.103.94.208:3260 failed (No route to host)
[WARNING]: Platform linux on host master-03.k8s.rabbito.tech is using the discovered Python interpreter at /usr/bin/python3, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/reference_appendices/interpreter_discovery.html for more information.
master-03.k8s.rabbito.tech | CHANGED | rc=0 >>
● iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2021-01-10 15:58:42 UTC; 4h 21min ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
   Main PID: 25141 (iscsid)
     Status: "Ready to process requests"
      Tasks: 1 (limit: 4471)
     CGroup: /system.slice/iscsid.service
             └─25141 /sbin/iscsid -f

Jan 10 15:58:42 master-03.k8s.rabbito.tech systemd[1]: Starting Open-iSCSI...
Jan 10 15:58:42 master-03.k8s.rabbito.tech systemd[1]: Started Open-iSCSI.
[WARNING]: Platform linux on host master-01.k8s.rabbito.tech is using the discovered Python interpreter at /usr/bin/python3, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/reference_appendices/interpreter_discovery.html for more information.
master-01.k8s.rabbito.tech | CHANGED | rc=0 >>
● iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2021-01-10 15:58:42 UTC; 4h 21min ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
   Main PID: 17442 (iscsid)
     Status: "Ready to process requests"
      Tasks: 1 (limit: 4915)
     CGroup: /system.slice/iscsid.service
             └─17442 /sbin/iscsid -f

Jan 10 15:58:42 master-01.k8s.rabbito.tech systemd[1]: Starting Open-iSCSI...
Jan 10 15:58:42 master-01.k8s.rabbito.tech systemd[1]: Started Open-iSCSI.
[WARNING]: Platform linux on host master-02.k8s.rabbito.tech is using the discovered Python interpreter at /usr/bin/python3, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.9/reference_appendices/interpreter_discovery.html for more information.
master-02.k8s.rabbito.tech | CHANGED | rc=0 >>
● iscsid.service - Open-iSCSI
     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2021-01-10 15:58:42 UTC; 4h 21min ago
TriggeredBy: ● iscsid.socket
       Docs: man:iscsid(8)
             man:iscsiuio(8)
             man:iscsiadm(8)
   Main PID: 27973 (iscsid)
     Status: "Ready to process requests"
      Tasks: 1 (limit: 4471)
     CGroup: /system.slice/iscsid.service
             └─27973 /sbin/iscsid -f

Jan 10 15:58:42 master-02.k8s.rabbito.tech systemd[1]: Starting Open-iSCSI...

What's interesting is that iscsid is complaining about "No route to host". Is this possibly an issue with Cilium that requires a reboot? A gist file with my Cilium values is here.

iSCSI binary locations. There are a few issues over on OpenEBS where certain cloud providers require you to modify the kubelet service with extra_binds and add the path of the iSCSI binaries:

  1. https://github.com/openebs/openebs/issues/1688#issuecomment-447900301
  2. https://github.com/digitalocean/marketplace-kubernetes/issues/25#issuecomment-535904899
worker-02.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsiadm
worker-01.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsiadm
master-01.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsiadm
master-03.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsiadm
master-02.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsiadm

worker-02.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsid
worker-01.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsid
master-03.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsid
master-01.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsid
master-02.k8s.rabbito.tech | CHANGED | rc=0 >>
/sbin/iscsid
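For reference, the per-host output above was gathered with Ansible ad-hoc commands along these lines (hedged: `k8s` is my inventory group name, not something from the logs):

```shell
# Ad-hoc sketch of how the binary-location output above was collected.
ansible k8s -m command -a "which iscsiadm"
ansible k8s -m command -a "which iscsid"
```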

Kubelet service:

cat /usr/lib/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
After=network.target network-online.target
Wants=docker.service crio.service
ConditionPathExists=/var/lib/kubelet/config.yaml

[Service]
ExecStartPre=/bin/bash -c "findmnt -t bpf --mountpoint /sys/fs/bpf > /dev/null || mount bpffs /sys/fs/bpf -t bpf"
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

----

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --volume-plugin-dir=/var/lib/kubelet/volume-plugin"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/sysconfig/kubelet
#FIXME ExecStartPre below is a HACK to work around kernel issue discovered related to boo#1171770
ExecStartPre=/usr/sbin/sysctl -a --system
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
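Given the `executable file not found in $PATH` failures below, and by analogy with the extra_binds issues linked above (which concern containerized kubelets), one hypothetical, untested workaround on Kubic would be a systemd drop-in that pins the kubelet's PATH so it can resolve the iSCSI and mkfs binaries:

```ini
# /etc/systemd/system/kubelet.service.d/20-iscsi-path.conf
# Hypothetical workaround (untested): ensure the kubelet can resolve
# iscsiadm and mkfs.ext4 regardless of the unit's inherited PATH.
[Service]
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
```

followed by `systemctl daemon-reload && systemctl restart kubelet`.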

Pod:

  Warning  FailedScheduling        16m                  default-scheduler        0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling        16m                  default-scheduler        0/5 nodes are available: 5 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled               15m                  default-scheduler        Successfully assigned monitoring/alertmanager-x-alertmanager-0 to worker-02.k8s.rabbito.tech
  Normal   SuccessfulAttachVolume  15m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-df3979a1-d975-44ba-bc94-4eda7f908fd6"
  Warning  FailedMount             15m                  kubelet                  MountVolume.SetUp failed for volume "config-volume" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount             4m52s (x2 over 13m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[alertmanager-x-alertmanager-db], unattached volumes=[alertmanager-x-alertmanager-db x-alertmanager-token-8m599 config-volume tls-assets]: timed out waiting for the condition
  Warning  FailedMount             2m34s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[alertmanager-x-alertmanager-db], unattached volumes=[tls-assets alertmanager-x-alertmanager-db x-alertmanager-token-8m599 config-volume]: timed out waiting for the condition
  Warning  FailedMount             75s (x15 over 15m)   kubelet                  MountVolume.MountDevice failed for volume "pvc-df3979a1-d975-44ba-bc94-4eda7f908fd6" : format of disk "/dev/disk/by-path/ip-10.100.138.76:3260-iscsi-iqn.2016-09.com.openebs.cstor:pvc-df3979a1-d975-44ba-bc94-4eda7f908fd6-lun-0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/10.100.138.76:3260-iqn.2016-09.com.openebs.cstor:pvc-df3979a1-d975-44ba-bc94-4eda7f908fd6-lun-0") options:("defaults") errcode:(executable file not found in $PATH) output:()
  Warning  FailedMount             18s (x4 over 11m)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[alertmanager-x-alertmanager-db], unattached volumes=[config-volume tls-assets alertmanager-x-alertmanager-db x-alertmanager-token-8m599]: timed out waiting for the condition
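The `executable file not found in $PATH` in the MountDevice event comes from the ext4 format step, so it is worth checking whether the binaries the kubelet needs actually resolve (keeping in mind the kubelet's PATH may differ from a login shell's):

```shell
# Check whether the binaries the kubelet needs for iSCSI volumes
# resolve from the current PATH; prints a path or a note per binary.
for bin in mkfs.ext4 iscsiadm; do
    command -v "$bin" || echo "$bin: not in PATH"
done
```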

PVC:

  Normal  Provisioning           16m                openebs.io/provisioner-iscsi_openebs-provisioner-5db9f49c74-g2zpg_24852b24-388b-41a0-b6bc-bece8aadb94f  External provisioner is provisioning volume for claim "monitoring/alertmanager-x-alertmanager-db-alertmanager-x-alertmanager-0"
  Normal  ExternalProvisioning   16m (x2 over 16m)  persistentvolume-controller                                                                             waiting for a volume to be created, either by external provisioner "openebs.io/provisioner-iscsi" or manually created by system administrator
  Normal  ProvisioningSucceeded  16m                openebs.io/provisioner-iscsi_openebs-provisioner-5db9f49c74-g2zpg_24852b24-388b-41a0-b6bc-bece8aadb94f  Successfully provisioned volume pvc-df3979a1-d975-44ba-bc94-4eda7f908fd6

Remedy:

Reboot each OpenEBS node
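Spelled out per node, the remedy looks like this (hedged: the node name is an example from my cluster, and the drain flags you need depend on your workloads):

```shell
# Per-node remedy sketch: drain, reboot, wait, uncordon.
node=worker-02.k8s.rabbito.tech
kubectl drain "$node" --ignore-daemonsets
ssh "$node" sudo systemctl reboot
# ...wait until "kubectl get node $node" reports Ready, then:
kubectl uncordon "$node"
```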

anthr76 commented 3 years ago

One other thing I would like to add: OpenEBS for whatever reason seems to be tripping up reboot manager. Reboot management is configured with kured, and I get constant reboots. To remedy this I put kured on a 6-hour window once a week. Is this expected?

thkukuk commented 3 years ago

Your "no route" issues are clearly the reason why iSCSI is not working, but I don't know anything about Cilium or whether that is the problem.

About constant reboots: I doubt that OpenEBS is triggering this. kured is looking for a file "/var/run/reboot-required". If this exists, a reboot is triggered. By default, /var/run is a symlink to /run, which is on tmpfs, so after a reboot the file should be gone. Looks like this is not the case for you. Please check that this file no longer exists after a reboot. If it still exists, find out why it is not on tmpfs or who re-creates it.
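The checks described above can be sketched as a few shell commands (a minimal sketch; `check_sentinel` is a hypothetical helper, not part of kured):

```shell
# Quick checks for the kured reboot sentinel.
check_sentinel() {                      # prints "present" or "absent" for a path
    if [ -e "$1" ]; then echo present; else echo absent; fi
}

check_sentinel /var/run/reboot-required
readlink /var/run || true                     # normally prints "/run"
findmnt -no FSTYPE /run 2>/dev/null || true   # normally prints "tmpfs"
```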

anthr76 commented 3 years ago

Right, so should this be properly filed with Cilium? After rebooting the nodes, iSCSI comes back fine. Regardless of the "No route to host" errors, the kubelet is saying the iSCSI executable is not found in $PATH:

errcode:(executable file not found in $PATH)

Which raises the question of what the problem really is. Kubernetes networking works fine before rebooting to correct the issue.

anthr76 commented 3 years ago

As for the reboots, OpenEBS recommends disabling rebootmgr (see other). I will dig a bit deeper and see what is causing this sentinel to be set.

OpenEBS is storing data in sparse files under /var/openebs/*. Currently /var/run/reboot-required is not present on my node.
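As an aside on the sparse-file point: a sparse file's apparent size and its actual disk usage differ, which is easy to see with `du` (the demo file path here is arbitrary, not an OpenEBS path):

```shell
# Demonstrate sparseness: 1 GiB apparent size, almost no blocks used.
demo=$(mktemp)
truncate -s 1G "$demo"                  # sets the size, allocates nothing
du -B1 "$demo"                          # actual disk usage: ~0
du -B1 --apparent-size "$demo"          # logical size: 1073741824
rm -f "$demo"
```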