microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

An unexpected power outage caused the openpai system to fail to run tasks. #5316

Closed. siaimes closed this issue 3 years ago.

siaimes commented 3 years ago

Organization Name: Fuzhou University.

Short summary about the issue/question: An unexpected power outage caused the openpai system to keep my jobs waiting.

Unable to mount volumes for pod "b72fdd7a0f7e5d19812107762f386f9c-taskrole-0_default(85022050-bf85-42c4-b1b1-2ab9c98ce181)": timeout expired waiting for volumes to attach or mount for pod "default"/"b72fdd7a0f7e5d19812107762f386f9c-taskrole-0". list of unmounted volumes=[titanxp32g3-volume]. list of unattached volumes=[dshm pai-vol host-log job-ssh-secret-volume job-exit-spec titanxp32g3-volume runtime-account-token-zgsk4]

I have tried restarting the k8s service, restarting the node, and restarting the entire openpai system, but it didn't help.

Brief what process you are following: After an unexpected power outage, submit a job.

How to reproduce it:

OpenPAI Environment:

Anything else we need to know:

hzy46 commented 3 years ago

What is the volume titanxp32g3-volume? Could you check whether this PV works?
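
For reference, a quick status check from the dev-box could look like this (generic kubectl commands; the actual PV/PVC names depend on your storage config):

# list all PVs and PVCs, then inspect the one used by the job
kubectl get pv
kubectl get pvc -n default
kubectl describe pvc <your-pvc-name> -n default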

siaimes commented 3 years ago

It is a shared directory exported over NFS, with the storage configured via PV/PVC. It works well, and I can mount this storage on other CPU machines.

hzy46 commented 3 years ago

But it is the reason why openpai jobs cannot start. Could you please test the PV using a pod in K8S? e.g. create a pod in k8s to mount this PVC.

siaimes commented 3 years ago

But it is the reason why openpai jobs cannot start. Could you please test the PV using a pod in K8S? e.g. create a pod in k8s to mount this PVC.

How do you perform this process? I tried to mount this storage on other machines, and it can be mounted successfully.

fanyangCS commented 3 years ago

Maybe it is due to the storage service? @Binyang2014 please take a look.

fanyangCS commented 3 years ago

But it is the reason why openpai jobs cannot start. Could you please test the PV using a pod in K8S? e.g. create a pod in k8s to mount this PVC.

How do you perform this process? I tried to mount this storage on other machines, and it can be mounted successfully.

There is a storage service on the master node that manages the PVs of job pods. Please check whether its status is OK.

siaimes commented 3 years ago
rzxt@rzxt-openpai-007:~$ sudo systemctl status kubelet.service 
[sudo] password for rzxt: 
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-02-24 02:40:03 CST; 1 day 8h ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 5552 (kubelet)
    Tasks: 0
   Memory: 28.4M
      CPU: 348ms
   CGroup: /system.slice/kubelet.service
           ‣ 5552 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.98 --hostname-override=rzxt-openpai-007 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=gcr.io/googl

Feb 25 11:11:06 rzxt-openpai-007 kubelet[5552]: E0225 11:11:06.367101    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:06 rzxt-openpai-007 kubelet[5552]: I0225 11:11:06.451716    5552 setters.go:73] Using node IP: "172.17.175.98"
Feb 25 11:11:08 rzxt-openpai-007 kubelet[5552]: E0225 11:11:08.374146    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:10 rzxt-openpai-007 kubelet[5552]: E0225 11:11:10.367065    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:12 rzxt-openpai-007 kubelet[5552]: E0225 11:11:12.376115    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:14 rzxt-openpai-007 kubelet[5552]: E0225 11:11:14.367235    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:16 rzxt-openpai-007 kubelet[5552]: E0225 11:11:16.371503    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:16 rzxt-openpai-007 kubelet[5552]: I0225 11:11:16.461606    5552 setters.go:73] Using node IP: "172.17.175.98"
Feb 25 11:11:18 rzxt-openpai-007 kubelet[5552]: E0225 11:11:18.367051    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:20 rzxt-openpai-007 kubelet[5552]: E0225 11:11:20.371045    5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.

Maybe it is caused by this error.

siaimes commented 3 years ago

But it is the reason why openpai jobs cannot start. Could you please test the PV using a pod in K8S? e.g. create a pod in k8s to mount this PVC.

How do you perform this process? I tried to mount this storage on other machines, and it can be mounted successfully.

There is a storage service on the master node that manages the PVs of job pods. Please check whether its status is OK.

rzxt@rzxt-openpai-master:~$ sudo docker ps -a | grep storage
e939a123aeed        openpai/internal-storage-create             "/bin/sh -c '/bin/ba…"    33 hours ago        Up 33 hours                                                              k8s_internal-storage-create_internal-storage-create-ds-pqdtp_default_df78ae6e-fe06-4c64-b28b-6da551242810_0
e550c23113a7        gcr.io/google_containers/pause-amd64:3.1    "/pause"                  33 hours ago        Up 33 hours                                                              k8s_POD_internal-storage-create-ds-pqdtp_default_df78ae6e-fe06-4c64-b28b-6da551242810_0
hzy46 commented 3 years ago

@siaimes Please save the following pod spec to test.yaml, and use kubectl apply -f test.yaml to create it.


apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
      - sleep
      - "100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: <test-nfs-pvc-name>

Please replace <test-nfs-pvc-name> with your PVC's name.
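
Once the pod is Running, one possible way to verify the mount (standard kubectl commands; the pod name test comes from the spec above):

kubectl apply -f test.yaml
kubectl get pod test                      # wait until STATUS is Running
kubectl exec -it test -- df -h /data      # the NFS share should be mounted here
kubectl delete pod test                   # clean up when done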

Binyang2014 commented 3 years ago

@siaimes, please check this issue: https://github.com/microsoft/pai/issues/4081, and the comments: https://github.com/microsoft/pai/issues/4081#issuecomment-569654558

When the storage server restarts, some resources are not released on the client side. We need to release them manually. You can follow the comment to fix this issue.
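
As a rough sketch of that manual cleanup (the paths are illustrative; check the mount table on the affected worker first, and only unmount paths that belong to dead jobs):

# list current NFS mounts on the worker
mount | grep nfs

# lazily force-unmount a stale mount left under the kubelet pod directory
sudo umount -f -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>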

siaimes commented 3 years ago

@siaimes, please check this issue: #4081, and the comments: #4081 (comment)

When the storage server restarts, some resources are not released on the client side. We need to release them manually. You can follow the comment to fix this issue.

I mounted the storage directly when I submitted the jobs, and the jobs never started, so there is nothing to unmount.

siaimes commented 3 years ago

@siaimes Please save the following pod spec to test.yaml, and use kubectl apply -f test.yaml to create it.

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
      - sleep
      - "100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: <test-nfs-pvc-name>

Please replace <test-nfs-pvc-name> with your PVC's name.

Failed.

rzxt@rzxt-openpai-master:~$ nano test.yaml
rzxt@rzxt-openpai-master:~$ kubectl apply -f test.yaml 
error: unable to recognize "test.yaml": an error on the server ("") has prevented the request from succeeding
rzxt@rzxt-openpai-master:~$ cat test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
      - "sleep 100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: titanxp32g3-volume
rzxt@rzxt-openpai-master:~$ nano test.yaml
rzxt@rzxt-openpai-master:~$ kubectl apply -f test.yaml 
error: unable to recognize "test.yaml": an error on the server ("") has prevented the request from succeeding
rzxt@rzxt-openpai-master:~$ cat test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
      - "sleep 100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: titanxp32g3
rzxt@rzxt-openpai-master:~$ 
Binyang2014 commented 3 years ago

@siaimes, can you log into the problem node, find the NFS mount point and umount it manually? Or you can restart the problem worker node then check if this problem will be fixed.

siaimes commented 3 years ago

@siaimes, can you log into the problem node, find the NFS mount point and umount it manually? Or you can restart the problem worker node then check if this problem will be fixed.

  1. There is no NFS mount point in the system.
rzxt@rzxt-openpai-007:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  123M  3.1G   4% /run
/dev/sda2       220G   47G  163G  23% /
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/sdb1       1.8T  5.0G  1.7T   1% /home
/dev/sda1        93M  3.6M   89M   4% /boot/efi
tmpfs           3.2G     0  3.2G   0% /run/user/1000
  2. The system has been restarted, but it did not help.
Binyang2014 commented 3 years ago

Please provide the kubelet log: you can get the log by running journalctl -u kubelet
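
For example, something along these lines should capture the relevant part (the time window is arbitrary):

journalctl -u kubelet --since "2 days ago" > kubelet.log
grep "Orphaned pod" kubelet.log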

siaimes commented 3 years ago

Please provide the kubelet log: you can get the log by running journalctl -u kubelet

Here is the kubelet log; please analyze it when it is convenient for you. Thank you.

kubelet.log

Binyang2014 commented 3 years ago

The log contains this error:

Feb 24 21:07:45 rzxt-openpai-007 kubelet[5552]: E0224 21:07:45.609188    5552 desired_state_of_world_populator.go:298] Error processing volume "titanxp32g3-volume" for pod "b72fdd7a0f7e5d19812107762f386f9c-taskrole-0_default(85022050-bf85-42c4-b1b1-2ab9c98ce181)": error processing PVC "default"/"titanxp32g3": failed to fetch PVC default/titanxp32g3 from API server. err=persistentvolumeclaims "titanxp32g3" not found

Please make sure the PVC titanxp32g3 exists. You can list all PVCs by running kubectl get pvc in the dev-box.

siaimes commented 3 years ago

The log contains this error:

Feb 24 21:07:45 rzxt-openpai-007 kubelet[5552]: E0224 21:07:45.609188    5552 desired_state_of_world_populator.go:298] Error processing volume "titanxp32g3-volume" for pod "b72fdd7a0f7e5d19812107762f386f9c-taskrole-0_default(85022050-bf85-42c4-b1b1-2ab9c98ce181)": error processing PVC "default"/"titanxp32g3": failed to fetch PVC default/titanxp32g3 from API server. err=persistentvolumeclaims "titanxp32g3" not found

Please make sure the PVC titanxp32g3 exists. You can list all PVCs by running kubectl get pvc in the dev-box.

root@devbox:/# kubectl get pvc
NAME          STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
gtx1080ti1    Bound    gtx1080ti1-pv    800Gi      RWX                           20h
gtx1080ti2    Bound    gtx1080ti2-pv    1536Gi     RWX                           57d
k40m          Bound    k40m-pv          1536Gi     RWX                           57d
k80           Bound    k80-pv           3Ti        RWX                           57d
rtx2080ti     Bound    rtx2080ti-pv     30Ti       RWX                           37d
share         Bound    share-pv         30Ti       RWX                           37d
titanxp32g1   Bound    titanxp32g1-pv   1536Gi     RWX                           57d
titanxp32g2   Bound    titanxp32g2-pv   1536Gi     RWX                           57d
titanxp32g3   Bound    titanxp32g3-pv   1536Gi     RWX                           20h
titanxp32g4   Bound    titanxp32g4-pv   800Gi      RWX                           57d
titanxp32g5   Bound    titanxp32g5-pv   800Gi      RWX                           57d
titanxp64g1   Bound    titanxp64g1-pv   1536Gi     RWX                           57d
titanxp64g2   Bound    titanxp64g2-pv   800Gi      RWX                           57d
titanxp64g3   Bound    titanxp64g3-pv   800Gi      RWX                           57d
siaimes commented 3 years ago

The log contains this error:

Feb 24 21:07:45 rzxt-openpai-007 kubelet[5552]: E0224 21:07:45.609188    5552 desired_state_of_world_populator.go:298] Error processing volume "titanxp32g3-volume" for pod "b72fdd7a0f7e5d19812107762f386f9c-taskrole-0_default(85022050-bf85-42c4-b1b1-2ab9c98ce181)": error processing PVC "default"/"titanxp32g3": failed to fetch PVC default/titanxp32g3 from API server. err=persistentvolumeclaims "titanxp32g3" not found

Please make sure the PVC titanxp32g3 exists. You can list all PVCs by running kubectl get pvc in the dev-box.

I wanted to delete the PVC and reconfigure it, but the delete command does not return. After I forcibly terminated the command and tried to recreate it, it reports that the objects already exist.

root@devbox:/cluster-configuration/storage# kubectl delete -f titanxp32g3.yaml 
persistentvolume "titanxp32g3-pv" deleted
persistentvolumeclaim "titanxp32g3" deleted
^X^C
root@devbox:/cluster-configuration/storage# kubectl create -f titanxp32g3.yaml 
Error from server (AlreadyExists): error when creating "titanxp32g3.yaml": object is being deleted: persistentvolumes "titanxp32g3-pv" already exists
Error from server (AlreadyExists): error when creating "titanxp32g3.yaml": object is being deleted: persistentvolumeclaims "titanxp32g3" already exists
root@devbox:/cluster-configuration/storage# 
Binyang2014 commented 3 years ago

Please try this answer from Stack Overflow to delete the PVC: https://stackoverflow.com/questions/51358856/kubernetes-cant-delete-persistentvolumeclaim-pvc
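
Answers like that one typically come down to clearing the finalizers on the stuck objects so the pending delete can complete; a sketch using the names from this thread:

# remove the protection finalizers so the earlier delete can finish
kubectl patch pvc titanxp32g3 -p '{"metadata":{"finalizers":null}}'
kubectl patch pv titanxp32g3-pv -p '{"metadata":{"finalizers":null}}'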

siaimes commented 3 years ago

Please try this answer from Stack Overflow to delete the PVC: https://stackoverflow.com/questions/51358856/kubernetes-cant-delete-persistentvolumeclaim-pvc

Thanks, I have reconfigured the PVC, but the problem still exists. One thing to note: it is possible to submit a job with this PVC on a CPU-only machine.

Binyang2014 commented 3 years ago

Please attach the updated kubelet log. Maybe some library is missing on the GPU node.

siaimes commented 3 years ago

Please attach the updated kubelet log. Maybe some library is missing on the GPU node.

Here is the new kubelet log.

kubelet log

siaimes commented 3 years ago

I have configured a PVC for each node, and now more than half of my nodes have this error, but I don't think the failure is caused by the PVC itself: a job that uses an abnormal node's PVC runs successfully when it is scheduled on a normal node, and the abnormal nodes can run jobs without a PVC, but this error occurs as soon as any PVC is used.

Binyang2014 commented 3 years ago

Can you mount NFS directly on the abnormal node? Please run sudo mount -v -t nfs4 nfs-server_ip:/nfs_path ./target_path and provide the output.

siaimes commented 3 years ago

Can you mount NFS directly on the abnormal node? Please run sudo mount -v -t nfs4 nfs-server_ip:/nfs_path ./target_path and provide the output.

rzxt@rzxt-openpai-007:~$ sudo mount -v -t nfs4 172.17.175.98:/ /data
[sudo] password for rzxt: 
mount.nfs4: timeout set for Fri Feb 26 20:23:30 2021
mount.nfs4: trying text-based options 'addr=172.17.175.98,clientaddr=172.17.175.98'
rzxt@rzxt-openpai-007:~$ df -h
Filesystem       Size  Used Avail Use% Mounted on
udev              16G     0   16G   0% /dev
tmpfs            3.2G  203M  3.0G   7% /run
/dev/sda2        220G   47G  163G  23% /
tmpfs             16G     0   16G   0% /dev/shm
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs             16G     0   16G   0% /sys/fs/cgroup
/dev/sdb1        1.8T  5.0G  1.7T   1% /home
/dev/sda1         93M  3.6M   89M   4% /boot/efi
tmpfs            3.2G     0  3.2G   0% /run/user/1000
172.17.175.98:/  1.8T  5.0G  1.7T   1% /data
Binyang2014 commented 3 years ago

Can you provide the titanxp32g3 PV and PVC config, i.e. the titanxp32g3.yaml? The NFS mount works fine; we need to check whether the config is correct. Also please run kubectl describe pv titanxp32g3-pv and kubectl describe pv titanxp32g3-pvc for more information.

siaimes commented 3 years ago

Can you provide the titanxp32g3 PV and PVC config, i.e. the titanxp32g3.yaml? The NFS mount works fine; we need to check whether the config is correct. Also please run kubectl describe pv titanxp32g3-pv and kubectl describe pv titanxp32g3-pvc for more information.

root@devbox:/cluster-configuration/storage# cat titanxp32g3.yaml 
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: titanxp32g3-pv
  labels:
    name: titanxp32g3
spec:
  capacity:
    storage: 1.5Ti
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nfsvers=4.1
  nfs:
    path: /
    server: 172.17.175.98
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: titanxp32g3
  labels:
    share: "false"
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1.5Ti
  selector:
    matchLabels:
      name: titanxp32g3
root@devbox:/cluster-configuration/storage# kubectl describe pv titanxp32g3-pv
Name:            titanxp32g3-pv
Labels:          name=titanxp32g3
Annotations:     pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    
Status:          Bound
Claim:           default/titanxp32g3
Reclaim Policy:  Retain
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        1536Gi
Node Affinity:   <none>
Message:         
Source:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    172.17.175.98
    Path:      /
    ReadOnly:  false
Events:        <none>
root@devbox:/cluster-configuration/storage# kubectl describe pv titanxp32g3-pvc
Error from server (NotFound): persistentvolumes "titanxp32g3-pvc" not found
root@devbox:/cluster-configuration/storage# kubectl describe pv titanxp32g3    
Name:            titanxp32g3-pv
Labels:          name=titanxp32g3
Annotations:     pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    
Status:          Bound
Claim:           default/titanxp32g3
Reclaim Policy:  Retain
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        1536Gi
Node Affinity:   <none>
Message:         
Source:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    172.17.175.98
    Path:      /
    ReadOnly:  false
Events:        <none>
siaimes commented 3 years ago

Maybe I found a temporary solution as follows:

sudo rm -r /var/lib/kubelet/pods/26aeceed-147b-4615-90d3-e19f2e235111   # for all Orphaned pods in kubelet log
siaimes commented 3 years ago

Run this script repeatedly until all Orphaned pods are deleted.

#!/bin/bash

TEMP=$(sudo systemctl status kubelet.service | grep Orphaned | grep -o "[0-9a-z]\+-[0-9a-z]\+-[0-9a-z]\+-[0-9a-z]\+-[0-9a-z]\+" | tail -1)

sudo rm -r /var/lib/kubelet/pods/${TEMP}

Update: This script has the risk of losing data.

Binyang2014 commented 3 years ago

Thanks @siaimes, it looks like a k8s issue. I found there are many similar issues in the k8s repo: https://github.com/kubernetes/kubernetes/issues/60987. We will keep tracking this issue.

siaimes commented 3 years ago

Thanks @siaimes, it looks like a k8s issue. I found there are many similar issues in the k8s repo: kubernetes/kubernetes#60987. We will keep tracking this issue.

There was no power outage in my data center today, but several machines had the same failure. This temporary fix is not a long-term solution. I don't know why the k8s team has not fixed this bug. Can we fix it ourselves?

siaimes commented 3 years ago

I may have found why this problem occurs even without a power outage: a user submits a new job and then stops it while the image is still being pulled. Releasing resources at that point fails.

rzxt@rzxt-openpai-015:~$ sudo systemctl status kubelet.service 
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-03-04 22:30:21 CST; 18min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 5845 (kubelet)
    Tasks: 0
   Memory: 31.0M
      CPU: 785ms
   CGroup: /system.slice/kubelet.service
           ‣ 5845 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.106 --hostname-override=rzxt-openpai-015 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/

Mar 04 22:48:22 rzxt-openpai-015 kubelet[5845]: I0304 22:48:22.040315    5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "2d3634b2a47c: Downloading [=>                                                 ]  41.85MB/1.608GB"
Mar 04 22:48:24 rzxt-openpai-015 kubelet[5845]: I0304 22:48:24.643073    5845 setters.go:73] Using node IP: "172.17.175.106"
Mar 04 22:48:32 rzxt-openpai-015 kubelet[5845]: I0304 22:48:32.040302    5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "2d3634b2a47c: Downloading [=>                                                 ]  56.88MB/1.608GB"
Mar 04 22:48:34 rzxt-openpai-015 kubelet[5845]: I0304 22:48:34.656844    5845 setters.go:73] Using node IP: "172.17.175.106"
Mar 04 22:48:42 rzxt-openpai-015 kubelet[5845]: I0304 22:48:42.040283    5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "6d5dc792bcdf: Downloading [============>                                      ]  216.6MB/885.7MB"
Mar 04 22:48:42 rzxt-openpai-015 kubelet[5845]: I0304 22:48:42.404006    5845 kubelet_getters.go:177] status for pod nginx-proxy-rzxt-openpai-015 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-03-02 11:41:05 +0800 CST  } {Read
Mar 04 22:48:42 rzxt-openpai-015 kubelet[5845]: I0304 22:48:42.607538    5845 endpoint.go:111] State pushed for device plugin github.com/fuse
Mar 04 22:48:44 rzxt-openpai-015 kubelet[5845]: I0304 22:48:44.670189    5845 setters.go:73] Using node IP: "172.17.175.106"
Mar 04 22:48:52 rzxt-openpai-015 kubelet[5845]: I0304 22:48:52.040274    5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "2d3634b2a47c: Downloading [==>                                                ]   90.1MB/1.608GB"
Mar 04 22:48:54 rzxt-openpai-015 kubelet[5845]: I0304 22:48:54.681846    5845 setters.go:73] Using node IP: "172.17.175.106"
rzxt@rzxt-openpai-015:~$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─docker-dns.conf, docker-options.conf, http-proxy.conf
   Active: active (running) since Thu 2021-03-04 22:29:02 CST; 20min ago
     Docs: http://docs.docker.com
 Main PID: 1594 (dockerd)
    Tasks: 0
   Memory: 119.7M
      CPU: 319ms
   CGroup: /system.slice/docker.service
           ‣ 1594 /usr/bin/dockerd --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --iptables=false --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --dns 10.192.0.3 --dn

Mar 04 22:29:38 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:38.371590386+08:00" level=warning msg="36d245354901358bc4d39580596b5dcd3691280336313c5cfec9389dd8d02d8c cleanup: failed to unmount IPC: umount /mnt/docker/containers/36d245354901358b
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907114517+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907210410+08:00" level=warning msg="42302ad9fa0d102dd3b07ce884f4e9c5e369516c28b3a2359770d3f19fa44e22 cleanup: failed to unmount IPC: umount /mnt/docker/containers/42302ad9fa0d102d
Mar 04 22:29:59 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:59.604464306+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668361253+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668447312+08:00" level=warning msg="dae97114deceb0a564380f5a8a1217be78cbfd30289aede25133f0f082e35eff cleanup: failed to unmount IPC: umount /mnt/docker/containers/dae97114deceb0a5
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618282760+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618398068+08:00" level=warning msg="2d3f060ac46009b50d01d7bafb31962572ae006e1a8f4eb906bfb204a564fd08 cleanup: failed to unmount IPC: umount /mnt/docker/containers/2d3f060ac46009b5
Mar 04 22:41:44 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:41:44.524715721+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:43:26 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:43:26.541074204+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
rzxt@rzxt-openpai-015:~$ 
Binyang2014 commented 3 years ago

Thanks for the reply. But the error from the Docker daemon seems unrelated to NFS; it happens when Docker pulls the image and tries to mount the image content. Can you get the full error logs with journalctl -u docker? The messages seem to be truncated.

The NFS mount happens when the container starts: k8s mounts the NFS share and maps it into the container.

siaimes commented 3 years ago

Thanks for the reply. But the error from the Docker daemon seems unrelated to NFS; it happens when Docker pulls the image and tries to mount the image content. Can you get the full error logs with journalctl -u docker? The messages seem to be truncated.

The NFS mount happens when the container starts: k8s mounts the NFS share and maps it into the container.

-- Logs begin at Thu 2021-03-04 22:28:46 CST, end at Fri 2021-03-05 11:06:23 CST. --
Mar 04 22:29:01 rzxt-openpai-015 systemd[1]: Starting Docker Application Container Engine...
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.049809143+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.049849877+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050230391+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050239770+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050306787+08:00" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0  <nil>}]" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050324220+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050327669+08:00" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0  <nil>}]" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050350552+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050364824+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420080ee0, CONNECTING" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050374626+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420932100, CONNECTING" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.052276400+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420080ee0, READY" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.052628335+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420932100, READY" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.371628743+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429221305+08:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429388997+08:00" level=warning msg="Your kernel does not support swap memory limit"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429424396+08:00" level=warning msg="Your kernel does not support cgroup rt period"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429431413+08:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429908821+08:00" level=info msg="Loading containers: start."
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.728894651+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.846090675+08:00" level=info msg="Loading containers: done."
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.923707526+08:00" level=info msg="Docker daemon" commit=2d0083d graphdriver(s)=overlay2 version=18.09.7
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.924007478+08:00" level=info msg="Daemon has completed initialization"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.931389900+08:00" level=info msg="API listen on /var/run/docker.sock"
Mar 04 22:29:02 rzxt-openpai-015 systemd[1]: Started Docker Application Container Engine.
Mar 04 22:29:38 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:38.371526341+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:29:38 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:38.371590386+08:00" level=warning msg="36d245354901358bc4d39580596b5dcd3691280336313c5cfec9389dd8d02d8c cleanup: failed to unmount IPC: umount /mnt/docker/containers/36d245354901358bc4d39580596b5dcd3691280336313c5cfec9389dd8d02d8c/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907114517+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907210410+08:00" level=warning msg="42302ad9fa0d102dd3b07ce884f4e9c5e369516c28b3a2359770d3f19fa44e22 cleanup: failed to unmount IPC: umount /mnt/docker/containers/42302ad9fa0d102dd3b07ce884f4e9c5e369516c28b3a2359770d3f19fa44e22/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:29:59 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:59.604464306+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668361253+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668447312+08:00" level=warning msg="dae97114deceb0a564380f5a8a1217be78cbfd30289aede25133f0f082e35eff cleanup: failed to unmount IPC: umount /mnt/docker/containers/dae97114deceb0a564380f5a8a1217be78cbfd30289aede25133f0f082e35eff/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618282760+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618398068+08:00" level=warning msg="2d3f060ac46009b50d01d7bafb31962572ae006e1a8f4eb906bfb204a564fd08 cleanup: failed to unmount IPC: umount /mnt/docker/containers/2d3f060ac46009b50d01d7bafb31962572ae006e1a8f4eb906bfb204a564fd08/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:41:44 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:41:44.524715721+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:43:26 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:43:26.541074204+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:58:28 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:58:28.505896244+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:03:04 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:03:04.006664233+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:03:04 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:03:04.006812302+08:00" level=warning msg="89682056eb99c7cb2ecb0f34727e40deef16f2e3ed15bed43ab86592aeae2cbe cleanup: failed to unmount IPC: umount /mnt/docker/containers/89682056eb99c7cb2ecb0f34727e40deef16f2e3ed15bed43ab86592aeae2cbe/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 23:03:12 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:03:12.176132005+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 04 23:04:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:04:18.629548539+08:00" level=info msg="Container 5d0bbfd9f26f1c93c57bf65c0f63b01f3c3f362f1493622d0ea03a3986594aba failed to exit within 30 seconds of signal 15 - using the force"
Mar 04 23:04:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:04:18.743206847+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:04:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:04:18.871492949+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:33:09 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:33:09.009068853+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:33:09 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:33:09.009272168+08:00" level=warning msg="7b4b70bc38d0d887afbf071bf02184ec53c39616ebf67b3a70eba3d3dcf075c4 cleanup: failed to unmount IPC: umount /mnt/docker/containers/7b4b70bc38d0d887afbf071bf02184ec53c39616ebf67b3a70eba3d3dcf075c4/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 23:33:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:33:18.530873405+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 05 09:51:06 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:51:06.953423607+08:00" level=info msg="Container 80baa09ae76ea375f6c30854e784e6401f01e76e218fe0afcf8de48dff347ae5 failed to exit within 30 seconds of signal 15 - using the force"
Mar 05 09:51:07 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:51:07.233128555+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 09:51:07 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:51:07.359051322+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 09:52:11 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:52:11.129549057+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 09:52:11 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:52:11.129675875+08:00" level=warning msg="1c9499ff87f4c2e590e0cc5300268d977f024ab28e7849e4e68e6b125a19e113 cleanup: failed to unmount IPC: umount /mnt/docker/containers/1c9499ff87f4c2e590e0cc5300268d977f024ab28e7849e4e68e6b125a19e113/mounts/shm, flags: 0x2: no such file or directory"
Mar 05 09:52:14 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:52:14.965708283+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
siaimes commented 3 years ago

@Binyang2014 This problem occurs frequently, even when there is no power outage. Can you continue to track it?

Thank you very much.

Binyang2014 commented 3 years ago

Sure, I will keep tracking it. From the log you provided I cannot figure out the reason. Please provide some info about how you deploy NFS in your cluster. It seems your NFS server and NFS client are on the same machine?

siaimes commented 3 years ago

Sure, I will keep tracking it. From the log you provided I cannot figure out the reason. Please provide some info about how you deploy NFS in your cluster. It seems your NFS server and NFS client are on the same machine?

Thank you for your reply.

  1. If logs from other software are useful to you, I can provide them as well.

  2. I installed NFS through sudo apt install nfs-kernel-server.

  3. Yes, I deployed the NFS service on each node, because I don't have a dedicated storage server, and the physical hard disks of each node are not very large.

When this problem occurs, the affected node cannot mount the storage of any node, while other nodes can still mount the affected node's storage, so the problem is probably not caused by the NFS service itself.
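
For context, a minimal per-node export of that kind would look roughly like this (the export path and subnet below are illustrative, not taken from this cluster):

# /etc/exports on each node
/home/nfsroot  172.17.175.0/24(rw,sync,no_subtree_check,no_root_squash)

# reload the export table and verify
sudo exportfs -ra
sudo exportfs -v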

siaimes commented 3 years ago

Maybe we can integrate this script into openpai.

OrphanedPodFound

set -x
echo "Start to scanning Orphaned Pod. Orphaned directory will be umounted if it is mounted, and will be removed if it is empty."

source ${K8S_HOME}/bin/env.sh

IFS=$'\r\n'
for ((i=1; i<=5; i++));
do
    orphanExist="false"
    for podid in `grep "errors similar to this. Turn up verbosity to see them." /var/log/messages | tail -1 | awk '{print $12}' | sed 's/"//g'`;
    do
        echo $podid

        # not process if the volume directory is not exist.
        if [ ! -d ${KUBELET_HOME}/kubelet/pods/$podid/volumes/ ]; then
            continue
        fi

        # umount subpath if exist
        if [ -d ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ ]; then
            mountpath=`mount | grep ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ | awk '{print $3}'`
            for mntPath in $mountpath;
            do
                echo "umount subpath $mntPath"
                umount $mntPath
            done
        fi

        orphanExist="true"
        volumeTypes=`ls ${KUBELET_HOME}/kubelet/pods/$podid/volumes/`
        for volumeType in $volumeTypes;
        do
            subVolumes=`ls -A ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType`
            if [ "$subVolumes" != "" ]; then
                echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType contents volume: $subVolumes"
                for subVolume in $subVolumes;
                do
                    # check subvolume path is mounted or not
                    findmnt ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    if [ "$?" != "0" ]; then
                        echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume is not mounted, just need to remove"
                        rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    else
                        echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume is mounted, umount it"
                        umount ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                        rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    fi
                done
            else
                rm -rf ${KUBELET_HOME}/kubelet/pods/$podid
            fi
        done
    done
    if [ "$orphanExist" = "false" ]; then
        break
    fi
    sleep 2
done

This script may need to be modified appropriately according to openpai.

siaimes commented 3 years ago

Maybe we can integrate this script into openpai.

OrphanedPodFound

set -x
echo "Start to scanning Orphaned Pod. Orphaned directory will be umounted if it is mounted, and will be removed if it is empty."

source ${K8S_HOME}/bin/env.sh

IFS=$'\r\n'
for ((i=1; i<=5; i++));
do
    orphanExist="false"
    for podid in `grep "errors similar to this. Turn up verbosity to see them." /var/log/messages | tail -1 | awk '{print $12}' | sed 's/"//g'`;
    do
        echo $podid

        # not process if the volume directory is not exist.
        if [ ! -d ${KUBELET_HOME}/kubelet/pods/$podid/volumes/ ]; then
            continue
        fi

        # umount subpath if exist
        if [ -d ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ ]; then
            mountpath=`mount | grep ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ | awk '{print $3}'`
            for mntPath in $mountpath;
            do
                echo "umount subpath $mntPath"
                umount $mntPath
            done
        fi

        orphanExist="true"
        volumeTypes=`ls ${KUBELET_HOME}/kubelet/pods/$podid/volumes/`
        for volumeType in $volumeTypes;
        do
            subVolumes=`ls -A ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType`
            if [ "$subVolumes" != "" ]; then
                echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType contents volume: $subVolumes"
                for subVolume in $subVolumes;
                do
                    # check subvolume path is mounted or not
                    findmnt ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    if [ "$?" != "0" ]; then
                        echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume is not mounted, just need to remove"
                        rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    else
                        echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume is mounted, umount it"
                        umount ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                        rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    fi
                done
            else
                rm -rf ${KUBELET_HOME}/kubelet/pods/$podid
            fi
        done
    done
    if [ "$orphanExist" = "false" ]; then
        break
    fi
    sleep 2
done

This script may need to be modified appropriately according to openpai.

Since the official k8s project has not fixed the problem, our solution could be to monitor for the problem and fix it automatically. Maybe it can be integrated into the Prometheus system.

Binyang2014 commented 3 years ago

I think we should summarize the problem here:

  1. When this issue happens, the problem node cannot mount NFS through k8s, but other nodes can mount it.
  2. If you mount NFS manually on the problem node, the mount succeeds.
  3. Killing the pod and starting a new one does not fix this problem.
  4. Restarting the node does not fix this problem.

Are these descriptions correct?

siaimes commented 3 years ago

I think we should summarize the problem here:

  1. When this issue happens, the problem node cannot mount NFS through k8s, but other nodes can mount it.
  2. If you mount NFS manually on the problem node, the mount succeeds.
  3. Killing the pod and starting a new one does not fix this problem.
  4. Restarting the node does not fix this problem.

Are these descriptions correct?

Yes, that is exactly the behavior.

siaimes commented 3 years ago

I had this failure on another machine, so I finally had the opportunity to test the above script. The following script passed my test.

#!/bin/bash

#set -x

if [[ $(id -u) != 0 ]]; then
    echo Please run this script as root.
    exit 1
fi

echo "Start to scanning Orphaned Pod. Orphaned directory will be umounted if it is mounted, and will be removed if it is empty."

KUBELET_HOME="/var/lib"

IFS=$'\r\n'
orphanExist="true"
while [ "$orphanExist" = "true" ]
do
    orphanExist="false"
    podid=$(sudo systemctl status kubelet | grep "errors similar to this. Turn up verbosity to see them." | tail -1 | awk '{print $12}' | sed 's/"//g')

    # nothing to process if there is no orphaned pod.
    if [ "$podid" = "" ]; then
        continue
    fi

    echo "podid=$podid"

    # nothing to process if the volume directory does not exist.
    if [ ! -d ${KUBELET_HOME}/kubelet/pods/$podid/volumes/ ]; then
        continue
    fi

    orphanExist="true"

    # umount subpath if exist
    if [ -d ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ ]; then
        mountpath=`mount | grep ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ | awk '{print $3}'`
        for mntPath in $mountpath;
        do
            umount $mntPath
        done
    fi

    volumeTypes=`ls ${KUBELET_HOME}/kubelet/pods/$podid/volumes/`
    for volumeType in $volumeTypes;
    do
        subVolumes=`ls -A ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType`
        if [ "$subVolumes" != "" ]; then
            for subVolume in $subVolumes;
            do
                # check subvolume path is mounted or not
                findmnt ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                if [ "$?" = "0" ]; then
                    # subVolume is mounted, umount it
                    umount ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                fi
                # subVolume is not mounted now, just need to remove
                rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
            done
        else
            rm -rf ${KUBELET_HOME}/kubelet/pods/$podid
        fi
    done
    sleep 1
done

echo "Done."

After running this script, you must stop kubelet and restart the operating system; otherwise there will be errors similar to the following and the task still cannot be started.

sudo systemctl stop kubelet
sudo init 6

(screenshots of the errors attached)

Binyang2014 commented 3 years ago

When you meet this issue again, please run sudo df -h to check the filesystem types mounted by k8s. And you can change the kubelet log level to 5 to see more detailed logs.

To change the log level, please edit /etc/kubernetes/kubelet.env, change KUBE_LOG_LEVEL to "--v=5", then restart the kubelet.
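
Roughly, for example (assuming the kubelet.env layout above; the sed pattern is illustrative):

sudo sed -i 's/^KUBE_LOG_LEVEL=.*/KUBE_LOG_LEVEL="--v=5"/' /etc/kubernetes/kubelet.env
sudo systemctl restart kubelet
journalctl -u kubelet -f    # watch the more verbose log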

I will check the k8s code to find the root cause

siaimes commented 3 years ago

When you meet this issue again, please run sudo df -h to check the filesystem types mounted by k8s. And you can change the kubelet log level to 5 to see more detailed logs.

To change the log level, please edit /etc/kubernetes/kubelet.env, change KUBE_LOG_LEVEL to "--v=5", then restart the kubelet.

I will check the k8s code to find the root cause

Hi @Binyang2014

Unfortunately, this problem has reappeared, so I am back. Below is the information generated according to your description.

rzxt@rzxt-openpai-007:~$ sudo df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G   83M  3.1G   3% /run
/dev/sda2       220G   69G  141G  33% /
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/sda1        93M  3.6M   89M   4% /boot/efi
/dev/sdb1       1.8T  4.1G  1.7T   1% /home
overlay         220G   69G  141G  33% /mnt/docker/overlay2/bde1b548b7647a11350b286db850a09ddb51b1f9271c55dc33c7696311ff8bd5/merged
shm              64M     0   64M   0% /mnt/docker/containers/5dd6bdd0e923591aed3efe67bf9fe858feb738e69255a244fd814d8ce38d1d89/mounts/shm
overlay         220G   69G  141G  33% /mnt/docker/overlay2/3f14f89c321293c1274399016f4c9d73b9ae0ea12361540ba9d3d2f79da986d8/merged
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/e4c89b91-b22f-4564-8fbf-197247ccdf59/volumes/kubernetes.io~secret/kube-proxy-token-x6hkb
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/e1f77ad3-b6a8-4d83-aa99-d9cb63eff083/volumes/kubernetes.io~secret/default-token-g82dc
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/92f7519a-5e68-4d9f-a1c2-412362edae4f/volumes/kubernetes.io~secret/default-token-pfhpt
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/61903405-4315-49c8-9a0c-efcf2417b9e4/volumes/kubernetes.io~secret/default-token-pfhpt
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/e8272832-d7d6-4d23-9fc7-12c432bf8fe5/volumes/kubernetes.io~secret/default-token-g82dc
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/d927dde1-5321-446b-aa92-42ae0978af7a/volumes/kubernetes.io~secret/calico-node-token-qmwg4
tmpfs            16G   12K   16G   1% /var/lib/kubelet/pods/2b755705-2f47-4cee-bac0-fc1b9b6f6608/volumes/kubernetes.io~secret/default-token-g82dc
overlay         220G   69G  141G  33% /mnt/docker/overlay2/4982b0a0969f2fec6e6998a6d94c076f84bd67c5424e756799d302a6a570a260/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/1605115708c329115999fa815199cdc352941bd3a99551a45e0b67f712af61ab/merged
shm              64M     0   64M   0% /mnt/docker/containers/2139eb031d100fae4e6dcd05e68445f23ffa2daa61efd6f2e66f851dcb60c8be/mounts/shm
overlay         220G   69G  141G  33% /mnt/docker/overlay2/ae97c18bb8169a2845a1c84ba148dddef7856b7c30d3daf9ed8ab6a9acf7d58e/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/4df7f6328369e6fb9e2c2610c6aff8ba791ba67ba7aa348bbe7b4e7f26c6ec75/merged
shm              64M     0   64M   0% /mnt/docker/containers/7c6599f1f671eeeb9500b9f3c2693710a7399a5417b820bbd2d22628a06bbc6a/mounts/shm
shm              64M     0   64M   0% /mnt/docker/containers/bff2debd471d27d9905ba1e127b519a194d1c9473cc289769e13abcf53d7a057/mounts/shm
shm              64M     0   64M   0% /mnt/docker/containers/2c23ce5b9147f60436290b42e7e0fe263c96628069fd7178d326d46ac102e021/mounts/shm
overlay         220G   69G  141G  33% /mnt/docker/overlay2/7fa849fe778f6f2652a406631b562d6be4f0150187df55933b377e44f34c9ea5/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/8a6ec82212413a8bf37450bdd75e89e44e757eef5fc5f702272da0d4e223f9a6/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/ed0e98712008917c38efd8487d002c1fa47f7aa96da81410ba15c1ae713650c9/merged
shm              64M     0   64M   0% /mnt/docker/containers/7e28f7968ba49d613b4920e3c4bc9778bf40eeb41282646fd64d9b0a1386e8ad/mounts/shm
overlay         220G   69G  141G  33% /mnt/docker/overlay2/da32b775df7ce45cf1a98bafa73bfd1358be35fdbaa1753d471d55a67634f87e/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/2a074f84d30a0630ca509a7e754d4fcd2530b27593bf64b1d8d1e917a024fa20/merged
shm              64M     0   64M   0% /mnt/docker/containers/7a36361713345af0f9ae5ea9937009ef87a2706b730864ccb40675f32b57b465/mounts/shm
overlay         220G   69G  141G  33% /mnt/docker/overlay2/9fb22ddd226a8e167589cb1559666c2c7ee4d5fbb2bda5d8ac6ab6ace45cb058/merged
shm              64M     0   64M   0% /mnt/docker/containers/0d9f2312ca24da80b08746b8d18c2a50b016058b30ecaab2b8caa083582f5846/mounts/shm
overlay         220G   69G  141G  33% /mnt/docker/overlay2/b3227b9e291f14f53f9e5e7284b771144a74e3d9daf914404530f983674e0ee9/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/65cca2d13a2902fa3220c4352dc64def7ee9eb99994f028f13210b8e4c3b09b0/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/a02b5e89ecdf3872bde70b5d1451b29ef3a51c547a47548995e6f1c37c7bad38/merged
overlay         220G   69G  141G  33% /mnt/docker/overlay2/14a059aeec5d6fdd11c9dfff0e1919ec36aefc6482d9bae5655d71cf82c25519/merged
tmpfs           3.2G     0  3.2G   0% /run/user/1000
overlay         220G   69G  141G  33% /mnt/docker/overlay2/4c80c2df7df8e40f8b3f53b3e723458869ad315df8101c4c1da4a8a7c59034fd/merged

kubelet log

Binyang2014 commented 3 years ago

@siaimes from the log, it seems that after you restarted the kubelet this issue disappeared. Maybe you can keep the log level at 5 and provide the debug log when you meet this again. That would be very helpful.

And you can create an issue in the Kubernetes repo to see whether they have any idea about this.

siaimes commented 3 years ago

@siaimes from the log, it seems that after you restarted the kubelet this issue disappeared. Maybe you can keep the log level at 5 and provide the debug log when you meet this again. That would be very helpful.

Hi, @Binyang2014

I searched for the keywords Stopped Kubernetes Kubelet Server and Orphaned and found that the issue still exists after restarting kubelet.

(screenshots attached)

siaimes commented 3 years ago

And you can create an issue in the Kubernetes repo to see whether they have any idea about this.

There have been many similar issues in the Kubernetes project, but no fundamental solution has been found. The solution given by the community is what I currently use: delete the affected volume directories and restart the operating system.

fanyangCS commented 3 years ago

I suspect the problem is on the NFS side. For some reason, the pod cannot unmount the NFS share successfully (e.g., https://askubuntu.com/questions/292043/how-to-unmount-nfs-when-server-is-gone). This could potentially be an issue where the storage hardware driver isn't implemented correctly. Please also check the log on the NFS server side to see if you can find a lead.

To find out the root cause, as suggested, please turn on the verbose mode in the kubelet log.

Also, to reduce the problem space, I suggest removing the CNI following the instructions here: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-remove-the-k8s-network-plugin . OpenPAI does not require a CNI.

siaimes commented 3 years ago

I suspect the problem is on the NFS side. For some reason, the pod cannot unmount the NFS share successfully (e.g., https://askubuntu.com/questions/292043/how-to-unmount-nfs-when-server-is-gone). This could potentially be an issue where the storage hardware driver isn't implemented correctly. Please also check the log on the NFS server side to see if you can find a lead.

It seems that the NFS service does not output any logs.

(screenshot attached)

To find out the root cause, as suggested, please turn on the verbose mode in the kubelet log.

This log is output in verbose mode.

Also, to reduce the problem space, I suggest removing the CNI following the instructions here: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-remove-the-k8s-network-plugin . OpenPAI does not require a CNI.

These instructions involve a lot of operations, and it seems every machine needs to go through them once. If that is the case, it would be too risky to do on a production cluster. Moreover, the last part of the instructions is not detailed enough about how to modify the configuration file.