Closed siaimes closed 3 years ago
What is the volume titanxp32g3-volume? Could you check whether this PV works?
It is a shared directory exported via NFS, with the storage configured through a PV/PVC. It works well, and I can mount this storage on other CPU machines.
But is it the reason why OpenPAI jobs cannot start? Could you please test the PV using a pod in K8s, e.g. create a pod in K8s that mounts this PVC?
How do you perform this process? I tried to mount this storage on other machines, and it can be mounted successfully.
Maybe it is due to the storage service? @Binyang2014 please take a look.
There is a storage service on the master node that manages the PVs of job pods. Please check whether its status is OK.
rzxt@rzxt-openpai-007:~$ sudo systemctl status kubelet.service
[sudo] password for rzxt:
● kubelet.service - Kubernetes Kubelet Server
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-02-24 02:40:03 CST; 1 day 8h ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 5552 (kubelet)
Tasks: 0
Memory: 28.4M
CPU: 348ms
CGroup: /system.slice/kubelet.service
‣ 5552 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.98 --hostname-override=rzxt-openpai-007 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=gcr.io/googl
Feb 25 11:11:06 rzxt-openpai-007 kubelet[5552]: E0225 11:11:06.367101 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:06 rzxt-openpai-007 kubelet[5552]: I0225 11:11:06.451716 5552 setters.go:73] Using node IP: "172.17.175.98"
Feb 25 11:11:08 rzxt-openpai-007 kubelet[5552]: E0225 11:11:08.374146 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:10 rzxt-openpai-007 kubelet[5552]: E0225 11:11:10.367065 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:12 rzxt-openpai-007 kubelet[5552]: E0225 11:11:12.376115 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:14 rzxt-openpai-007 kubelet[5552]: E0225 11:11:14.367235 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:16 rzxt-openpai-007 kubelet[5552]: E0225 11:11:16.371503 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:16 rzxt-openpai-007 kubelet[5552]: I0225 11:11:16.461606 5552 setters.go:73] Using node IP: "172.17.175.98"
Feb 25 11:11:18 rzxt-openpai-007 kubelet[5552]: E0225 11:11:18.367051 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Feb 25 11:11:20 rzxt-openpai-007 kubelet[5552]: E0225 11:11:20.371045 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk : There were a total of 2 errors similar to this. Turn up verbosity to see them.
Maybe it is caused by this error.
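The orphaned-pod UIDs can be pulled out of the kubelet log mechanically. A minimal Python sketch (the sample line is taken verbatim from the kubelet log above; the function name is mine):

```python
import re

# Pod UIDs in the kubelet "Orphaned pod" messages are standard 36-character UUIDs.
UID_RE = re.compile(r'Orphaned pod "([0-9a-f-]{36})"')

def orphaned_pod_uids(log_text):
    """Return the unique pod UIDs mentioned in 'Orphaned pod' kubelet log lines."""
    return sorted(set(UID_RE.findall(log_text)))

sample = ('Feb 25 11:11:06 rzxt-openpai-007 kubelet[5552]: E0225 11:11:06.367101 '
          '5552 kubelet_volumes.go:154] Orphaned pod '
          '"26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are '
          'still present on disk')

print(orphaned_pod_uids(sample))  # ['26aeceed-147b-4615-90d3-e19f2e235111']
```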
rzxt@rzxt-openpai-master:~$ sudo docker ps -a | grep storage
e939a123aeed openpai/internal-storage-create "/bin/sh -c '/bin/ba…" 33 hours ago Up 33 hours k8s_internal-storage-create_internal-storage-create-ds-pqdtp_default_df78ae6e-fe06-4c64-b28b-6da551242810_0
e550c23113a7 gcr.io/google_containers/pause-amd64:3.1 "/pause" 33 hours ago Up 33 hours k8s_POD_internal-storage-create-ds-pqdtp_default_df78ae6e-fe06-4c64-b28b-6da551242810_0
@siaimes Please save the following pod spec to test.yaml, and use kubectl apply -f test.yaml to create it.
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
    - sleep
    - "100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: <test-nfs-pvc-name>
Please replace <test-nfs-pvc-name> with your PVC's name.
@siaimes, please check this issue: https://github.com/microsoft/pai/issues/4081, and the comment: https://github.com/microsoft/pai/issues/4081#issuecomment-569654558
When the storage server restarts, some resources are not released on the client side, so we need to release them manually. You can follow the comment to fix this issue.
I mounted the storage directly when I submitted the jobs, and the job did not start, so there is no way to unmount it.
Failed.
rzxt@rzxt-openpai-master:~$ nano test.yaml
rzxt@rzxt-openpai-master:~$ kubectl apply -f test.yaml
error: unable to recognize "test.yaml": an error on the server ("") has prevented the request from succeeding
rzxt@rzxt-openpai-master:~$ cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
    - "sleep 100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: titanxp32g3-volume
rzxt@rzxt-openpai-master:~$ nano test.yaml
rzxt@rzxt-openpai-master:~$ kubectl apply -f test.yaml
error: unable to recognize "test.yaml": an error on the server ("") has prevented the request from succeeding
rzxt@rzxt-openpai-master:~$ cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - image: ubuntu:18.04
    name: test-container
    volumeMounts:
    - name: test-nfs
      mountPath: /data
    command:
    - "sleep 100d"
  volumes:
  - name: test-nfs
    persistentVolumeClaim:
      claimName: titanxp32g3
rzxt@rzxt-openpai-master:~$
@siaimes, can you log into the problem node, find the NFS mount point and umount it manually? Or you can restart the problem worker node then check if this problem will be fixed.
rzxt@rzxt-openpai-007:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 123M 3.1G 4% /run
/dev/sda2 220G 47G 163G 23% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sdb1 1.8T 5.0G 1.7T 1% /home
/dev/sda1 93M 3.6M 89M 4% /boot/efi
tmpfs 3.2G 0 3.2G 0% /run/user/1000
Please provide the kubelet log: you can get the log by running journalctl -u kubelet
This is the log of kubelet, please analyze it if it is convenient for you, thank you.
The log contains this error:
Feb 24 21:07:45 rzxt-openpai-007 kubelet[5552]: E0224 21:07:45.609188 5552 desired_state_of_world_populator.go:298] Error processing volume "titanxp32g3-volume" for pod "b72fdd7a0f7e5d19812107762f386f9c-taskrole-0_default(85022050-bf85-42c4-b1b1-2ab9c98ce181)": error processing PVC "default"/"titanxp32g3": failed to fetch PVC default/titanxp32g3 from API server. err=persistentvolumeclaims "titanxp32g3" not found
Please make sure the PVC titanxp32g3 exists. You can get all PVCs by running kubectl get pvc in the dev-box.
root@devbox:/# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
gtx1080ti1 Bound gtx1080ti1-pv 800Gi RWX 20h
gtx1080ti2 Bound gtx1080ti2-pv 1536Gi RWX 57d
k40m Bound k40m-pv 1536Gi RWX 57d
k80 Bound k80-pv 3Ti RWX 57d
rtx2080ti Bound rtx2080ti-pv 30Ti RWX 37d
share Bound share-pv 30Ti RWX 37d
titanxp32g1 Bound titanxp32g1-pv 1536Gi RWX 57d
titanxp32g2 Bound titanxp32g2-pv 1536Gi RWX 57d
titanxp32g3 Bound titanxp32g3-pv 1536Gi RWX 20h
titanxp32g4 Bound titanxp32g4-pv 800Gi RWX 57d
titanxp32g5 Bound titanxp32g5-pv 800Gi RWX 57d
titanxp64g1 Bound titanxp64g1-pv 1536Gi RWX 57d
titanxp64g2 Bound titanxp64g2-pv 800Gi RWX 57d
titanxp64g3 Bound titanxp64g3-pv 800Gi RWX 57d
I want to delete the PVC and reconfigure it, but this command does not return. After I forcibly terminate the command and try to recreate the PVC, it says the object already exists.
root@devbox:/cluster-configuration/storage# kubectl delete -f titanxp32g3.yaml
persistentvolume "titanxp32g3-pv" deleted
persistentvolumeclaim "titanxp32g3" deleted
^X^C
root@devbox:/cluster-configuration/storage# kubectl create -f titanxp32g3.yaml
Error from server (AlreadyExists): error when creating "titanxp32g3.yaml": object is being deleted: persistentvolumes "titanxp32g3-pv" already exists
Error from server (AlreadyExists): error when creating "titanxp32g3.yaml": object is being deleted: persistentvolumeclaims "titanxp32g3" already exists
root@devbox:/cluster-configuration/storage#
Please try this answer from Stack Overflow to delete the PVC: https://stackoverflow.com/questions/51358856/kubernetes-cant-delete-persistentvolumeclaim-pvc
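That Stack Overflow fix boils down to clearing the finalizers on the stuck objects (the pv-protection/pvc-protection finalizers are visible later in this thread's describe output), typically via kubectl patch with the body {"metadata":{"finalizers":null}}. A hedged Python sketch of what that patch does to the object, using a hypothetical stuck PVC modeled on this thread:

```python
import copy
import json

def strip_finalizers(obj):
    """Mimic `kubectl patch ... -p '{"metadata":{"finalizers":null}}'`:
    with metadata.finalizers cleared, the API server can finish the deletion."""
    patched = copy.deepcopy(obj)
    patched["metadata"]["finalizers"] = None
    return patched

# Hypothetical stuck PVC (names taken from this issue; structure simplified).
pvc = {
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "titanxp32g3",
        "finalizers": ["kubernetes.io/pvc-protection"],
    },
}

patch_body = json.dumps({"metadata": {"finalizers": None}})
print(patch_body)
print(strip_finalizers(pvc)["metadata"]["finalizers"])  # None
```

Only the patch body is what actually goes over the wire; the function above just illustrates its effect on the stored object.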
Thanks, I have reconfigured the PVC, but the problem still exists. One thing to note: it is possible to submit a job with this PVC on a CPU-only machine.
Please attach the updated kubelet log. Maybe some library is missing on the GPU node.
here is the new log of kubelet. kubelet log
I have configured a PVC for each node, and now more than half of my nodes have this error, but I think the failure is not caused by the PVC itself: a job submitted on a normal node using the PVC of an abnormal node runs successfully. Abnormal nodes can run jobs without a PVC, but this error occurs whenever any PVC is used.
Can you mount NFS directly on the abnormal nodes? Please run sudo mount -v -t nfs4 nfs-server_ip:/nfs_path ./target_path and provide the output.
rzxt@rzxt-openpai-007:~$ sudo mount -v -t nfs4 172.17.175.98:/ /data
[sudo] password for rzxt:
mount.nfs4: timeout set for Fri Feb 26 20:23:30 2021
mount.nfs4: trying text-based options 'addr=172.17.175.98,clientaddr=172.17.175.98'
rzxt@rzxt-openpai-007:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 203M 3.0G 7% /run
/dev/sda2 220G 47G 163G 23% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sdb1 1.8T 5.0G 1.7T 1% /home
/dev/sda1 93M 3.6M 89M 4% /boot/efi
tmpfs 3.2G 0 3.2G 0% /run/user/1000
172.17.175.98:/ 1.8T 5.0G 1.7T 1% /data
Can you provide the titanxp32g3 PV and PVC config, I mean the titanxp32g3.yaml? The NFS mount works fine; we need to check whether the config is correct. And please run kubectl describe pv titanxp32g3-pv and kubectl describe pv titanxp32g3-pvc for more information.
root@devbox:/cluster-configuration/storage# cat titanxp32g3.yaml
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: titanxp32g3-pv
  labels:
    name: titanxp32g3
spec:
  capacity:
    storage: 1.5Ti
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - nfsvers=4.1
  nfs:
    path: /
    server: 172.17.175.98
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: titanxp32g3
  labels:
    share: "false"
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1.5Ti
  selector:
    matchLabels:
      name: titanxp32g3
root@devbox:/cluster-configuration/storage# kubectl describe pv titanxp32g3-pv
Name: titanxp32g3-pv
Labels: name=titanxp32g3
Annotations: pv.kubernetes.io/bound-by-controller: yes
Finalizers: [kubernetes.io/pv-protection]
StorageClass:
Status: Bound
Claim: default/titanxp32g3
Reclaim Policy: Retain
Access Modes: RWX
VolumeMode: Filesystem
Capacity: 1536Gi
Node Affinity: <none>
Message:
Source:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 172.17.175.98
Path: /
ReadOnly: false
Events: <none>
root@devbox:/cluster-configuration/storage# kubectl describe pv titanxp32g3-pvc
Error from server (NotFound): persistentvolumes "titanxp32g3-pvc" not found
root@devbox:/cluster-configuration/storage# kubectl describe pv titanxp32g3
Name: titanxp32g3-pv
Labels: name=titanxp32g3
Annotations: pv.kubernetes.io/bound-by-controller: yes
Finalizers: [kubernetes.io/pv-protection]
StorageClass:
Status: Bound
Claim: default/titanxp32g3
Reclaim Policy: Retain
Access Modes: RWX
VolumeMode: Filesystem
Capacity: 1536Gi
Node Affinity: <none>
Message:
Source:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 172.17.175.98
Path: /
ReadOnly: false
Events: <none>
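Note that the YAML requests 1.5Ti while describe reports 1536Gi; these are the same quantity, since Kubernetes Ti/Gi are binary units (1 Ti = 1024 Gi). A quick check with a tiny quantity parser (function name is mine, suffixes are the standard binary ones):

```python
# Binary suffixes used by Kubernetes resource quantities.
_SUFFIX = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def quantity_to_bytes(q):
    """Parse a Kubernetes-style binary quantity such as '1.5Ti' or '800Gi' into bytes."""
    for suffix, factor in _SUFFIX.items():
        if q.endswith(suffix):
            return int(float(q[:-len(suffix)]) * factor)
    return int(q)  # plain byte count

print(quantity_to_bytes("1.5Ti") == quantity_to_bytes("1536Gi"))  # True
```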
Maybe I found a temporary solution as follows:
sudo rm -r /var/lib/kubelet/pods/26aeceed-147b-4615-90d3-e19f2e235111 # for all Orphaned pods in kubelet log
Run this script repeatedly until all Orphaned pods are deleted.
#!/bin/bash
TEMP=$(sudo systemctl status kubelet.service | grep Orphaned | grep -o "[0-9a-z]\+-[0-9a-z]\+-[0-9a-z]\+-[0-9a-z]\+-[0-9a-z]\+" | tail -1)
sudo rm -r /var/lib/kubelet/pods/${TEMP}
Update: this script carries a risk of data loss.
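The data-loss risk comes from the possibility that a volume is still mounted under /var/lib/kubelet/pods/<uid>/volumes when the directory is removed: rm -r would then delete files through the live NFS mount. A hedged precaution, sketched in Python (the /proc/mounts content below is synthetic, modeled on the PV in this thread; only remove the directory once this returns False, or after unmounting):

```python
def has_live_mounts(pod_dir, mounts_text):
    """Return True if any mount point in /proc/mounts-style text lies under pod_dir."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: <source> <mount point> <fstype> <options> <dump> <pass>
        if len(fields) >= 2 and fields[1].startswith(pod_dir.rstrip("/") + "/"):
            return True
    return False

# Synthetic /proc/mounts content for illustration.
mounts = "\n".join([
    "/dev/sda2 / ext4 rw 0 0",
    "172.17.175.98:/ /var/lib/kubelet/pods/26aeceed-147b-4615-90d3-e19f2e235111"
    "/volumes/kubernetes.io~nfs/titanxp32g3-pv nfs4 rw 0 0",
])

pod_dir = "/var/lib/kubelet/pods/26aeceed-147b-4615-90d3-e19f2e235111"
print(has_live_mounts(pod_dir, mounts))  # True: unmount first, then remove
```

On a real node the text would come from open("/proc/mounts").read() rather than the synthetic string above.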
Thanks @siaimes, it looks like a k8s issue. I found many similar issues in the k8s repo: https://github.com/kubernetes/kubernetes/issues/60987. We will keep tracking this issue.
There was no power outage in my data center today, but several machines had the same failure. This temporary solution is not a long-term fix. I don't know why the k8s team has not fixed this bug. Can we fix it ourselves?
I may have found the reason this problem occurs without a power outage: a user submits a new job, and the job is stopped by the user while the image is being pulled. Releasing resources at that point fails.
rzxt@rzxt-openpai-015:~$ sudo systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet Server
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2021-03-04 22:30:21 CST; 18min ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 5845 (kubelet)
Tasks: 0
Memory: 31.0M
CPU: 785ms
CGroup: /system.slice/kubelet.service
‣ 5845 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.17.175.106 --hostname-override=rzxt-openpai-015 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/
Mar 04 22:48:22 rzxt-openpai-015 kubelet[5845]: I0304 22:48:22.040315 5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "2d3634b2a47c: Downloading [=> ] 41.85MB/1.608GB"
Mar 04 22:48:24 rzxt-openpai-015 kubelet[5845]: I0304 22:48:24.643073 5845 setters.go:73] Using node IP: "172.17.175.106"
Mar 04 22:48:32 rzxt-openpai-015 kubelet[5845]: I0304 22:48:32.040302 5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "2d3634b2a47c: Downloading [=> ] 56.88MB/1.608GB"
Mar 04 22:48:34 rzxt-openpai-015 kubelet[5845]: I0304 22:48:34.656844 5845 setters.go:73] Using node IP: "172.17.175.106"
Mar 04 22:48:42 rzxt-openpai-015 kubelet[5845]: I0304 22:48:42.040283 5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "6d5dc792bcdf: Downloading [============> ] 216.6MB/885.7MB"
Mar 04 22:48:42 rzxt-openpai-015 kubelet[5845]: I0304 22:48:42.404006 5845 kubelet_getters.go:177] status for pod nginx-proxy-rzxt-openpai-015 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-03-02 11:41:05 +0800 CST } {Read
Mar 04 22:48:42 rzxt-openpai-015 kubelet[5845]: I0304 22:48:42.607538 5845 endpoint.go:111] State pushed for device plugin github.com/fuse
Mar 04 22:48:44 rzxt-openpai-015 kubelet[5845]: I0304 22:48:44.670189 5845 setters.go:73] Using node IP: "172.17.175.106"
Mar 04 22:48:52 rzxt-openpai-015 kubelet[5845]: I0304 22:48:52.040274 5845 kube_docker_client.go:342] Pulling image "nasir6/mmdetection:latest": "2d3634b2a47c: Downloading [==> ] 90.1MB/1.608GB"
Mar 04 22:48:54 rzxt-openpai-015 kubelet[5845]: I0304 22:48:54.681846 5845 setters.go:73] Using node IP: "172.17.175.106"
rzxt@rzxt-openpai-015:~$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─docker-dns.conf, docker-options.conf, http-proxy.conf
Active: active (running) since Thu 2021-03-04 22:29:02 CST; 20min ago
Docs: http://docs.docker.com
Main PID: 1594 (dockerd)
Tasks: 0
Memory: 119.7M
CPU: 319ms
CGroup: /system.slice/docker.service
‣ 1594 /usr/bin/dockerd --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --iptables=false --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --dns 10.192.0.3 --dn
Mar 04 22:29:38 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:38.371590386+08:00" level=warning msg="36d245354901358bc4d39580596b5dcd3691280336313c5cfec9389dd8d02d8c cleanup: failed to unmount IPC: umount /mnt/docker/containers/36d245354901358b
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907114517+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907210410+08:00" level=warning msg="42302ad9fa0d102dd3b07ce884f4e9c5e369516c28b3a2359770d3f19fa44e22 cleanup: failed to unmount IPC: umount /mnt/docker/containers/42302ad9fa0d102d
Mar 04 22:29:59 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:59.604464306+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668361253+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668447312+08:00" level=warning msg="dae97114deceb0a564380f5a8a1217be78cbfd30289aede25133f0f082e35eff cleanup: failed to unmount IPC: umount /mnt/docker/containers/dae97114deceb0a5
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618282760+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618398068+08:00" level=warning msg="2d3f060ac46009b50d01d7bafb31962572ae006e1a8f4eb906bfb204a564fd08 cleanup: failed to unmount IPC: umount /mnt/docker/containers/2d3f060ac46009b5
Mar 04 22:41:44 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:41:44.524715721+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:43:26 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:43:26.541074204+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
rzxt@rzxt-openpai-015:~$
Thanks for this reply. But the error from the docker daemon seems unrelated to NFS; it happens when docker pulls the image and tries to mount the image content. Can you use journalctl -u docker to get the full error logs? The message seems to be truncated.
The NFS mount happens when the container starts: k8s mounts the NFS share and maps it into the container.
-- Logs begin at Thu 2021-03-04 22:28:46 CST, end at Fri 2021-03-05 11:06:23 CST. --
Mar 04 22:29:01 rzxt-openpai-015 systemd[1]: Starting Docker Application Container Engine...
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.049809143+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.049849877+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050230391+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050239770+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050306787+08:00" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0 <nil>}]" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050324220+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050327669+08:00" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0 <nil>}]" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050350552+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050364824+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420080ee0, CONNECTING" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.050374626+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420932100, CONNECTING" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.052276400+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420080ee0, READY" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.052628335+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420932100, READY" module=grpc
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.371628743+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429221305+08:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429388997+08:00" level=warning msg="Your kernel does not support swap memory limit"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429424396+08:00" level=warning msg="Your kernel does not support cgroup rt period"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429431413+08:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.429908821+08:00" level=info msg="Loading containers: start."
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.728894651+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.846090675+08:00" level=info msg="Loading containers: done."
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.923707526+08:00" level=info msg="Docker daemon" commit=2d0083d graphdriver(s)=overlay2 version=18.09.7
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.924007478+08:00" level=info msg="Daemon has completed initialization"
Mar 04 22:29:02 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:02.931389900+08:00" level=info msg="API listen on /var/run/docker.sock"
Mar 04 22:29:02 rzxt-openpai-015 systemd[1]: Started Docker Application Container Engine.
Mar 04 22:29:38 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:38.371526341+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:29:38 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:38.371590386+08:00" level=warning msg="36d245354901358bc4d39580596b5dcd3691280336313c5cfec9389dd8d02d8c cleanup: failed to unmount IPC: umount /mnt/docker/containers/36d245354901358bc4d39580596b5dcd3691280336313c5cfec9389dd8d02d8c/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907114517+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:29:58 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:58.907210410+08:00" level=warning msg="42302ad9fa0d102dd3b07ce884f4e9c5e369516c28b3a2359770d3f19fa44e22 cleanup: failed to unmount IPC: umount /mnt/docker/containers/42302ad9fa0d102dd3b07ce884f4e9c5e369516c28b3a2359770d3f19fa44e22/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:29:59 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:29:59.604464306+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668361253+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:30:42 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:30:42.668447312+08:00" level=warning msg="dae97114deceb0a564380f5a8a1217be78cbfd30289aede25133f0f082e35eff cleanup: failed to unmount IPC: umount /mnt/docker/containers/dae97114deceb0a564380f5a8a1217be78cbfd30289aede25133f0f082e35eff/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618282760+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:39:49 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:39:49.618398068+08:00" level=warning msg="2d3f060ac46009b50d01d7bafb31962572ae006e1a8f4eb906bfb204a564fd08 cleanup: failed to unmount IPC: umount /mnt/docker/containers/2d3f060ac46009b50d01d7bafb31962572ae006e1a8f4eb906bfb204a564fd08/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 22:41:44 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:41:44.524715721+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:43:26 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:43:26.541074204+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 22:58:28 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T22:58:28.505896244+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:03:04 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:03:04.006664233+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:03:04 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:03:04.006812302+08:00" level=warning msg="89682056eb99c7cb2ecb0f34727e40deef16f2e3ed15bed43ab86592aeae2cbe cleanup: failed to unmount IPC: umount /mnt/docker/containers/89682056eb99c7cb2ecb0f34727e40deef16f2e3ed15bed43ab86592aeae2cbe/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 23:03:12 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:03:12.176132005+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 04 23:04:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:04:18.629548539+08:00" level=info msg="Container 5d0bbfd9f26f1c93c57bf65c0f63b01f3c3f362f1493622d0ea03a3986594aba failed to exit within 30 seconds of signal 15 - using the force"
Mar 04 23:04:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:04:18.743206847+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:04:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:04:18.871492949+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:33:09 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:33:09.009068853+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 04 23:33:09 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:33:09.009272168+08:00" level=warning msg="7b4b70bc38d0d887afbf071bf02184ec53c39616ebf67b3a70eba3d3dcf075c4 cleanup: failed to unmount IPC: umount /mnt/docker/containers/7b4b70bc38d0d887afbf071bf02184ec53c39616ebf67b3a70eba3d3dcf075c4/mounts/shm, flags: 0x2: no such file or directory"
Mar 04 23:33:18 rzxt-openpai-015 dockerd[1594]: time="2021-03-04T23:33:18.530873405+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
Mar 05 09:51:06 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:51:06.953423607+08:00" level=info msg="Container 80baa09ae76ea375f6c30854e784e6401f01e76e218fe0afcf8de48dff347ae5 failed to exit within 30 seconds of signal 15 - using the force"
Mar 05 09:51:07 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:51:07.233128555+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 09:51:07 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:51:07.359051322+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 09:52:11 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:52:11.129549057+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 09:52:11 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:52:11.129675875+08:00" level=warning msg="1c9499ff87f4c2e590e0cc5300268d977f024ab28e7849e4e68e6b125a19e113 cleanup: failed to unmount IPC: umount /mnt/docker/containers/1c9499ff87f4c2e590e0cc5300268d977f024ab28e7849e4e68e6b125a19e113/mounts/shm, flags: 0x2: no such file or directory"
Mar 05 09:52:14 rzxt-openpai-015 dockerd[1594]: time="2021-03-05T09:52:14.965708283+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
@Binyang2014 This problem occurs frequently, even when there has been no power outage. Could you continue to track it?
Thank you very much.
Sure, I will keep tracking. From the log you provided, I cannot figure out the reason. Please provide information about how you deployed NFS in your cluster. It seems your NFS server and NFS client are on the same machine?
Sure, I will keep tracking. From the log you provided, I cannot figure out the reason. Please provide information about how you deployed NFS in your cluster. It seems your NFS server and NFS client are on the same machine?
Thank you for your reply.
If the logs of any other software are useful to you, I can provide them as well.
I installed NFS with sudo apt install nfs-kernel-server.
Yes, I deployed the NFS service on each node, because I don't have a dedicated storage server and the physical disks of the individual nodes are not very large.
When this problem occurs, the affected node cannot mount storage from any node, while other nodes can still mount the affected node's storage, so the problem is probably not caused by the NFS service itself.
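For reference, a per-node NFS export of this kind is usually a single line in /etc/exports on each node; the path and subnet below are assumptions for illustration, not taken from this cluster:

```text
# /etc/exports (sketch; replace path and subnet with your own)
/data/share 172.17.0.0/16(rw,sync,no_subtree_check)
```

After editing, `sudo exportfs -ra` reloads the exports, and `showmount -e <node-ip>` from another machine verifies that the export is visible to clients.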
Maybe we can integrate this script into openpai.
set -x
echo "Start scanning for orphaned pods. An orphaned directory will be unmounted if it is mounted, and removed if it is empty."
source ${K8S_HOME}/bin/env.sh
IFS=$'\r\n'
for ((i=1; i<=5; i++)); do
    orphanExist="false"
    for podid in $(grep "errors similar to this. Turn up verbosity to see them." /var/log/messages | tail -1 | awk '{print $12}' | sed 's/"//g'); do
        echo "$podid"
        # Skip if the volume directory does not exist.
        if [ ! -d ${KUBELET_HOME}/kubelet/pods/$podid/volumes/ ]; then
            continue
        fi
        # Unmount subpaths if any exist.
        if [ -d ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ ]; then
            mountpath=$(mount | grep ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ | awk '{print $3}')
            for mntPath in $mountpath; do
                echo "umount subpath $mntPath"
                umount $mntPath
            done
        fi
        orphanExist="true"
        volumeTypes=$(ls ${KUBELET_HOME}/kubelet/pods/$podid/volumes/)
        for volumeType in $volumeTypes; do
            subVolumes=$(ls -A ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType)
            if [ "$subVolumes" != "" ]; then
                echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType contains volumes: $subVolumes"
                for subVolume in $subVolumes; do
                    # Check whether the subvolume path is mounted.
                    findmnt ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    if [ "$?" != "0" ]; then
                        echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume is not mounted, just remove it"
                        rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    else
                        echo "${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume is mounted, umount it"
                        umount ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                        rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                    fi
                done
            else
                rm -rf ${KUBELET_HOME}/kubelet/pods/$podid
            fi
        done
    done
    if [ "$orphanExist" = "false" ]; then
        break
    fi
    sleep 2
done
This script may need to be adapted appropriately for OpenPAI.
Maybe we can integrate this script into openpai.
This script may need to be modified appropriately according to openpai.
Since the official k8s project has not fixed this problem, our solution could be to monitor for it and repair it automatically. Perhaps this could be integrated into the Prometheus system.
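If monitoring is the route taken, a minimal sketch of what such a check could look like, here aimed at the Prometheus node-exporter textfile collector; the metric name and the collector setup are assumptions, not existing OpenPAI components:

```shell
#!/bin/bash
# Sketch: count orphaned pods from kubelet log text and emit a metric in
# the node-exporter textfile-collector format. Assumed metric name.

count_orphaned_pods() {
    # Read kubelet log lines on stdin and count distinct orphaned pod UIDs.
    grep -o 'Orphaned pod "[^"]*"' | sort -u | wc -l
}

write_metric() {
    printf '# TYPE kubelet_orphaned_pods gauge\nkubelet_orphaned_pods %s\n' "$1"
}

# Demonstration with a captured log line; on a real node this would be
#   journalctl -u kubelet --since "-5 min" | count_orphaned_pods
# with the output redirected into the textfile-collector directory.
sample='kubelet[5552]: E0225 11:11:06.367101 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk'
write_metric "$(printf '%s\n' "$sample" | count_orphaned_pods)"
```

An alert rule on `kubelet_orphaned_pods > 0` would then flag affected nodes before jobs pile up in the waiting state.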
I think we should summarize the problem here:
- When this issue happens, the affected node cannot mount NFS through k8s, but other nodes can mount it.
- If you mount NFS manually on the affected node, the mount succeeds.
- Killing the pod and starting a new one does not fix the problem.
- Restarting the node does not fix the problem.
Is this description correct?
Yes, that is exactly the behavior.
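For reference, the first point, mounting the PVC through k8s rather than manually, can be checked with a minimal test pod, as suggested earlier in the thread. The pod name here is made up, and the claimName must be replaced with the actual PVC (e.g. the titanxp32g3-volume mentioned at the top of the thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-mount-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: busybox
    command: ["sh", "-c", "ls /mnt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /mnt
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: titanxp32g3-volume
```

Apply it with `kubectl apply -f test-pod.yaml` and then `kubectl describe pod pvc-mount-test`; if the pod hangs in ContainerCreating with mount errors on the affected node, the issue is reproduced without involving OpenPAI.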
I had this failure on another machine, which gave me the opportunity to test the script above. The following revised version passed my tests.
#!/bin/bash
#set -x
if [[ $(id -u) != 0 ]]; then
    echo "Please run this script as root."
    exit 1
fi
echo "Start scanning for orphaned pods. An orphaned directory will be unmounted if it is mounted, and removed if it is empty."
KUBELET_HOME="/var/lib"
IFS=$'\r\n'
orphanExist="true"
while [ "$orphanExist" = "true" ]; do
    orphanExist="false"
    podid=$(sudo systemctl status kubelet | grep "errors similar to this. Turn up verbosity to see them." | tail -1 | awk '{print $12}' | sed 's/"//g')
    # Nothing to do if there is no orphaned pod.
    if [ "$podid" = "" ]; then
        continue
    fi
    echo "podid=$podid"
    # Skip if the volume directory does not exist.
    if [ ! -d ${KUBELET_HOME}/kubelet/pods/$podid/volumes/ ]; then
        continue
    fi
    orphanExist="true"
    # Unmount subpaths if any exist.
    if [ -d ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ ]; then
        mountpath=$(mount | grep ${KUBELET_HOME}/kubelet/pods/$podid/volume-subpaths/ | awk '{print $3}')
        for mntPath in $mountpath; do
            umount $mntPath
        done
    fi
    volumeTypes=$(ls ${KUBELET_HOME}/kubelet/pods/$podid/volumes/)
    for volumeType in $volumeTypes; do
        subVolumes=$(ls -A ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType)
        if [ "$subVolumes" != "" ]; then
            for subVolume in $subVolumes; do
                # Check whether the subvolume path is mounted.
                findmnt ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                if [ "$?" = "0" ]; then
                    # The subvolume is mounted; umount it first.
                    umount ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
                fi
                # The subvolume is not mounted now; remove it.
                rm -rf ${KUBELET_HOME}/kubelet/pods/$podid/volumes/$volumeType/$subVolume
            done
        else
            rm -rf ${KUBELET_HOME}/kubelet/pods/$podid
        fi
    done
    sleep 1
done
echo "Done."
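One caveat: both scripts extract the pod UID with `awk '{print $12}'`, which depends on the exact length of the log prefix. A sketch of an extraction keyed to the message text instead (the sample line is the kubelet log line quoted earlier in this thread):

```shell
#!/bin/bash
# Extract orphaned pod UIDs by matching the log message itself rather
# than a fixed whitespace-separated field position.

extract_orphaned_uids() {
    grep -o 'Orphaned pod "[0-9a-f-]*"' | tr -d '"' | awk '{print $3}' | sort -u
}

line='Feb 25 11:11:06 rzxt-openpai-007 kubelet[5552]: E0225 11:11:06.367101 5552 kubelet_volumes.go:154] Orphaned pod "26aeceed-147b-4615-90d3-e19f2e235111" found, but volume paths are still present on disk'
printf '%s\n' "$line" | extract_orphaned_uids
# prints 26aeceed-147b-4615-90d3-e19f2e235111
```

This keeps working whether the line comes from /var/log/messages, journalctl, or systemctl status output.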
After running this script, you must stop kubelet and restart the operating system; otherwise the errors above will reappear and tasks still cannot start:
sudo systemctl stop kubelet
sudo init 6
When you meet this issue again, please run sudo df -h
to check the filesystem types mounted by k8s.
You can also change the kubelet log level to 5 to see more detailed logs.
To change the log level, edit /etc/kubernetes/kubelet.env
and change it to KUBE_LOG_LEVEL="--v=5"
Then restart the kubelet.
I will check the k8s code to find the root cause.
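The log-level change described above can be scripted. A sketch run against a temporary file so it is safe to execute anywhere; on a real node the target is the /etc/kubernetes/kubelet.env file named above, and the sample content here is an assumption:

```shell
#!/bin/bash
# Demonstrate the KUBE_LOG_LEVEL edit on a throwaway copy. On the node:
#   sudo sed -i 's/^KUBE_LOG_LEVEL=.*/KUBE_LOG_LEVEL="--v=5"/' /etc/kubernetes/kubelet.env
#   sudo systemctl restart kubelet

f=$(mktemp)
printf 'KUBE_LOG_LEVEL="--v=2"\n' > "$f"   # sample content, not the real file

# Raise the verbosity from 2 to 5.
sed -i 's/^KUBE_LOG_LEVEL=.*/KUBE_LOG_LEVEL="--v=5"/' "$f"

grep KUBE_LOG_LEVEL "$f"   # prints KUBE_LOG_LEVEL="--v=5"
rm -f "$f"
```

Anchoring the pattern at `^KUBE_LOG_LEVEL=` keeps the edit from touching any other line in the file.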
Hi @Binyang2014
Unfortunately, this problem has reappeared, so I am back. Below is the information generated as you described.
rzxt@rzxt-openpai-007:~$ sudo df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 83M 3.1G 3% /run
/dev/sda2 220G 69G 141G 33% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda1 93M 3.6M 89M 4% /boot/efi
/dev/sdb1 1.8T 4.1G 1.7T 1% /home
overlay 220G 69G 141G 33% /mnt/docker/overlay2/bde1b548b7647a11350b286db850a09ddb51b1f9271c55dc33c7696311ff8bd5/merged
shm 64M 0 64M 0% /mnt/docker/containers/5dd6bdd0e923591aed3efe67bf9fe858feb738e69255a244fd814d8ce38d1d89/mounts/shm
overlay 220G 69G 141G 33% /mnt/docker/overlay2/3f14f89c321293c1274399016f4c9d73b9ae0ea12361540ba9d3d2f79da986d8/merged
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/e4c89b91-b22f-4564-8fbf-197247ccdf59/volumes/kubernetes.io~secret/kube-proxy-token-x6hkb
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/e1f77ad3-b6a8-4d83-aa99-d9cb63eff083/volumes/kubernetes.io~secret/default-token-g82dc
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/92f7519a-5e68-4d9f-a1c2-412362edae4f/volumes/kubernetes.io~secret/default-token-pfhpt
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/61903405-4315-49c8-9a0c-efcf2417b9e4/volumes/kubernetes.io~secret/default-token-pfhpt
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/e8272832-d7d6-4d23-9fc7-12c432bf8fe5/volumes/kubernetes.io~secret/default-token-g82dc
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/d927dde1-5321-446b-aa92-42ae0978af7a/volumes/kubernetes.io~secret/calico-node-token-qmwg4
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/2b755705-2f47-4cee-bac0-fc1b9b6f6608/volumes/kubernetes.io~secret/default-token-g82dc
overlay 220G 69G 141G 33% /mnt/docker/overlay2/4982b0a0969f2fec6e6998a6d94c076f84bd67c5424e756799d302a6a570a260/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/1605115708c329115999fa815199cdc352941bd3a99551a45e0b67f712af61ab/merged
shm 64M 0 64M 0% /mnt/docker/containers/2139eb031d100fae4e6dcd05e68445f23ffa2daa61efd6f2e66f851dcb60c8be/mounts/shm
overlay 220G 69G 141G 33% /mnt/docker/overlay2/ae97c18bb8169a2845a1c84ba148dddef7856b7c30d3daf9ed8ab6a9acf7d58e/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/4df7f6328369e6fb9e2c2610c6aff8ba791ba67ba7aa348bbe7b4e7f26c6ec75/merged
shm 64M 0 64M 0% /mnt/docker/containers/7c6599f1f671eeeb9500b9f3c2693710a7399a5417b820bbd2d22628a06bbc6a/mounts/shm
shm 64M 0 64M 0% /mnt/docker/containers/bff2debd471d27d9905ba1e127b519a194d1c9473cc289769e13abcf53d7a057/mounts/shm
shm 64M 0 64M 0% /mnt/docker/containers/2c23ce5b9147f60436290b42e7e0fe263c96628069fd7178d326d46ac102e021/mounts/shm
overlay 220G 69G 141G 33% /mnt/docker/overlay2/7fa849fe778f6f2652a406631b562d6be4f0150187df55933b377e44f34c9ea5/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/8a6ec82212413a8bf37450bdd75e89e44e757eef5fc5f702272da0d4e223f9a6/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/ed0e98712008917c38efd8487d002c1fa47f7aa96da81410ba15c1ae713650c9/merged
shm 64M 0 64M 0% /mnt/docker/containers/7e28f7968ba49d613b4920e3c4bc9778bf40eeb41282646fd64d9b0a1386e8ad/mounts/shm
overlay 220G 69G 141G 33% /mnt/docker/overlay2/da32b775df7ce45cf1a98bafa73bfd1358be35fdbaa1753d471d55a67634f87e/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/2a074f84d30a0630ca509a7e754d4fcd2530b27593bf64b1d8d1e917a024fa20/merged
shm 64M 0 64M 0% /mnt/docker/containers/7a36361713345af0f9ae5ea9937009ef87a2706b730864ccb40675f32b57b465/mounts/shm
overlay 220G 69G 141G 33% /mnt/docker/overlay2/9fb22ddd226a8e167589cb1559666c2c7ee4d5fbb2bda5d8ac6ab6ace45cb058/merged
shm 64M 0 64M 0% /mnt/docker/containers/0d9f2312ca24da80b08746b8d18c2a50b016058b30ecaab2b8caa083582f5846/mounts/shm
overlay 220G 69G 141G 33% /mnt/docker/overlay2/b3227b9e291f14f53f9e5e7284b771144a74e3d9daf914404530f983674e0ee9/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/65cca2d13a2902fa3220c4352dc64def7ee9eb99994f028f13210b8e4c3b09b0/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/a02b5e89ecdf3872bde70b5d1451b29ef3a51c547a47548995e6f1c37c7bad38/merged
overlay 220G 69G 141G 33% /mnt/docker/overlay2/14a059aeec5d6fdd11c9dfff0e1919ec36aefc6482d9bae5655d71cf82c25519/merged
tmpfs 3.2G 0 3.2G 0% /run/user/1000
overlay 220G 69G 141G 33% /mnt/docker/overlay2/4c80c2df7df8e40f8b3f53b3e723458869ad315df8101c4c1da4a8a7c59034fd/merged
@siaimes From the log, it seems the issue disappeared after you restarted the kubelet. Maybe you can keep the log level at 5 and provide the debug log when you meet this again. That would be very helpful.
You could also create an issue in the Kubernetes repo to see whether anyone there has an idea about this.
@siaimes From the log, it seems the issue disappeared after you restarted the kubelet. Maybe you can keep the log level at 5 and provide the debug log when you meet this again. That would be very helpful.
Hi, @Binyang2014
I searched for the keywords Stopped Kubernetes Kubelet Server
and Orphaned
and found that the issue still exists after restarting kubelet.
You could also create an issue in the Kubernetes repo to see whether anyone there has an idea about this.
There have been many similar issues in the Kubernetes project, but no fundamental solution has emerged. The workaround given by the community is what I currently use: delete the problematic volumes and restart the operating system.
I suspect the problem is on the NFS side. For some reason, the pod cannot umount the NFS server successfully (e.g., https://askubuntu.com/questions/292043/how-to-unmount-nfs-when-server-is-gone). This potentially could be an issue when the storage hardware driver isn't implemented correctly. Please also check the log on the NFS server side to see if you can find some lead.
To find out the root cause, as suggested, please turn on the verbose mode in the kubelet log.
Also, to reduce the problem space, I suggest removing the CNI following the instructions here: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-remove-the-k8s-network-plugin . OpenPAI does not require CNI.
I suspect the problem is on the NFS side. For some reason, the pod cannot umount the NFS server successfully (e.g., https://askubuntu.com/questions/292043/how-to-unmount-nfs-when-server-is-gone). This potentially could be an issue when the storage hardware driver isn't implemented correctly. Please also check the log on the NFS server side to see if you can find some lead.
It seems that the NFS service does not output any logs.
To find out the root cause, as suggested, please turn on the verbose mode in the kubelet log.
The log I provided was already output in verbose mode.
Also, to reduce the problem space, I suggest to remove the CNI following the instruction here: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-faqs-and-troubleshooting.html#how-to-remove-the-k8s-network-plugin . OpenPAI does not require CNI.
That guide involves a lot of operations, and it seems every machine needs to go through them once. If so, it would be too risky to do on a production cluster. Moreover, the last part of the guide, about modifying the configuration file, is not detailed enough.
Organization Name: Fuzhou University
Short summary about the issue/question: An unexpected power outage caused the OpenPAI system to keep my jobs waiting. I have tried restarting the k8s service, restarting the node, and restarting the entire OpenPAI system, but it didn't help.
Brief what process you are following: After an unexpected power outage, submit a job.
How to reproduce it:
OpenPAI Environment:
OS (uname -a): Linux rzxt-openpai-007 4.4.0-197-generic #229-Ubuntu SMP Wed Nov 25 11:05:42 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Anything else we need to know: