Status: Closed (closed by Binyang2014 3 years ago)
@Binyang2014, please first make sure the OpenPAI service pods are of a higher QoS class than job pods. In some cases the service pods get evicted.
We may need to mark these pods as critical to achieve this: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
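A minimal sketch of what marking a service pod as critical could look like, following the doc linked above. The DaemonSet name and image are illustrative, not the actual OpenPAI manifests; `system-node-critical` is a built-in PriorityClass.

```yaml
# Hypothetical DaemonSet fragment for an OpenPAI service pod (names illustrative).
# Pods with `system-node-critical` rank among the last candidates for
# preemption/eviction. Note: some Kubernetes versions only allow the
# system-* priority classes in the kube-system namespace.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: job-exporter            # illustrative name
spec:
  selector:
    matchLabels:
      app: job-exporter
  template:
    metadata:
      labels:
        app: job-exporter
    spec:
      priorityClassName: system-node-critical
      containers:
        - name: job-exporter
          image: openpai/job-exporter   # illustrative image
```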
Checked the QoS classes. Currently the job-exporter and node-exporter QoS classes are `Burstable`, the log-manager QoS class is `BestEffort`, and the user job pod QoS class is `Guaranteed`.
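For reference, the QoS class is not set directly; Kubernetes derives it from the container resource spec. A minimal illustration (pod and image names are made up):

```yaml
# Guaranteed: every container sets limits, and requests equal limits.
# Burstable:  at least one container sets a request or limit, but the
#             pod does not meet the Guaranteed criteria.
# BestEffort: no container sets any requests or limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo-guaranteed     # illustrative
spec:
  containers:
    - name: app
      image: busybox            # illustrative
      resources:
        limits:
          cpu: "500m"
          memory: 256Mi
        requests:
          cpu: "500m"
          memory: 256Mi
```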
For kubelet node-pressure eviction, `BestEffort` pods are evicted first, along with `Burstable` pods whose usage exceeds their requests. `Guaranteed` and `Burstable` pods whose usage is beneath their requests are evicted last. (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/?spm=a2c65.11461447.0.0.6a497eafo7oGQp#evicting-end-user-pods)
Among `Guaranteed` and `Burstable` pods whose resource usage does not exceed their requests, eviction is ordered by pod priority.
In this case the resource under pressure is disk, and no pod claims requests for it. So the eviction order is ranked by pod priority, then by resource usage. We can get the eviction order from the log:
```
eviction manager: must evict pod(s) to reclaim ephemeral-storage
eviction_manager.go:362] eviction manager: pods ranked for eviction: user-job-pod, job-exporter-zgf4j_default(e9e0a4a5-660a-493b-ac4b-a95f8977867a),
nginx-proxy-prodk80bg000012_kube-system(6d20234d7a8eda76fb23d52d6f743b77),
log-manager-ds-9tflc_default(67956c0d-ba89-4aa9-a14e-ed2fbbcd915e),
k8s-host-device-plugin-daemonset-kqh67_kube-system(8853e257-8f09-46ca-b372-2632cb94eea5),
blobfuse-flexvol-installer-vrqdg_kube-system(dd5a903a-9a38-4f27-b3d8-00bed955c9e9),
node-exporter-2ks4s_default(326bb034-a011-4960-baef-6eb6fa7c9f24),
kube-proxy-l7gzb_kube-system(82351f89-30b4-4a20-b679-0a15f213b999),
nvidia-device-plugin-daemonset-j2f8z_kube-system(9d98eaed-cc87-43c4-a514-82c147833843)
```
Since the user job usually consumes more disk, it is evicted first. But we use a hostPath volume for the log folder, so the logs stay on the node after eviction and evicting the user job will not solve the problem. The kubelet then continues to evict PAI service pods.
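A sketch of the kind of mount involved (the path comes from this issue; the pod and volume names are illustrative). Because the logs live on the node's filesystem, they survive the pod's eviction, so evicting the pod does not reclaim the disk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: user-job-pod            # illustrative
spec:
  containers:
    - name: job
      image: user-job-image     # illustrative
      volumeMounts:
        - name: pai-logs
          mountPath: /var/log/pai
  volumes:
    - name: pai-logs
      hostPath:
        path: /var/log/pai      # node directory; data persists after eviction
        type: DirectoryOrCreate
```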
To leverage the k8s eviction policy and avoid disk pressure, we'd better not store job logs on each host. It's better to use the default k8s log mechanism, then use fluentd to ship the logs to a centralized storage server. (https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures)
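A minimal sketch of the node-level logging-agent pattern from that doc, assuming a fluentd agent reads the container logs the kubelet writes under `/var/log` and forwards them to a backend (image tag, names, and output configuration are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      tolerations:              # run on every node
        - operator: Exists
      containers:
        - name: fluentd
          image: fluent/fluentd:v1.14   # illustrative tag
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log      # kubelet writes container logs here
```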
Also, our log-manager is misconfigured: it rotates logs by time instead of by size. After reconfiguring the log-manager and fixing some bugs, this issue can be mitigated.
@Binyang2014 Can we update the deployment script to avoid future misconfiguration? (At least we should update the documentation.)
Moreover, it seems we need the following:
Anything more?
Since the QoS class is assigned by k8s according to the pod resource `request`/`limit`, we can't change the job pod QoS class directly (changing the `request`/`limit` values may affect the scheduler). We can keep the job pod QoS class as `Guaranteed`, and use pod priority to control the eviction rank.
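A sketch of that approach, assuming we define a PriorityClass for PAI service pods that is higher than the job pods' priority (the name, value, and description are illustrative):

```yaml
# Hypothetical PriorityClass for OpenPAI service pods. Under node
# pressure, lower-priority job pods are then evicted before service pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: pai-service-priority    # illustrative name
value: 100000                   # must exceed the job pods' priority
globalDefault: false
description: "Priority for OpenPAI service pods"
---
# Service pods reference it in their spec:
#   spec:
#     priorityClassName: pai-service-priority
```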
We can do the following:
Closing, this issue is already fixed.
PAI keeps user job logs under /var/log/pai. If a user job writes too many logs, it causes disk pressure on the machine.
We need to: