microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

User jobs writing too many logs cause disk pressure #4694

Closed Binyang2014 closed 3 years ago

Binyang2014 commented 4 years ago

PAI keeps user job logs under /var/log/pai. If a user job writes too many logs, it will cause disk pressure on the machine.

We need to:

  1. Make the log path configurable, so we can store user logs on a larger disk
  2. Investigate how to kill such offending jobs
fanyangCS commented 4 years ago

Related to https://github.com/microsoft/pai/issues/3765 and https://github.com/microsoft/pai/issues/3340

fanyangCS commented 4 years ago

@Binyang2014, please first make sure the OpenPAI service pods have a higher QoS class than the job pods. In some cases the service pods get evicted.

Binyang2014 commented 4 years ago

> @Binyang2014, please first make sure the OpenPAI service pods have a higher QoS class than the job pods. In some cases the service pods get evicted.

We may need to mark these pods as critical to achieve this: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
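For reference, marking a service pod as critical boils down to giving it one of the built-in priority classes from that page. A minimal sketch of what this could look like, assuming a DaemonSet-style service; the name, labels, and image below are illustrative, not the real OpenPAI manifests:

```yaml
# Illustrative only: a cluster-critical DaemonSet. Depending on the cluster
# version, the built-in critical priority classes may only be allowed in the
# kube-system namespace. Names, labels, and image are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-manager-ds          # hypothetical name, not the real OpenPAI manifest
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-manager
  template:
    metadata:
      labels:
        app: log-manager
    spec:
      priorityClassName: system-node-critical   # built-in class for node-critical pods
      containers:
      - name: log-manager
        image: example/log-manager:latest        # placeholder image
        resources:
          requests:             # setting requests also lifts QoS from BestEffort to Burstable
            cpu: 100m
            memory: 128Mi
```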

Binyang2014 commented 4 years ago

Checked the QoS classes. Currently, the job-exporter and node-exporter QoS class is Burstable, the log-manager QoS class is BestEffort, and the user job pod QoS class is Guaranteed. For k8s node eviction, BestEffort pods are evicted first, while Guaranteed pods and Burstable pods whose usage is beneath their requests are evicted last. (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#evicting-end-user-pods)

Among Guaranteed and Burstable pods whose resource usage does not exceed their requests, eviction is ordered by pod priority.

In this case the resource under pressure is disk, and none of the pods claim requests for it. So the eviction order is ranked by pod priority first, then by resource usage. We can get the eviction order from the kubelet log:

eviction manager: must evict pod(s) to reclaim ephemeral-storage
eviction_manager.go:362] eviction manager: pods ranked for eviction: user-job-pod, job-exporter-zgf4j_default(e9e0a4a5-660a-493b-ac4b-a95f8977867a),
nginx-proxy-prodk80bg000012_kube-system(6d20234d7a8eda76fb23d52d6f743b77),
log-manager-ds-9tflc_default(67956c0d-ba89-4aa9-a14e-ed2fbbcd915e),
k8s-host-device-plugin-daemonset-kqh67_kube-system(8853e257-8f09-46ca-b372-2632cb94eea5),
blobfuse-flexvol-installer-vrqdg_kube-system(dd5a903a-9a38-4f27-b3d8-00bed955c9e9),
node-exporter-2ks4s_default(326bb034-a011-4960-baef-6eb6fa7c9f24),
kube-proxy-l7gzb_kube-system(82351f89-30b4-4a20-b679-0a15f213b999),
nvidia-device-plugin-daemonset-j2f8z_kube-system(9d98eaed-cc87-43c4-a514-82c147833843)

Since the user job usually consumes more disk, it is evicted first. But because we use a hostPath volume for the log folder, evicting the user job does not free the space, and kubelet then continues to evict PAI service pods.
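For context, the log folder is mounted roughly like the sketch below (the host path is the one from this issue; the pod name, image, and container path are illustrative). hostPath data lives on the node's own filesystem, outside the pod's ephemeral-storage accounting, so it is not removed when the pod is evicted:

```yaml
# Illustrative sketch only: a job container writing logs to a hostPath volume.
# The data under /var/log/pai stays on the node disk, outside the pod's
# ephemeral-storage accounting, so evicting the pod leaves the files behind.
apiVersion: v1
kind: Pod
metadata:
  name: user-job-pod                 # hypothetical name
spec:
  containers:
  - name: job
    image: example/user-job:latest   # placeholder image
    volumeMounts:
    - name: job-logs
      mountPath: /usr/local/pai/logs # hypothetical in-container log path
  volumes:
  - name: job-logs
    hostPath:
      path: /var/log/pai             # node-local path mentioned in this issue
      type: DirectoryOrCreate
```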

To leverage the k8s eviction policy to avoid disk pressure, we'd better not store job logs on each host. It's better to use the default k8s logging mechanism, and then use fluentd to ship logs to a centralized storage server. (https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures)
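In that architecture, kubelet writes container stdout/stderr under /var/log/pods (so job log usage counts against the job's own ephemeral storage), and a node-level logging agent runs as a DaemonSet that tails those files and forwards them to external storage. A rough, non-authoritative sketch of that shape, with the image and backend left as placeholders:

```yaml
# Illustrative sketch of a node-level logging agent as a DaemonSet. The image,
# configuration, and output backend are placeholders; nothing here reflects an
# actual OpenPAI deployment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-logging-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd-logging-agent
  template:
    metadata:
      labels:
        app: fluentd-logging-agent
    spec:
      tolerations:
      - operator: Exists             # run on every node, including tainted ones
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.14  # placeholder tag; a real setup pins an image with the needed output plugins
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log             # kubelet keeps container logs under /var/log/pods
```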

Binyang2014 commented 4 years ago

Also, our log-manager is misconfigured: it rotates logs according to time instead of size. After reconfiguring the log-manager and fixing some bugs, this issue can be mitigated.

fanyangCS commented 4 years ago

@Binyang2014 Can we update the deployment script to avoid future misconfiguration? (At least we should update the document)

Moreover, it seems we need the following:

  1. Set the QoS class of the job pod to the "lowest" (BestEffort); set the QoS class of the log-manager to Burstable (same as the other OpenPAI services)
  2. To avoid mis-evicting the wrong job pods, we still need a watchdog to kill the offending job pod
  3. Leverage the k8s log mechanism

Anything more?

Binyang2014 commented 4 years ago

Since the QoS class is assigned by k8s according to the pod's resource requests/limits, we can't change the job pod's QoS class directly (changing the request/limit values may affect scheduling). We can keep the job pod's QoS class as Guaranteed and use pod priority to control the eviction order.
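For reference, the QoS class is derived purely from the resource spec: a pod where every container has requests equal to limits is Guaranteed, a pod with only some requests/limits set is Burstable, and a pod with none is BestEffort. A minimal illustrative example (the values are arbitrary):

```yaml
# Illustrative only: the resource spec alone decides the QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo                # hypothetical pod
spec:
  containers:
  - name: main
    image: example/app:latest   # placeholder image
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"                # requests == limits for every resource in every
        memory: 1Gi             # container  =>  QoS class is Guaranteed
```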

We can do the following:

  1. Set the QoS class of the log-manager to Burstable, and fix some configuration errors in logrotate.
  2. Give PAI service pods higher priority than job pods, to make sure they are not evicted before the job pods (see the sketch after this list).
  3. Add a watchdog to kill the offending job pod, or leverage the k8s log mechanism. If we leverage the k8s log mechanism, k8s will kill the offending job for us.
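A rough sketch of what item 2 could look like with a custom PriorityClass; the class name, value, and pod details here are made up for illustration:

```yaml
# Illustrative only: a cluster-scoped PriorityClass for PAI service pods and a
# service pod that references it. The name and value are hypothetical.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: pai-service-priority    # hypothetical name
value: 1000000                  # higher than the default (0) priority of job pods
globalDefault: false
description: "Priority class for OpenPAI service pods"
---
apiVersion: v1
kind: Pod
metadata:
  name: job-exporter-example    # hypothetical service pod
spec:
  priorityClassName: pai-service-priority
  containers:
  - name: exporter
    image: example/job-exporter:latest   # placeholder image
```

With everything else equal, kubelet would then rank the job pod (default priority 0) ahead of the service pods when reclaiming disk.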
Binyang2014 commented 3 years ago

Closed, this issue has already been fixed.