microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

[bug] Containers cannot exit normally, but frameworks exit and reallocate the resources they occupy. #2152

Closed hirolsj closed 5 years ago

hirolsj commented 5 years ago

Organization Name: NEL-BITA

Short summary about the issue/question:

Due to hardware, OS-level, or Docker problems, containers cannot exit normally when the framework exits.

The key problem is that the resources occupied by these containers (GPU, ports, and memory) are reallocated, resulting in frequent errors in user-submitted jobs, for example GPU OOM and already-occupied ports.

Therefore, our request is to improve the framework exit mechanism: detect containers that do not exit normally and isolate the resources they occupy.

fanyangCS commented 5 years ago

What version are you using? We have fixed many of these issues in 0.9.y.

hirolsj commented 5 years ago

0.8.3

yqwang-ms commented 5 years ago

I remember the NM will detect unmanaged GPU and port usage and avoid allocating resources on them. @mzmssg for the YARN-related investigation.

fanyangCS commented 5 years ago

@hirolsj, can we sync privately? We'd like to have first-hand information (e.g., SSH access to the servers) to understand what is going on.

hirolsj commented 5 years ago

OK, contact me by email at shengjun@leinao.ai.

scarlett2018 commented 5 years ago

@hirolsj - For online discussion of issues, please join https://gitter.im/Microsoft/pai. Gitter has iOS and Android clients and is also accessible from within China. The PAI devs will answer questions in the Gitter chat room.

fanyangCS commented 5 years ago

https://github.com/Microsoft/pai/pull/1646 is the PR addressing the zombie container issue.

fanyangCS commented 5 years ago

The customer reports that a process in a job container goes into the D state (uninterruptible sleep) because of blocked I/O operations, probably due to an NFS issue (see https://eklitzke.org/uninterruptible-sleep). When this happens, the Docker container cannot exit and Docker itself is in an unhealthy state; the admin has to reboot the server to fix the issue.
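
For admins who want to check a node for this condition, here is a minimal sketch that lists processes stuck in the D state. It only walks /proc; it is not a PAI or YARN tool.

```python
import os

def d_state_processes():
    """List (pid, comm) for processes in uninterruptible sleep."""
    stuck = []
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % pid) as f:
                stat = f.read()
        except IOError:
            continue  # process exited while we were scanning
        # comm is parenthesised and may contain spaces, so parse around ')'
        comm = stat[stat.index('(') + 1:stat.rindex(')')]
        state = stat[stat.rindex(')') + 2]
        if state == 'D':
            stuck.append((int(pid), comm))
    return stuck

if __name__ == '__main__':
    for pid, comm in d_state_processes():
        print('PID %d (%s) is in uninterruptible sleep (D state)' % (pid, comm))
```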

hirolsj commented 5 years ago

Thanks!

mzmssg commented 5 years ago

A workaround to avoid rebooting is to kill the shim process (the parent process of the container's root process); the host init process will then adopt the D-state processes and the container can exit.

Of course, it's not recommended.
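
For what it's worth, a rough sketch of that workaround. The container name is a placeholder, and again, this is a last resort, not a recommended path.

```python
import os
import signal
import subprocess

def kill_container_shim(container):
    """Kill the shim that parents `container`'s root process (last resort)."""
    # Ask Docker for the container's root process PID on the host.
    out = subprocess.check_output(
        ['docker', 'inspect', '--format', '{{.State.Pid}}', container])
    root_pid = int(out.decode().strip())
    # The shim (e.g. docker-containerd-shim) is that process's parent.
    with open('/proc/%d/status' % root_pid) as f:
        ppid = next(int(line.split()[1]) for line in f
                    if line.startswith('PPid:'))
    if ppid > 1:  # never kill init itself
        os.kill(ppid, signal.SIGKILL)
    # The orphaned D-state processes are re-parented to init, and the
    # container can now be reaped.

# Example (container name is a placeholder):
# kill_container_shim('stuck-job-container')
```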

mzmssg commented 5 years ago

Containers should be in some status like "stopping" in this case, but currently they transition to "finished" directly.

@yqwang-ms The GPU conflict is because the NM only detects memory leaks, with no other detection. The port conflict is a known bug to be fixed: https://github.com/Microsoft/pai/issues/1983

yqwang-ms commented 5 years ago

@mzmssg I synced with Qingcha, who implemented the GPU code and ported the port code; he said the NM has both GPU and port detection. As for the port issue, it should be easy to fix; please take a look.

yqwang-ms commented 5 years ago

I synced with Qingcha again, and we confirmed that the current unmanaged GPU detection does not work.

mzmssg commented 5 years ago

@hirolsj Some more details to help you understand this topic: currently we have two Docker containers associated with one YARN container. One runs the YARN container script (container A); the other is the real job container (container B). When killing a job, YARN only cares about A and kills it; B then detects this, terminates itself, and releases its resources. The issue happens when B's self-termination is blocked by kernel-mode processes.

Back to your question: why can't we keep the container until its resources are released? Because we rely on YARN, and YARN only best-effort kills a container with SIGKILL. As you know, that doesn't work for 'D'-state processes.
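
To make the A/B relationship concrete, here is a minimal sketch of the watchdog pattern container B follows. The container name and poll interval are placeholders; this is not the actual PAI container script.

```python
import subprocess
import sys
import time

def container_alive(name):
    # `docker inspect` exits non-zero if the container no longer exists;
    # otherwise we check its Running flag.
    try:
        out = subprocess.check_output(
            ['docker', 'inspect', '--format', '{{.State.Running}}', name])
    except subprocess.CalledProcessError:
        return False
    return out.strip() == b'true'

if __name__ == '__main__':
    while container_alive('yarn-container-A'):  # placeholder name
        time.sleep(5)
    # A is gone: release our resources by exiting. If one of our processes
    # is stuck in the D state, this exit never completes -- the bug above.
    sys.exit(0)
```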

fanyangCS commented 5 years ago

According to @mzmssg, here is a possible fix. 1) In the YARN NM, do not change the state to container-completed until the NM makes sure the YARN container has quit. Currently the NM sends a SIGKILL and believes the YarnContainer must exit, so it changes the state without verifying the YarnContainer's actual state (that's why it is best-effort). Normally this is OK, but not in your case. A possible solution is to modify the NM's state machine so that the NM waits in a state until the YarnContainer exits.

2) In the YarnContainer, do not quit until it makes sure the DockerContainer has quit. Unfortunately, the YarnContainer will be terminated if the NM sends a SIGKILL; there is no way to ignore it. One way around this is to modify the NM so that it does not send SIGKILL (this is dangerous in general, but OK if you only run the YarnContainerScript, as PAI does).

For 1), we need to be careful to ensure the fix to the NM's state machine is correct, and that requires time. For 2), the fix will reduce the generality of PAI: it implies that a container managed by the NM will honor other signals (e.g., SIGTERM), but a malicious container can exploit this to become a rogue container that holds on to its resources indefinitely.

My suggestion is to fix the uninterruptible-sleep issue rather than applying 1) and 2). If you cannot fix that, 1) and 2) can be a workaround, but we cannot take them as a general solution at the current stage. The recommended solution is to have an alert for such zombie containers: once you receive the alert, you decommission the node, reboot it, and recommission it. Although there is a chance that YARN mistakenly schedules a job on the node before you decommission it, the system will retry the job automatically (without burning the retry count).

Does that make sense?
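
For reference, a rough sketch of that decommission/reboot/recommission cycle using YARN's standard exclude-file mechanism. The exclude-file path is a placeholder and must match yarn.resourcemanager.nodes.exclude-path in your yarn-site.xml; this is not a PAI-provided tool.

```python
import subprocess

EXCLUDE_FILE = '/etc/hadoop/yarn.exclude'  # placeholder; must match yarn-site.xml

def set_excluded(host, excluded):
    """Add or remove `host` from the YARN exclude file, then refresh the RM."""
    try:
        with open(EXCLUDE_FILE) as f:
            hosts = set(f.read().split())
    except IOError:
        hosts = set()
    (hosts.add if excluded else hosts.discard)(host)
    with open(EXCLUDE_FILE, 'w') as f:
        f.write('\n'.join(sorted(hosts)) + '\n')
    # Standard YARN admin command: make the RM re-read the exclude list.
    subprocess.check_call(['yarn', 'rmadmin', '-refreshNodes'])

def handle_zombie_alert(host):
    set_excluded(host, True)    # decommission: no new containers land here
    # ... reboot the host out of band (e.g., via IPMI or SSH), then:
    set_excluded(host, False)   # recommission once the node is healthy again
```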

yqwang-ms commented 5 years ago

BTW, the NM is NOT so "best effort": if the NM does not restart, it will wait until it has confirmed that the root process (i.e., the container A mentioned by mzmssg above) has exited, and only then tell the RM to release its resources.

Another option is to detect unmanaged resource usage, i.e., resources that are in use but not allocated by YARN, so that the RM will not allocate containers on them.

hirolsj commented 5 years ago

@fanyangCS OK, that's a reasonable solution. It would be nice to have an alert and a tool (to decommission/recommission nodes) for such cases.

Thanks!

mzmssg commented 5 years ago

@hirolsj The decommission tool will be added in v0.10. We are focusing on the alert now.

@yqwang-ms It's an option, but there is latency in the hardware detection, so jobs might still be scheduled to this node.

yqwang-ms commented 5 years ago

I have not dived into the code, but it seems not, really: the total utilized resource is always unchanged, and the RM should never allocate resources on it. So at first sight I cannot see the latency or race condition here; please dive into the code if you have time and point out where the latency you mentioned is.

mzmssg commented 5 years ago

> the total utilized resource is always unchanged, and the RM should never allocate resources on it

Does "the total utilized resource" mean resources utilized by external processes, such as zombie containers or host processes? If so, it would only change when the hardware detection runs.

Or am I misunderstanding something?

yqwang-ms commented 5 years ago

The total utilized resource means the machine's total utilization, including all utilization on the machine: container utilization + YARN-external utilization + YARN's own utilization, etc.

yqwang-ms commented 5 years ago

By the word "always" I mean across the whole duration in which a managed YARN container becomes unmanaged external processes.

And in this case, before the NM wrongly releases the container, Total Used (1) = Container GPU Used (1) + External Used (0); after the NM wrongly releases the container, Total Used (1) = Container GPU Used (0) + External Used (1). You can see that Total Used (1) is unchanged, so at no time should the RM allocate a new container on the already occupied "Total Used (1)".
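
In other words, the invariant looks like this (illustrative values only, not real APIs):

```python
# Sketch of the accounting argument above, with made-up numbers.
total_used = 1                 # busy GPUs as seen directly from the OS/driver

# Before the NM wrongly releases the container:
container_used = 1
external_used = total_used - container_used    # 0

# After the wrong release (the process is still stuck in D state):
container_used = 0
external_used = total_used - container_used    # 1

# total_used never changed, so a scheduler keyed on total_used never sees
# a window in which the stuck GPU looks free.
assert total_used == container_used + external_used
```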

mzmssg commented 5 years ago

@yqwang-ms That's the point: after the NM wrongly releases the container, we need the transition to be

(container[1], external[0]) -> (container[0], external[1])

But actually it might be

(container[1], external[0]) -> (container[0], external[0]) -> (container[0], external[1])

The first -> is the NM releasing the container; the second -> is the NM detecting the zombie container.

I remember the NM implementation is: one thread detects the hardware and stores the result in memory, and another thread heartbeats the info to the RM.

If the heartbeat thread runs after the container is released but before the detection thread, the report would be in the intermediate state (container[0], external[0]).

yqwang-ms commented 5 years ago

Note that you always get Total Used directly from the OS, and External Used is then derived as Total Used - Container Used. Either way, Total Used is unchanged. The intermediate state (container[0], external[0]) can be avoided because you can send Total Used to the RM, and at any point in time Total Used is unchanged.
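
A minimal sketch of that reporting scheme. RMClient and read_gpu_total_used are placeholders, not real YARN or PAI APIs.

```python
class RMClient:
    def report(self, payload):
        print('heartbeat:', payload)  # stand-in for the real NM->RM RPC

def read_gpu_total_used():
    # Placeholder: in practice, query the driver (e.g. NVML) for busy GPUs.
    return 1

def send_heartbeat(rm, container_used):
    total_used = read_gpu_total_used()   # one atomic observation from the OS
    # Report the raw total; let the RM derive External = Total - Container.
    # Because total_used never dips during the zombie transition, there is
    # no window in which the stuck GPU looks schedulable.
    rm.report({'total_used': total_used, 'container_used': container_used})

send_heartbeat(RMClient(), container_used=0)
```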

mzmssg commented 5 years ago

I get you. So the NM only reports the utilized GPUs, the RM unions that with the GPUs it has allocated, and the remainder is what is available to the scheduler?

yqwang-ms commented 5 years ago

Yes

yqwang-ms commented 5 years ago

BTW, I am not sure what you mean by "only report"; I suggest keeping the current report info and adding new fields if required.

fanyangCS commented 5 years ago

Closing as there has been no response for a long time.