ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray core] misleading error message about ray worker OOM #42864

Open gtarcoder opened 5 months ago

gtarcoder commented 5 months ago

What happened + What you expected to happen

My Ray job often fails with the following events:

[screenshot: worker failure events]

All of those events say: "Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 1. The worker may have exceeded K8s pod memory limits. The process receives a SIGTERM". It seems those workers failed due to OOM.

But after investigation, I find there is no OOM issue. Here are two pieces of evidence:

  1. Logging in to one of the k8s pods and running the 'dmesg -T' command shows no oom-kill messages related to my Ray job/actor/task/worker.

  2. Using PromQL (avg by (pod_name,xhs_zone) (container_memory_working_set_bytes{pod_name=~"redray-search-mmoe-bv3-.*"}) / avg by (pod_name,xhs_zone) (kube_pod_container_resource_limits_memory_bytes{pod_name=~"redray-search-mmoe-bv3-.*"})) to calculate the memory usage of the k8s pods shows that pod memory usage is not high:

[screenshot: PromQL graph of pod memory usage vs. limit]
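For completeness, a third cross-check (my own debugging sketch, not anything from Ray) is to read the pod's cgroup memory accounting directly from inside a worker pod. The file paths below are assumptions about the default cgroup v1/v2 mounts and may differ on your node image:

```python
# Debugging sketch: read container memory usage vs. limit straight from the
# cgroup filesystem inside the pod. Paths are assumptions (cgroup v1 first,
# then cgroup v2).
from pathlib import Path

def read_cgroup_bytes(paths):
    for p in paths:
        f = Path(p)
        if f.exists():
            text = f.read_text().strip()
            if text == "max":          # cgroup v2 uses "max" for "no limit"
                return None
            return int(text)
    return None

usage = read_cgroup_bytes([
    "/sys/fs/cgroup/memory/memory.usage_in_bytes",  # cgroup v1
    "/sys/fs/cgroup/memory.current",                # cgroup v2
])
limit = read_cgroup_bytes([
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
])

if usage is not None and limit is not None:
    print(f"pod memory usage: {usage / limit:.1%} of the limit")
else:
    print("could not determine usage/limit from the cgroup files")
```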

So I think the Ray job's error message is incorrect, but I do not know where the actual problem lies.

Versions / Dependencies

ray 2.9.1

Reproduction script

no

Issue Severity

High: It blocks me from completing my task.

gtarcoder commented 5 months ago

Any suggestions? @bveeramani @ericl

bveeramani commented 5 months ago

I'm not sure if I'm the best person to help here. @rynewang would you mind taking a look?

rynewang commented 5 months ago

The ground truth is that the worker received a SIGTERM. "k8s OOM" is a speculation that may not be true, and in this case is not. Ray does not use SIGTERM to kill workers, so the signal comes from somewhere else. Do you have any guesses about what might be sending SIGTERM to the workers?

gtarcoder commented 5 months ago

> The ground truth is that the worker received a SIGTERM. "k8s OOM" is a speculation that may not be true, and in this case is not. Ray does not use SIGTERM to kill workers, so the signal comes from somewhere else. Do you have any guesses about what might be sending SIGTERM to the workers?

I went through all the logs of the Ray actors/tasks and could not find anything useful about this SIGTERM.
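One thing I'm considering as a next step is a small debugging sketch (my own idea, assuming I can edit the actor/task code; it overrides whatever SIGTERM handling the worker process already has) that logs the moment SIGTERM arrives so I can correlate it with kubelet evictions, node drains, or deploys. The handler cannot tell which process sent the signal, only when it arrived and what the worker was doing:

```python
# Debugging sketch: timestamp SIGTERM arrival inside a Ray worker and dump the
# current Python stacks, then exit with code 1 to match the observed behavior.
# This overrides the process's existing SIGTERM handling, so use it only while
# investigating.
import faulthandler
import os
import signal
import sys
import time

def log_sigterm(signum, frame):
    print(
        f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] pid {os.getpid()} received SIGTERM",
        file=sys.stderr,
        flush=True,
    )
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
    sys.exit(1)

# Call this near the top of the actor __init__ or task body.
signal.signal(signal.SIGTERM, log_sigterm)
```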

jjyao commented 5 months ago

@gtarcoder any suggestions on how we can improve the error message to make it clear that it's only a guess?

jjyao commented 5 months ago

Verify the behavior when the pod is OOM-killed and when the OOM killer inside the pod kills the worker process, then change the error message to match the actual behavior.

gtarcoder commented 5 months ago

> Verify the behavior when the pod is OOM-killed and when the OOM killer inside the pod kills the worker process, then change the error message to match the actual behavior.

Maybe it's better to just remove the hint "The worker may have exceeded K8s pod memory limits.".
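For what it's worth, my understanding is that the kernel OOM killer terminates processes with SIGKILL, and a pod-level OOM shows up in Kubernetes as an OOMKilled container (exit code 137), so a worker that ended on a SIGTERM is unlikely to have been OOM-killed at all. A tiny illustration (not Ray code) of how the two signals look different in an exit status:

```python
# Illustration only: a child killed by SIGKILL vs. SIGTERM reports a different
# (negative) return code, which is how a supervisor can tell the cases apart.
import signal
import subprocess

for sig in (signal.SIGKILL, signal.SIGTERM):
    child = subprocess.Popen(["sleep", "60"])
    child.send_signal(sig)
    child.wait()
    # Popen.returncode is -N when the child was terminated by signal N.
    print(f"{sig.name}: returncode={child.returncode}")
# Expected: SIGKILL -> -9, SIGTERM -> -15
```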

Btw, debugging a Ray job is hard enough as it is.