[Open] gtarcoder opened 5 months ago
Any suggestions? @bveeramani @ericl
I'm not sure if I'm the best person to help here. @rynewang would you mind taking a look?
The ground truth is that it received a SIGTERM. "k8s oom" is a speculation that may not be true, and in this case is not. Ray does not use SIGTERM to kill workers, so the signal comes from somewhere else. Do you have any guesses about who may be sending SIGTERM to the workers?
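One way to gather evidence on the sender is to trap SIGTERM inside the suspect workers and log exactly when it arrives, then correlate the timestamp with kubelet, node, and autoscaler logs. A minimal sketch, assuming the workload can be wrapped in a Ray task and that the task body runs in the worker's main thread (where Python allows installing signal handlers); `traced_task`, `_log_sigterm`, and the 60-second sleep are placeholders, not part of Ray's API:

```python
import signal
import sys
import time
import traceback

import ray


def _log_sigterm(signum, frame):
    # Record wall-clock time and the current stack so the SIGTERM can be
    # correlated with kubelet/node events, then exit with the conventional
    # 128 + signal-number code so the death is still visible to Ray.
    print(f"[{time.strftime('%F %T')}] worker received SIGTERM", file=sys.stderr)
    traceback.print_stack(frame, file=sys.stderr)
    sys.exit(128 + int(signum))


@ray.remote
def traced_task(seconds: float) -> str:
    # Install the handler at the start of the task body.
    signal.signal(signal.SIGTERM, _log_sigterm)
    time.sleep(seconds)  # stand-in for the real workload
    return "done"


if __name__ == "__main__":
    ray.init()
    print(ray.get(traced_task.remote(60.0)))
```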
I went through all the logs of the Ray actors/tasks and could not find any message related to this SIGTERM.
@gtarcoder any suggestions on how we can improve the error message to make it clear that it's only a guess?
Verify the behaviors when the pod is OOM-killed and when the oom-killer inside the pod kills the worker process. Then change the error message to match the observed behavior.
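For the in-pod half of that verification, a minimal sketch, assuming it is run inside a pod with a memory limit: a child process allocates until the limit is hit, and the parent reports how the child died. The kernel oom-killer sends SIGKILL, so the expected observation is signal 9 (surfaced by k8s as exit code 137) rather than SIGTERM; the 64 MiB step size is arbitrary.

```python
import subprocess
import sys

# Child process that grows its memory footprint until something kills it.
HOG = r"""
import time
blocks = []
while True:
    blocks.append(bytearray(64 * 1024 * 1024))  # grab 64 MiB per step
    time.sleep(0.05)
"""

proc = subprocess.run([sys.executable, "-c", HOG])
if proc.returncode < 0:
    # A negative returncode means the child died from a signal.
    print(f"child killed by signal {-proc.returncode}")  # expect 9 (SIGKILL)
else:
    print(f"child exited with code {proc.returncode}")
```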
Maybe it's better to remove the hint "The worker may have exceeded K8s pod memory limits." entirely.
Btw, it's already hard enough to debug a Ray job.
What happened + What you expected to happen
My Ray job often fails with the following events:
All of those events say "Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 1. The worker may have exceeded K8s pod memory limits. The process receives a SIGTERM". It seems those workers failed due to OOM.
But after investigation, I found there is no OOM issue. Here are two pieces of evidence:
1. Logging in to one of the k8s pods and running 'dmesg -T' shows no oom-kill info related to my Ray job/actor/task/worker (a scripted version of this check is sketched after this list).
2. Using the following PromQL query to calculate the memory usage of the k8s pods shows pod memory usage is not too high:
(avg by (pod_name,xhs_zone) (container_memory_working_set_bytes{pod_name=~"redray-search-mmoe-bv3-.*"}) / avg by (pod_name,xhs_zone) (kube_pod_container_resource_limits_memory_bytes{pod_name=~"redray-search-mmoe-bv3-.*"}))

So, I think the Ray job's error message is not correct, but I do not know where the problem is.
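For reference, a scripted version of the dmesg check from evidence (1), sketched under two assumptions: the pod is allowed to read the kernel ring buffer, and the kernel uses the usual "Killed process" wording for oom-kill records.

```python
import subprocess

# Read the kernel ring buffer with human-readable timestamps.
out = subprocess.run(
    ["dmesg", "-T"], capture_output=True, text=True, check=False
).stdout

# Kernel OOM kills log lines like "Out of memory: Killed process <pid> ...".
hits = [line for line in out.splitlines()
        if "oom" in line.lower() or "killed process" in line.lower()]
print("\n".join(hits) if hits else "no oom-kill records found")
```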
Versions / Dependencies
ray 2.9.1
Reproduction script
no
Issue Severity
High: It blocks me from completing my task.