skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

do not redirect stderr to /dev/null when submitting job #4247

Closed cg505 closed 2 weeks ago

cg505 commented 2 weeks ago

Should fix #4199.

Note: we can't use the subprocess log filtering in _handle_io_stream, because the stderr output we see comes from the ray invocation inside the run script itself, not from the subprocess.
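For context, a minimal sketch of the distinction, with hypothetical names (run_and_filter and the 'benign warning' rule are illustrative, not SkyPilot's actual code): subprocess stream filtering can only inspect the stderr of a local subprocess, while a 2> /dev/null inside the generated run script throws output away where it is produced.

import subprocess

def run_and_filter(cmd):
    # Filter a local subprocess's stderr line by line; this is the kind of
    # filtering that _handle_io_stream-style code can apply.
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    for line in proc.stderr:
        if 'benign warning' not in line:  # hypothetical filter rule
            print(line, end='')
    return proc.wait()

# By contrast, if the run script itself ends a command with `2> /dev/null`,
# the ray invocation's stderr is discarded on the remote node before any
# filtering could see it, so its errors never reach the user. That is why
# this PR stops redirecting stderr.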

Tested (run the relevant ones):

Michaelvll commented 2 weeks ago

I am trying this PR with the following test.yaml:

sky launch -c test-mn --num-nodes 20 echo \$SKYPILOT_NODE_RANK --cloud aws --cpus 2
sky exec test-mn --num-nodes 10 test.yaml

resources:
    cpus: 2+

setup: |
    echo "setup"

run: |
    while true; do
        # Create a large array in memory
        perl -e 'for($i=0;$i<10000000;$i++){push(@x,("A" x 4096));}while(1){};'
    done
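For reference, the perl one-liner in the run section above is just a memory hog; a rough Python equivalent (illustrative only, not what the test actually runs) would be:

chunks = []
for _ in range(10_000_000):
    chunks.append("A" * 4096)  # roughly 40 GB of strings if it ever completed
while True:  # then spin forever, like perl's while(1)
    pass

On a 7.6 GB node this drives memory past Ray's 95% usage threshold, so the OOM kill reported below comes from the workload itself.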

Running this test shows the following in the job output:

Traceback (most recent call last):
  File "/home/ubuntu/.sky/sky_app/sky_job_2", line 946, in <module>
    returncodes = get_or_fail(futures, pg)
  File "/home/ubuntu/.sky/sky_app/sky_job_2", line 51, in get_or_fail
    returncodes[idx] = ray.get(ready[0])
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/worker.py", line 2640, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.84.253, ID: e35ddf45fcbb34adae0c87546ca19716a83d39e44a365e0988e2cdf0) where the task (task ID: 697cca582ea6b1cf2d697b6aeb9827872c10b70203000000, name=worker14, rank=6,, pid=1501, memory used=0.08GB) was running was 7.29GB / 7.60GB (0.959476), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 700170959b3a0272852101a6fdae5629cad81b155bccf963b2c5684e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.84.253`. To see the logs of the worker, use `ray logs worker-700170959b3a0272852101a6fdae5629cad81b155bccf963b2c5684e*out -ip 172.31.84.253. Top 10 memory users:
PID     MEM(GB) COMMAND
1805    6.60    perl -e for($i=0;$i<10000000;$i++){push(@x,("A" x 4096));}while(1){};
1501    0.08    ray::worker14, rank=6,
1465    0.07    /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/skypilot-runtime/lib/python3.10/site-packag...
1417    0.04    /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/skypilot-runtime/lib/python3.10/site-packag...
1416    0.04    /home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_s...
1467    0.03    /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/skypilot-runtime/lib/python3.10/site-packag...
1787    0.01    /home/ubuntu/skypilot-runtime/bin/python /home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/...
1776    0.00    /bin/bash -i /tmp/sky_app_gz6pi9p1
1014    0.00    /lib/systemd/systemd --user
1774    0.00    /bin/sh -c /bin/bash -i /tmp/sky_app_gz6pi9p1
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Is this expected? I found that sky queue test-mn no longer works after it; I suppose this means OOM can still cause issues with the system?

cg505 commented 2 weeks ago

@Michaelvll Does this succeed on master? I expect that you will see the same behavior, just without the error message.

I'm not sure why it would cause the issue with sky queue, but I don't think that's caused by this PR.