cg505 closed this 2 weeks ago
I am trying this PR with the following commands and test.yaml:
sky launch -c test-mn --num-nodes 20 echo \$SKYPILOT_NODE_RANK --cloud aws --cpus 2
sky exec test-mn --num-nodes 10 test.yaml
test.yaml:

resources:
  cpus: 2+

setup: |
  echo "setup"

run: |
  while true; do
    # Create a large array in memory
    perl -e 'for($i=0;$i<10000000;$i++){push(@x,("A" x 4096));}while(1){};'
  done
The job output shows:
Traceback (most recent call last):
File "/home/ubuntu/.sky/sky_app/sky_job_2", line 946, in <module>
returncodes = get_or_fail(futures, pg)
File "/home/ubuntu/.sky/sky_app/sky_job_2", line 51, in get_or_fail
returncodes[idx] = ray.get(ready[0])
File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/_private/worker.py", line 2640, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.31.84.253, ID: e35ddf45fcbb34adae0c87546ca19716a83d39e44a365e0988e2cdf0) where the task (task ID: 697cca582ea6b1cf2d697b6aeb9827872c10b70203000000, name=worker14, rank=6,, pid=1501, memory used=0.08GB) was running was 7.29GB / 7.60GB (0.959476), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 700170959b3a0272852101a6fdae5629cad81b155bccf963b2c5684e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.31.84.253`. To see the logs of the worker, use `ray logs worker-700170959b3a0272852101a6fdae5629cad81b155bccf963b2c5684e*out -ip 172.31.84.253. Top 10 memory users:
PID MEM(GB) COMMAND
1805 6.60 perl -e for($i=0;$i<10000000;$i++){push(@x,("A" x 4096));}while(1){};
1501 0.08 ray::worker14, rank=6,
1465 0.07 /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/skypilot-runtime/lib/python3.10/site-packag...
1417 0.04 /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/skypilot-runtime/lib/python3.10/site-packag...
1416 0.04 /home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_s...
1467 0.03 /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/skypilot-runtime/lib/python3.10/site-packag...
1787 0.01 /home/ubuntu/skypilot-runtime/bin/python /home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/...
1776 0.00 /bin/bash -i /tmp/sky_app_gz6pi9p1
1014 0.00 /lib/systemd/systemd --user
1774 0.00 /bin/sh -c /bin/bash -i /tmp/sky_app_gz6pi9p1
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
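For reference, the OutOfMemoryError above surfaces from the ray.get call inside the generated job script's get_or_fail helper. Below is a minimal standalone sketch of that pattern; only ray.get and ray.exceptions.OutOfMemoryError are taken from the traceback, and the task body and names are illustrative.

```python
import ray
from ray.exceptions import OutOfMemoryError

ray.init(address='auto')  # connect to an already-running Ray cluster on the node

@ray.remote(max_retries=0)  # surface the failure instead of retrying the task
def memory_hog():
    # Stand-in for the perl loop in test.yaml: keep allocating 4 KiB blocks
    # until the node-level memory monitor kills this worker process.
    data = []
    while True:
        data.append(bytearray(4096))

try:
    ray.get(memory_hog.remote())
except OutOfMemoryError as e:
    # This is the exception shown in the job output above.
    print(f'Task killed by Ray memory monitor: {e}')
```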
Is this expected? I found that sky queue test-mn no longer works after this, so I suppose OOM can still cause issues with the system?
@Michaelvll Does this succeed on master? I expect that you will see the same behavior, just without the error message.
I'm not sure why it would cause the issue with sky queue, but I don't think it's caused by this PR.
Should fix #4199.
Note: we can't use the subprocess log filtering in _handle_io_stream here, because the stderr output we see comes from the ray invocation within the run script itself (not from the subprocess).
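For illustration only, per-line filtering over a subprocess pipe looks roughly like the sketch below (the function name and patterns are hypothetical, not SkyPilot's actual _handle_io_stream); since the ray traceback is emitted within the run script itself rather than flowing through that pipe, a filter like this never sees it.

```python
import re
from typing import IO, List, Pattern

# Hypothetical patterns; any real filter list would live in SkyPilot itself.
_FILTER_PATTERNS: List[Pattern[str]] = [
    re.compile(r'^ray\.exceptions\.'),
]

def filter_io_stream(stream: IO[str], out: IO[str]) -> None:
    """Copy a subprocess pipe line by line, dropping filtered lines.

    Only output that actually flows through the subprocess pipe reaches
    this function; output printed by the ray invocation in the run script
    itself does not pass through it.
    """
    for line in iter(stream.readline, ''):
        if any(p.search(line) for p in _FILTER_PATTERNS):
            continue  # drop lines matching a filter pattern
        out.write(line)
        out.flush()
```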
Tested (run the relevant ones):
- bash format.sh