petrock99 opened this issue 4 years ago
Hit this issue again today. Someone decided to open a 30GB file on one of the worker nodes and Ray killed itself. raytracer is, of course, not affiliated with Ray.
28779 29.92GiB ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm
Here is the output from ray:
Traceback (most recent call last):
File "path/to/my/python/script", line 471, in
PID    MEM       COMMAND
28779  29.92GiB  ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm
28718  0.06GiB   ray_RemoteBatchNetHelper:train()
28719  0.06GiB   ray_worker
21854  0.04GiB   ristretto /s/bach/c/under/jhgrins/cs410/P5_test/scene1.gif
11835  0.02GiB   /usr/libexec/sssd/sssd_kcm --uid 0 --gid 0 --logger=files
13314  0.01GiB   /opt/google/chrome-beta/chrome --type=renderer --disable-webrtc-apm-in-audio-service --field-trial-h
14493  0.01GiB   path/to/my/home/.local/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_
14494  0.01GiB   /usr/local/anaconda-2019.07/bin/python -u path/to/my/home/.local/lib/python3.7/site-package
12924  0.01GiB   /opt/google/chrome-beta/chrome
12967  0.0GiB    /opt/google/chrome-beta/chrome --type=utility --field-trial-handle=5488568897404811202,1067882724045
In addition, up to 0.03 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray, and the max Redis size with redis_max_memory. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
2019-12-12 16:03:37,066 INFO node_provider.py:41 -- ClusterState: Loaded cluster state: [list of nodes]
2019-12-12 16:03:37,067 INFO commands.py:110 -- teardown_cluster: Shutting down 13 nodes...
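For reference, the limits that the error message mentions can be passed to ray.init() when Ray is started. This is only a sketch for a single-node session on Ray 0.7.x with illustrative values, and it obviously doesn't help when another user's process is the one eating the memory:

import ray

# Cap how much of the node Ray claims for itself instead of letting it
# assume all system memory is available. Values are in bytes and illustrative.
ray.init(
    object_store_memory=2 * 1024**3,  # shared-memory object store cap (~2 GiB)
    redis_max_memory=1 * 1024**3,     # Redis cap (~1 GiB)
)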
One thing that can help here, if you know for sure your workload doesn't use a lot of memory, is to set RAY_DEBUG_DISABLE_MEMORY_MONITOR=1. This will disable memory checking entirely. However, note that this can lead to very confusing error messages if you do run into real memory contention, since Ray cannot always return good error messages when it genuinely runs out of memory.
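For what it's worth, one way to set that flag is shown below. This is a sketch: the variable has to be in the environment before Ray starts, so on a multi-node cluster you would export it before running ray start on every node rather than only setting it in the driver.

import os

# Disable Ray's memory monitor entirely. Must be set before Ray starts;
# on a cluster, export it in the environment of `ray start` on each node.
os.environ["RAY_DEBUG_DISABLE_MEMORY_MONITOR"] = "1"

import ray

ray.init()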
Setting RAY_DEBUG_DISABLE_MEMORY_MONITOR=1 is a little like turning up the radio in the car when you have some engine noise you want to forget about.
I think this would be a very nice feature to make Ray more robust. That's a long way of me adding a "thumbs up" to the issue.
What is the problem?
I'm using Ray 0.7.6 + Python 3.7.3 with 45 Linux machines on my university network as a cluster. All students in my department have access to these machines and use them frequently. If one of these students does something that consumes more than 95% of the available memory on any of the nodes (e.g. opens a 13GB file), Ray throws up its hands and quits. This is quite frustrating, especially if I'm 4 hours into an 8-hour run. It would be nice if Ray handled memory pressure a little more gracefully, especially when that memory pressure is being caused by another user I have no control over. A couple of ideas:
Reproduction
Expected: Ray process should continue without issues, albeit a little slower.
Actual: Ray process tears itself down and quits.
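A rough sketch of the scenario on a single node (task count, sizes, and the sleep are illustrative; the point is that the Ray job itself stays small while an unrelated process pushes the node past ~95% memory usage):

import time

import ray

ray.init()

@ray.remote
def slow_task(i):
    # Stand-in for real work; uses almost no memory itself.
    time.sleep(60)
    return i

futures = [slow_task.remote(i) for i in range(100)]

# While the tasks run, simulate the "other user" in a separate shell on the
# same node, e.g. hold ~13 GB resident until Enter is pressed:
#   python -c "x = bytearray(13 * 1024**3); input()"
# Once node memory usage crosses ~95%, the memory monitor kills the run
# even though the Ray tasks themselves use almost nothing.

print(ray.get(futures))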