ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[ray] Handle memory pressure more gracefully #6458

Open petrock99 opened 4 years ago

petrock99 commented 4 years ago

What is the problem?

I'm using Ray 0.7.6 + Python 3.7.3 with 45 Linux machines on my university network as a cluster. All students in my department have access to these machines and use them frequently. If one of those students does something that consumes more than 95% of the available memory on any of the nodes (e.g. opens a 13GB file), Ray throws up its hands and quits. This is quite frustrating, especially if I'm 4 hours into an 8 hour run. It would be nice if Ray handled memory pressure a little more gracefully, especially when that pressure is caused by another user I have no control over. A couple of ideas:

Reproduction

Expected: The Ray process continues without issues, albeit a little slower.
Actual: The Ray process tears itself down and quits.
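For concreteness, here is a rough sketch of the scenario (added for illustration, not taken from the original report; Ray 0.7-era API, with placeholder sizes and timings):

    import time
    import ray

    ray.init()

    @ray.remote
    def long_running_task():
        # Stand-in for a multi-hour training job; it uses very little memory itself.
        for _ in range(600):
            time.sleep(1)
        return "done"

    future = long_running_task.remote()

    # Meanwhile, an unrelated process on the same node (e.g. another user's job)
    # allocates most of the remaining RAM, pushing node usage past 95%, roughly:
    #     hog = bytearray(30 * 1024**3)  # ~30 GB, similar to the raytracer case below
    #
    # Expected: ray.get() eventually returns "done", just more slowly.
    # Actual (as reported in this thread): ray.get() raises
    # ray.exceptions.RayTaskError(RayOutOfMemoryError) and the cluster is torn down.
    result = ray.get(future)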

petrock99 commented 4 years ago

Hit this issue again today. Someone decided to open a 30GB file on one of the worker nodes and Ray killed itself. (raytracer is, of course, not affiliated with Ray.)

28779 29.92GiB ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm

Here is the output from ray:

Traceback (most recent call last):
  File "path/to/my/python/script", line 471, in
    nnet.train(train_loader, train_dataset.max_len(), n_epochs, learning_rate)
  File "path/to/my/python/script_", line 257, in train
    running_state_dict, n_batches, progress, bar, verbose)
  File "path/to/my/python/script_", line 185, in wait_for_training_data
    training_data = ray.get(training_id)
  File "path/to/my/home/.local/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray_RemoteBatchNetHelper:train() (pid=28718, host=jaguar)
  File "path/to/my/home/.local/lib/python3.7/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node jaguar is used (31.24 / 31.27 GB). The top 10 memory consumers are:

PID    MEM       COMMAND
28779  29.92GiB  ./raytracer HolidayScene/1002.txt HolidayScene/1002.ppm
28718  0.06GiB   ray_RemoteBatchNetHelper:train()
28719  0.06GiB   ray_worker
21854  0.04GiB   ristretto /s/bach/c/under/jhgrins/cs410/P5_test/scene1.gif
11835  0.02GiB   /usr/libexec/sssd/sssd_kcm --uid 0 --gid 0 --logger=files
13314  0.01GiB   /opt/google/chrome-beta/chrome --type=renderer --disable-webrtc-apm-in-audio-service --field-trial-h
14493  0.01GiB   path/to/my/home/.local/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_
14494  0.01GiB   /usr/local/anaconda-2019.07/bin/python -u path/to/my/home/.local/lib/python3.7/site-package
12924  0.01GiB   /opt/google/chrome-beta/chrome
12967  0.0GiB    /opt/google/chrome-beta/chrome --type=utility --field-trial-handle=5488568897404811202,1067882724045

In addition, up to 0.03 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray, and the max Redis size with redis_max_memory. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.

2019-12-12 16:03:37,066 INFO node_provider.py:41 -- ClusterState: Loaded cluster state: [list of nodes]
2019-12-12 16:03:37,067 INFO commands.py:110 -- teardown_cluster: Shutting down 13 nodes...
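For reference, the limits mentioned in that message are passed when Ray is started. A minimal sketch of doing that from a driver (added for illustration; the byte values are placeholders, not recommendations, and on a multi-node cluster the corresponding ray start options would be used on each node instead):

    import ray

    # Cap Ray's own memory usage so it leaves headroom on a ~31 GB node
    # shared with other users' processes. Values are in bytes (placeholders).
    ray.init(
        object_store_memory=2 * 1024**3,  # shared-memory object store (~2 GB)
        redis_max_memory=1 * 1024**3,     # Redis / metadata cap (~1 GB)
    )

As the rest of the thread points out, though, this only limits Ray's own footprint; it does nothing about an unrelated process consuming the rest of the node's RAM.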

ericl commented 4 years ago

One thing that can help here, if you know for sure your workload doesn't use a lot of memory, is to set RAY_DEBUG_DISABLE_MEMORY_MONITOR=1. This disables memory checking entirely. However, note that this can lead to very confusing error messages if you do run into real memory contention, since Ray cannot always return good error messages when it truly runs out of memory.
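For anyone trying this, a sketch of one way to apply it (an assumption on my part: the variable has to be visible to the Ray processes that run the memory check, so on a multi-node cluster it would be exported before ray start on each node rather than set only in the driver):

    import os
    import ray

    # Assumption: this must be in the environment before Ray's processes start,
    # so set it before ray.init() launches the local raylet and workers.
    os.environ["RAY_DEBUG_DISABLE_MEMORY_MONITOR"] = "1"

    ray.init()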

virtualluke commented 4 years ago

Setting RAY_DEBUG_DISABLE_MEMORY_MONITOR=1 is a little like turning up the radio in the car when you have some engine noise you want to forget about.

I think this would be a very nice feature to make Ray more robust. That's a long way of me adding a "thumbs up" to the issue.