ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[ray] ray on slurm not respecting memory limits #6968

Open ocorcoll opened 4 years ago

ocorcoll commented 4 years ago

What is the problem?

When running a Ray script under Slurm (single node), Ray does not seem to respect the memory limits specified in ray.init. As I understand it, the script below should fail with some memory limit error from Ray, but instead it is the Slurm job that fails. I reserve 40 GB from the cluster and limit Ray to 1 GB for workers and 1 GB for the object store, yet the script consumes the full 40 GB (surprisingly, in about 30 minutes).

Ray version: 0.8.0

Reproduction (REQUIRED)

import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    # Limit Ray to 1 GB for workers and 1 GB for the object store.
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    while True:
        # Submit remote calls as fast as possible, never retrieving results.
        storage.add.remote(5)

and the Slurm job script:

#!/bin/bash

#SBATCH -J jobname
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task 8
#SBATCH --mem=40GB
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=120:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1

module load gcc-7.1.0
module load cuda/10.0
module load cudnn/7.6.3/cuda-10.0
srun $@

Run command:

sbatch slurm_run_ray.sh python ray_slurm_test.py

Running the previous script produces the following Slurm error:

2020-01-30 00:59:03,618 INFO resource_spec.py:216 -- Starting Ray with 0.93 GiB memory available for workers and up to 0.93 GiB for objects. You can adjust these settings with  ray.init(memory=<bytes>, object_store_memory=<bytes>).

slurmstepd: error: Step 7580752.0 exceeded memory limit (42349772 > 41943040), being killed
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7580752.0 ON falcon1 CANCELLED AT 2020-01-30T01:33:31 ***
srun: error: falcon1: task 0: Killed
ericl commented 4 years ago

Ray doesn't actually enforce limits on memory -- that is only a hint for scheduling purposes (and only if you specify memory requests in task decorators). So it is up to your application to not exceed its memory.
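For illustration, a memory request attached in a decorator might look like the sketch below (not from the thread; the 500 MiB figure is arbitrary). As described above, in these Ray versions the value is only a placement hint, not an enforced cap.

import ray

# Hedged sketch: `memory` is a resource request used only for scheduling
# decisions; Ray does not kill the task if it allocates more than this.
@ray.remote(memory=500 * 1024 * 1024)  # request ~500 MiB for placement
def load_chunk():
    # May allocate far more than the requested amount without any error.
    return bytearray(10 ** 9)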

richardliaw commented 4 years ago

@ocorcoll one thing that we had to do with YARN was actually also set ray.init(driver_object_store_memory=DRIVER_MEMORY...).

ocorcoll commented 4 years ago

Thanks @richardliaw, that together with setting memory per actor via .options(num_cpus=1, memory=int(1e9)) seems to avoid the memory issues. Since Ray does not limit the amount of memory consumed, I wonder why setting this attribute makes the issue go away... Is this attribute also used to evict objects from the store?
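For concreteness, the workaround described above might look roughly like the sketch below, using the Ray 0.8-era keyword names; the byte values are illustrative and the Store actor mirrors the one from the original reproduction.

import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = []

    def add(self, item):
        self.storage.append(item)

# Hedged sketch: reserve driver object-store memory at init and attach
# per-actor resource requests via .options().
ray.init(
    memory=int(1e9),
    object_store_memory=int(1e9),
    driver_object_store_memory=int(2e8),
)
storage = Store.options(num_cpus=1, memory=int(1e9)).remote()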

richardliaw commented 4 years ago

yes, I think so - see https://ray.readthedocs.io/en/latest/memory-management.html

ocorcoll commented 4 years ago

Those changes helped to improve memory consumption, but I still see the same issue: Ray keeps consuming memory until it reaches the cluster limit. The closest example I found to my situation is the following:

import ray
import time
import numpy as np

@ray.remote
class Store:

    def __init__(self):
        self.current_val = 0

    def set_val(self, val):
        self.current_val = val

    def get_val(self):
        return self.current_val

@ray.remote
class Worker:

    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            val = np.zeros((100, 100))
            self.store.set_val.remote(val)
            time.sleep(.1)

@ray.remote
class Reader:

    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            ray.get(self.store.get_val.remote())
            time.sleep(.1)

if __name__ == '__main__':
    # Process-level limits at init plus per-actor requests via .options().
    ray.init(memory=int(.8 * 1e9), num_cpus=2, object_store_memory=int(1.2 * 1e9), driver_object_store_memory=int(.2 * 1e9))
    store = Store.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote()
    worker = Worker.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    reader = Reader.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    # Both run() calls loop forever, so ray.wait blocks while memory usage grows.
    ray.wait([worker.run.remote(), reader.run.remote()])

Ray does not seem to evict any entries and memory keeps growing; I had to stop it at 1.5 GB. It seems that the actor's object_store_memory limit is not used for eviction. Does Ray start evicting based on the overall machine memory or based on the object_store_memory limit passed in the ray.init call? What would be the expected behavior here? Is there a way to output the number of evictions/objects in the store?

cread commented 3 years ago

I'm not able to reproduce this problem using ray 1.4.0 (with the kwargs for init updated to match their new names).

Does that mean it's been fixed somewhere?

mirekphd commented 2 years ago

Yes, one sees it clearly when the container running the Ray server is run separately from the (Python) client app container (e.g. a Jupyter Notebook server), which allows us to distinguish the client's and the Ray server's memory usage...

This behavior should be better documented, I think. This issue would probably be unnecessary if memory were instead called memory_request (and --memory-request in the CLI). Currently all other resources [1] act as hard limits, while memory / --memory is definitely not a hard limit... hence the confusion, IMO.

_[1] i.e. object store memory and CPU / GPU numbers (provisioned with object_store_memory / --object-store-memory, num_cpus / --num-cpus and num_gpus / --num-gpus)_

> Ray doesn't actually enforce limits on memory -- that is only a hint for scheduling purposes (and only if you specify memory requests in task decorators). So it is up to your application to not exceed its memory.
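To make the "request, not limit" distinction concrete, here is a hedged sketch (not from the thread) of the behavior being described: an actor that requests a small amount of memory but then allocates far more is still scheduled and, in the Ray versions discussed here, is not killed for exceeding its request.

import numpy as np
import ray

ray.init(num_cpus=1)

# `memory` below is only a scheduling request, not an enforced limit.
@ray.remote(memory=100 * 1024 * 1024)  # ask the scheduler for ~100 MiB
class Hog:
    def grow(self):
        # Allocate roughly 800 MB, far beyond the request.
        self.buf = np.ones(800 * 1024 * 1024, dtype=np.uint8)
        return self.buf.nbytes

hog = Hog.remote()
print(ray.get(hog.grow.remote()))  # succeeds; prints 838860800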

mirekphd commented 2 years ago

Currently the only workaround I have found that effectively limits Ray's total memory usage, i.e. makes the memory limit reported in the Ray Dashboard match the one reported by the ray status CLI command (the latter being merely a burstable request, not a hard limit), is to impose a hard memory limit on the Ray Core container through Kubernetes (or OpenShift/OKD): set the spec.containers[0].resources.limits.memory key in your Deployment[Config] or Pod to the value you had hoped ray init --memory alone would achieve, e.g.:

spec:
  containers:
  - name: memory-demo
    image: my-ray-core:latest
    resources:
      limits:
        memory: "32Gi"   # hard limit enforced by Kubernetes, not by Ray