ocorcoll opened this issue 4 years ago (status: Open)
Ray doesn't actually enforce limits on memory -- that is only a hint for scheduling purposes (and only if you specify memory requests in task decorators). So it is up to your application to not exceed its memory.
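For concreteness, a minimal sketch of what such a memory request in a task decorator looks like (task name and sizes here are arbitrary examples, not from this issue):

```python
import ray

ray.init()

# Ask the scheduler to reserve ~500MB for each invocation of this task.
# Ray uses this figure only for placement; it does not kill the task if it
# allocates more than that at runtime.
@ray.remote(memory=500 * 1024 * 1024)
def process_chunk(chunk):
    return sum(chunk)

print(ray.get(process_chunk.remote(range(1000))))
```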
@ocorcoll one thing that we had to do with YARN was actually also set ray.init(driver_object_store_memory=DRIVER_MEMORY...).
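For reference, a sketch of that call; DRIVER_MEMORY is just a placeholder byte count here, not a value taken from this issue:

```python
import ray

DRIVER_MEMORY = int(1e9)  # placeholder: cap the driver's object-store usage at ~1GB

# driver_object_store_memory limits how much of the plasma store the driver
# itself may fill (e.g. via ray.put), separate from the overall store size.
ray.init(
    object_store_memory=int(2e9),
    driver_object_store_memory=DRIVER_MEMORY,
)
```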
Thanks @richardliaw, that together with setting memory per actor with .options(num_cpus=1,memory=int(1e9)) seems to avoid the memory issues. Since Ray does not limit the amount of memory consumed, I wonder why setting this attribute makes the issue go away... Is this attribute also used to evict objects from the store?
yes, I think so - see https://ray.readthedocs.io/en/latest/memory-management.html
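As an aside, a runnable sketch of the per-actor memory request mentioned above (the Counter actor here is made up for illustration):

```python
import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total

# Per-actor resource request: 1 CPU and ~1GB of memory. As noted above, the
# memory figure is a scheduling hint/reservation, not a hard runtime cap.
counter = Counter.options(num_cpus=1, memory=int(1e9)).remote()
print(ray.get(counter.add.remote(5)))
```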
Those changes helped reduce memory consumption, but I still see the same issue: Ray keeps consuming memory until it reaches the cluster limit. The closest example I found to my situation is the following:
import ray
import time
import numpy as np


@ray.remote
class Store:
    def __init__(self):
        self.current_val = 0

    def set_val(self, val):
        self.current_val = val

    def get_val(self):
        return self.current_val


@ray.remote
class Worker:
    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            val = np.zeros((100, 100))
            self.store.set_val.remote(val)
            time.sleep(.1)


@ray.remote
class Reader:
    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            ray.get(self.store.get_val.remote())
            time.sleep(.1)


if __name__ == '__main__':
    ray.init(memory=int(.8 * 1e9), num_cpus=2, object_store_memory=int(1.2 * 1e9), driver_object_store_memory=int(.2 * 1e9))
    store = Store.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote()
    worker = Worker.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    reader = Reader.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    ray.wait([worker.run.remote(), reader.run.remote()])
Ray seems not to evict any entries and memory keeps growing; I had to stop at 1.5GB. It seems the actor's object_store_memory limit is not used for eviction. Does Ray use the overall machine memory to decide when to start evicting, or the object_store_memory limit from the ray.init call? What would be the expected behavior here? Is there a way to output the number of evictions/objects in the store?
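Not an authoritative answer to that last question, but a sketch of how one might at least inspect object-store pressure, assuming a Ray version that reports object_store_memory through the cluster resource APIs:

```python
import ray

ray.init(object_store_memory=int(1e9))

# Declared vs. currently unclaimed resources for the whole cluster.
# If the running Ray version tracks object_store_memory as a resource it
# shows up in both dicts (the units differ across versions, so compare the
# two values rather than reading them as exact byte counts).
print(ray.cluster_resources())
print(ray.available_resources())
```

In newer releases the ray memory CLI command (run on the head node) also prints a per-object breakdown of what is currently held in the object store, which is the closest thing I know of to an object/eviction count.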
I'm not able to reproduce this problem using ray 1.4.0 (with the kwargs for init updated to match their new names).
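For anyone else trying this, my guess at the renamed init kwargs in Ray 1.4 (the underscore-prefixed arguments are private/experimental there, so please double-check against the 1.4 docs):

```python
import ray

# Assumed mapping of the old 0.8-era kwargs to Ray 1.4 names:
#   memory                      -> _memory
#   driver_object_store_memory  -> _driver_object_store_memory
#   object_store_memory         -> object_store_memory (unchanged)
ray.init(
    _memory=int(.8 * 1e9),
    num_cpus=2,
    object_store_memory=int(1.2 * 1e9),
    _driver_object_store_memory=int(.2 * 1e9),
)
```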
Does that mean it's been fixed somewhere?
Yes, one sees it clearly when the container running the Ray server is separate from the (Python) client application container (e.g. a Jupyter Notebook server), which makes it possible to distinguish client and Ray server memory usage...
This behavior should be better documented, I think. This issue would probably be unnecessary if memory were instead called memory_request (and --memory-request in the CLI). Currently all other resources [1] act as hard limits, while memory / --memory is definitely not a hard limit... hence the confusion IMO.

_[1] i.e. object store memory and CPU / GPU counts (provisioned with object_store_memory / --object-store-memory, num_cpus / --num-cpus and num_gpus / --num-gpus)_
> Ray doesn't actually enforce limits on memory -- that is only a hint for scheduling purposes (and only if you specify memory requests in task decorators). So it is up to your application to not exceed its memory.
Currently the only workaround I have found that effectively limits Ray's total memory usage (i.e. makes the memory limit reported in the Ray Dashboard match the one reported by the ray status CLI command, the latter being merely a burstable request, not a hard limit) is to impose a hard memory limit on the Ray Core container using the Kubernetes (or OpenShift/OKD) spec.containers[0].resources.limits.memory key. Set it in your Deployment[Config] or Pod to the desired value (the value you had hoped ray init --memory alone would enforce), e.g.:
spec:
  containers:
  - name: memory-demo
    image: my-ray-core:latest
    resources:
      limits:
        memory: "32Gi"
What is the problem?
When running a Ray script on Slurm (single node), it seems that Ray is not respecting the memory limits specified in ray.init. As I understand it, the script below should fail with some memory limit error from Ray, but instead it is the cluster job that fails. I book 40GB from the cluster and limit Ray to 1GB of memory for workers and 1GB for the object store, yet it consumes the full 40GB (surprisingly, within 30 minutes).
Ray version: 0.8.0
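As a minimal sketch of the limits described above (assuming Ray 0.8.0's ray.init keyword names; the full reproduction follows below):

```python
import ray

# Roughly the limits described above: 1GB of reservable memory for workers/tasks
# (a scheduling hint, per the discussion in this thread) and 1GB for the plasma
# object store. The Slurm allocation itself was 40GB.
ray.init(memory=int(1e9), object_store_memory=int(1e9))
```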
Reproduction (REQUIRED)
and the Slurm job script:
Run command:
Running the previous script produces the following Slurm error: