ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] Memory leak when using local simulated cluster (long_running_tests/workloads/apex.py) #15305

Open sven1977 opened 3 years ago

sven1977 commented 3 years ago

Latest Ray master, Python 3.7, Linux, p3.2xlarge

I'm observing a memory leak when running ray/release/long_running_tests/workloads/apex.py, but only when it runs on a ray.cluster_utils.Cluster with 3 nodes:

Memory usage on this node: 7.4/59.9 GiB
Memory usage on this node: 7.8/59.9 GiB
Memory usage on this node: 8.1/59.9 GiB
Memory usage on this node: 8.2/59.9 GiB
Memory usage on this node: 8.3/59.9 GiB
Memory usage on this node: 8.4/59.9 GiB
Memory usage on this node: 8.5/59.9 GiB
Memory usage on this node: 8.5/59.9 GiB
Memory usage on this node: 8.6/59.9 GiB
Memory usage on this node: 8.7/59.9 GiB
Memory usage on this node: 8.8/59.9 GiB
Memory usage on this node: 8.8/59.9 GiB
Memory usage on this node: 8.9/59.9 GiB
Memory usage on this node: 9.0/59.9 GiB
Memory usage on this node: 9.0/59.9 GiB
Memory usage on this node: 9.0/59.9 GiB
Memory usage on this node: 9.1/59.9 GiB
Memory usage on this node: 9.1/59.9 GiB
Memory usage on this node: 9.2/59.9 GiB
Memory usage on this node: 9.2/59.9 GiB
Memory usage on this node: 9.3/59.9 GiB
Memory usage on this node: 9.3/59.9 GiB
Memory usage on this node: 9.4/59.9 GiB
Memory usage on this node: 9.5/59.9 GiB
Memory usage on this node: 9.6/59.9 GiB
Memory usage on this node: 9.7/59.9 GiB
Memory usage on this node: 9.7/59.9 GiB
Memory usage on this node: 9.8/59.9 GiB
Memory usage on this node: 9.9/59.9 GiB
Memory usage on this node: 10.0/59.9 GiB

This is NOT due to the replay buffer of APEX growing (it is capped at only 10k and is already full while the leak continues).

The leak does not occur when I make the following change inside ray/release/long_running_tests/workloads/apex.py:

#cluster = Cluster()
#for i in range(num_nodes):
#    cluster.add_node(
#        redis_port=6379 if i == 0 else None,
#        num_redis_shards=num_redis_shards if i == 0 else None,
#        num_cpus=20,
#        num_gpus=0,
#        resources={str(i): 2},
#        object_store_memory=object_store_memory,
#        redis_max_memory=redis_max_memory,
#        dashboard_host="0.0.0.0")
#ray.init(address=cluster.address)
ray.init() # <- this works fine; the above leaks
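
For reference, here is the leaking variant spelled out as a standalone sketch; the constants below (num_nodes, shard count, memory sizes) are illustrative placeholders, not the exact values apex.py computes:

import ray
from ray.cluster_utils import Cluster

num_nodes = 3                  # the 3-node setup described above
num_redis_shards = 1           # placeholder
object_store_memory = 10**9    # placeholder: 1 GB per node
redis_max_memory = 10**8       # placeholder: 100 MB

cluster = Cluster()
for i in range(num_nodes):
    cluster.add_node(
        redis_port=6379 if i == 0 else None,
        num_redis_shards=num_redis_shards if i == 0 else None,
        num_cpus=20,
        num_gpus=0,
        resources={str(i): 2},
        object_store_memory=object_store_memory,
        redis_max_memory=redis_max_memory,
        dashboard_host="0.0.0.0")
ray.init(address=cluster.address)  # <- this path leaks; plain ray.init() does not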

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
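
A minimal, dependency-free sketch in that spirit (not the exact repro; the dummy task and node parameters are placeholders, and psutil is only used to print node memory):

import psutil
import ray
from ray.cluster_utils import Cluster

# Simulated 3-node cluster, mirroring the setup above with placeholder sizes.
cluster = Cluster()
for i in range(3):
    cluster.add_node(num_cpus=4, num_gpus=0, resources={str(i): 2})
ray.init(address=cluster.address)

@ray.remote
def noop(payload):
    return payload

# Drive trivial tasks through the cluster and watch node memory over time.
for step in range(1000):
    ray.get([noop.remote(b"x" * 1024) for _ in range(100)])
    if step % 100 == 0:
        used_gb = psutil.virtual_memory().used / 1e9
        print(f"step {step}: node memory used ~{used_gb:.1f} GB")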

wuisawesome commented 3 years ago

Do you know what quantity "Memory usage on this node: 9.6/59.9 GiB" refers to? Is it ray.available_resources()["memory"]?
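
For reference, a quick sketch of reading that quantity directly from Ray (assuming we attach to the already-running cluster; this reports Ray's logical resources, which may differ from the OS-level usage printed above):

import ray

ray.init(address="auto")  # attach to the running cluster (assumption)
resources = ray.available_resources()
print("memory resource:", resources.get("memory"))
print("object_store_memory resource:", resources.get("object_store_memory"))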

wuisawesome commented 3 years ago

@sven1977 can you run this on a real cluster and verify if this is a cluster utils issue?

sven1977 commented 3 years ago

Actually, I'm not sure. The memory usage printed comes from the tune.run() output. @krfricke @richardliaw ?

sven1977 commented 3 years ago

> can you run this on a real cluster and verify if this is a cluster utils issue?

Will do! Thanks for looking into this @wuisawesome !

stephanie-wang commented 3 years ago

@amogkam FYI I'm removing the release blocker tag because @sven1977 confirmed it's working on a physical cluster. Seems to be a problem only for Ray developers.