sven1977 opened this issue 3 years ago
Do you know what quantity "Memory usage on this node: 9.6/59.9 GiB" is referring to? Is it ray.available_resources()["memory"]?
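For context, the logical "memory" resource Ray tracks is queryable from Python, whereas a per-node "used/total" line like the one above is typically derived from OS-level statistics (on Linux, the same numbers exposed in /proc/meminfo). The following is a stdlib-only sketch of how such a used/total figure can be computed, assuming a Linux host; it is an illustration of the OS-level quantity, not Ray's actual monitoring code:

```python
# Sketch: compute a "used / total" node-memory figure the way an
# OS-level monitor would, from /proc/meminfo (Linux only; stdlib only).
def node_memory_gib():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key.strip()] = int(value.split()[0])  # values are in kiB
    total_kib = fields["MemTotal"]
    # "used" here means total minus what the kernel reports as available
    used_kib = total_kib - fields["MemAvailable"]
    kib_per_gib = 1024 ** 2
    return used_kib / kib_per_gib, total_kib / kib_per_gib

used, total = node_memory_gib()
print(f"Memory usage on this node: {used:.1f}/{total:.1f} GiB")
```

If the two numbers disagree, that is a hint the status line is reporting physical node memory rather than the logical resource.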
@sven1977 can you run this on a real cluster and verify if this is a cluster utils issue?
Actually, I'm not sure. The memory usage printed comes from the tune.run output. @krfricke @richardliaw ?
> can you run this on a real cluster and verify if this is a cluster utils issue?
Will do! Thanks for looking into this @wuisawesome !
@amogkam FYI I'm removing the release blocker tag because @sven1977 confirmed it's working on a physical cluster. Seems to be a problem only for Ray developers.
I'm observing a memory leak when running ray/release/long_running_tests/workloads/apex.py, but only when it is run on a ray.cluster_utils.Cluster with 3 nodes.
This is NOT due to the replay buffer of APEX growing: it is capped at only 10k entries and is already full while the leak continues.
The leak does not occur when I change the following inside ray/release/long_running_tests/workloads/apex.py:
Ray version and other system information (Python version, TensorFlow version, OS): latest ray master, Py3.7, Linux, p3.2xlarge
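The point about the capped replay buffer is worth illustrating: once a bounded buffer is full, each new append evicts the oldest entry, so the buffer's size (and memory footprint) plateaus and cannot account for unbounded growth. A minimal stdlib sketch of this behavior (not APEX's actual buffer implementation):

```python
import random
from collections import deque

# Minimal bounded replay buffer: once full, each append evicts the
# oldest entry, so its length (and memory footprint) plateaus.
class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, item):
        self.storage.append(item)

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)

buf = ReplayBuffer(capacity=10_000)
for step in range(50_000):  # add far more items than the capacity
    buf.add((step, step * 0.5))
print(len(buf))  # stays at 10_000 no matter how many items were added
```

So if memory keeps climbing while the buffer reports a constant length, the leak has to come from somewhere else in the process.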
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
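When trimming a repro down to dependency-free code, the stdlib tracemalloc module is a convenient way to show whether Python-level allocations are actually growing between iterations. A generic sketch of that pattern (the run_iteration function and its deliberate leak are stand-ins, not part of the APEX workload):

```python
import tracemalloc

# Generic stdlib-only leak-repro pattern: snapshot allocations before
# and after N iterations and report the biggest growth sites.
def run_iteration(sink):
    # Stand-in for one training step; a real repro would call the
    # workload here. This one leaks on purpose by growing `sink`.
    sink.append(bytearray(10_000))

tracemalloc.start()
leaky = []
before = tracemalloc.take_snapshot()
for _ in range(100):
    run_iteration(leaky)
after = tracemalloc.take_snapshot()

stats = after.compare_to(before, "lineno")
for stat in stats[:3]:  # top 3 lines by allocation growth
    print(stat)
```

A genuine leak shows up as a size_diff that keeps growing with the iteration count; note that tracemalloc only sees Python-level allocations, so leaks in native extensions need OS-level tools instead.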