ray-project / ray-legacy

An experimental distributed execution engine
BSD 3-Clause "New" or "Revised" License

Memory on remote node increases #448

Closed (Kuurusch closed this issue 4 years ago)

Kuurusch commented 4 years ago

Hello,

I have successfully set up a Ray 0.8.5 cluster with two machines, and I use it to compute different models in parallel. I do this for several hundred different settings of my model in a Python for-loop, and in each iteration I initialize Ray and shut it down again to clear the memory. This procedure worked perfectly on the local machine, but on the remote node memory usage keeps increasing each time I re-initialize Ray. How can I clear the memory on the remote node from the machine that runs the script?
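For reference, the loop described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code and not a fix for the memory growth: `run_sweep`, `run_sweep_with_ray`, and the placeholder `model` function are hypothetical names, and the `address` value stands in for the head node address that the thread leaves unspecified. The Ray import is done lazily so the generic helper can be exercised without a cluster.

```python
from typing import Any, Callable, Iterable, List


def run_sweep(
    settings: Iterable[Any],
    start: Callable[[], Any],
    compute: Callable[[Any], Any],
    stop: Callable[[], Any],
) -> List[Any]:
    """Run compute(setting) once per setting, bracketed by start()/stop()."""
    results = []
    for setting in settings:
        start()
        try:
            results.append(compute(setting))
        finally:
            # Always disconnect, even if the computation raised.
            stop()
    return results


def run_sweep_with_ray(settings: Iterable[Any], address: str) -> List[Any]:
    """Hypothetical wiring of run_sweep to Ray (assumes Ray is installed)."""
    import ray  # lazy import: run_sweep itself does not need Ray

    def model(setting):
        # Placeholder for the real model computation.
        return setting

    def compute(setting):
        # Wrap the function after ray.init() has run for this iteration.
        remote_model = ray.remote(model)
        return ray.get(remote_model.remote(setting))

    return run_sweep(
        settings,
        start=lambda: ray.init(address=address),
        compute=compute,
        stop=ray.shutdown,
    )
```

The `try`/`finally` only guarantees that the driver disconnects after every iteration; whether that releases memory on the remote node is exactly what this issue is asking about.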

Both machines run Ubuntu 18.04 LTS.

Many thanks for your help!

Kuurusch commented 4 years ago

@robertnishihara I would expect shutdown() to behave the same as on the local machine, but that does not seem to be the case. Why? Is this a bug?

So this is what works on the local machine but not on a cluster:

ray.init() ... ray.shutdown()

robertnishihara commented 4 years ago

How are you starting Ray? If you do ray.init() it only starts it on one machine. Are you running ray start ... on the different machines to start a Ray cluster?

Kuurusch commented 4 years ago

@robertnishihara Yes, I start Ray on each machine with ray start ... to create a cluster. Afterwards I start calculating my models with ray.init(ip-of-cluster-head), and it distributes the workers over the cluster. The problem with this approach is that I can't clear the memory on the remote nodes of the cluster after each run with ray.shutdown().

So when I then start a new calculation with ray.init(ip-of-cluster-head), memory usage increases until there is no memory left on the remote nodes.
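One way to put numbers on this growth is to record the available memory on the remote node between runs. The sketch below is a stdlib-only helper under a few assumptions: it is run on the remote node itself (for example over SSH), the node is Linux with a `/proc/meminfo` file that has a `MemAvailable` field, and the names `parse_meminfo` and `available_mib` are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_mib(path: str = "/proc/meminfo") -> float:
    """Return MemAvailable in MiB, read from a Linux meminfo file."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"] / 1024.0
```

Logging `available_mib()` once per iteration would show whether the available memory drops by a similar amount after every re-initialization.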

I would assume that the behavior is the same whether I run Ray only locally or on a cluster; I thought that was the idea of Ray.

The reason for the increasing memory is that the processes of the already finished workers are not closed in Ray, and the only way to achieve this was to call ray.shutdown() and re-initialize with ray.init().

By finished workers I mean those for which ray.get(calculateWraper.remote(...)) has returned the result and which are no longer needed in the current iteration.
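To check whether worker processes really stay alive on the remote node after a run, one could list the processes whose command line mentions Ray. This is a rough Linux-only sketch using the `/proc` filesystem; `matching_processes` is a hypothetical helper, and the search string (for example `"ray"`) is an assumption about how the worker processes are named on a given installation.

```python
import os
from typing import List, Tuple


def matching_processes(
    needle: str, proc_root: str = "/proc"
) -> List[Tuple[int, str]]:
    """Return (pid, command line) for processes whose command line contains needle."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if needle in cmdline:
            found.append((int(entry), cmdline))
    return sorted(found)
```

Comparing `len(matching_processes("ray"))` on the remote node before and after each iteration would show whether the number of surviving processes grows along with the memory usage.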

istoica commented 4 years ago

Adding @stephanie-wang and @ericl.

robertnishihara commented 4 years ago

Migrating this to https://github.com/ray-project/ray/issues/8822.