gray-m closed 6 years ago
What happens when you run the application longer? Does it run out of memory and crash? Or does the memory usage plateau?
The object store does not evict objects until it fills up. So it will use up more and more memory until it gets full at which point it will start evicting old objects. This may be what you're seeing.
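To illustrate the behavior described above, here is a minimal toy model, not Ray's actual object store. The class name `ToyObjectStore` and the oldest-first eviction order are assumptions made for the sketch. It only shows why usage climbs to the configured capacity before any eviction happens.

```python
from collections import OrderedDict


class ToyObjectStore:
    """Hypothetical fixed-capacity store that evicts only when it is full."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.objects = OrderedDict()  # object_id -> size, oldest first

    def put(self, object_id, size):
        if size > self.capacity_bytes:
            raise MemoryError("object is larger than the whole store")
        evicted = []
        # Nothing is evicted until the new object would not fit.
        while self.used_bytes + size > self.capacity_bytes:
            old_id, old_size = self.objects.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_id)
        self.objects[object_id] = size
        self.used_bytes += size
        return evicted


store = ToyObjectStore(capacity_bytes=100)
for i in range(10):
    assert store.put(i, 10) == []  # usage climbs to capacity, no eviction yet
print(store.used_bytes)            # 100
print(store.put(10, 10))           # [0]
```

In this model a smaller capacity makes eviction start earlier, which is the effect the quick hack of starting the store with less memory is aiming for.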
One thing that's currently not exposed through the API, but could be tried, is to start the object store with less memory. As a quick hack, you could try modifying this line https://github.com/ray-project/ray/blob/master/python/ray/services.py#L629 to pass in a different amount of memory.
Let me know if that has the desired effect.
There currently isn't a way to evict objects earlier, though it would be nice to provide a way to run with a more aggressive eviction policy or something like that. We've actually experimented with implementing reference counting in the past to figure out when an object will no longer be used, but it ended up adding more overhead than we wanted. But some things like this could probably be done in the future.
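As a rough picture of the reference-counting idea mentioned above, the following sketch frees an object as soon as its last holder releases it. It is not what Ray implemented; `RefCountedStore` and its method names are hypothetical. The bookkeeping on every acquire and release is the kind of overhead the comment refers to.

```python
class RefCountedStore:
    """Hypothetical store that frees an object when its count reaches zero."""

    def __init__(self):
        self.values = {}
        self.counts = {}

    def put(self, object_id, value):
        self.values[object_id] = value
        self.counts[object_id] = 1  # the creator holds the first reference

    def add_ref(self, object_id):
        self.counts[object_id] += 1

    def release(self, object_id):
        self.counts[object_id] -= 1
        if self.counts[object_id] == 0:
            # No holder left, so the object can be freed immediately.
            del self.values[object_id]
            del self.counts[object_id]
            return True
        return False


refs = RefCountedStore()
refs.put("arg-1", [0] * 20)
refs.add_ref("arg-1")          # a task takes a reference to its argument
print(refs.release("arg-1"))   # False: the task still holds it
print(refs.release("arg-1"))   # True: last reference gone, object freed
```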
Also, if you have a simpler example that I can run to reproduce the problem, feel free to share it, and I can take a look.
Thanks for the quick response!
Unfortunately, that seems not to work. I changed it to int(0.01 * psutil.virtual_memory().total), and while it does appear to be evicting objects when there is not enough room ("There is not enough space to create this object, so evicting 15129 objects to free up 16490610 bytes."), the memory usage I see on htop still increases steadily.
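A quick check of the numbers in that eviction message gives the average size of the evicted objects. The variable names below are only for illustration.

```python
# Figures copied from the eviction message quoted above.
evicted_objects = 15129
freed_bytes = 16490610

avg_bytes, remainder = divmod(freed_bytes, evicted_objects)
print(avg_bytes, remainder)  # 1090 0
```

So each evicted object was about 1 KB, which suggests the objects themselves are small and the object store's contents are unlikely to account for several gigabytes on their own.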
Here's a simple example that produces similar behavior to my program:
import ray

ray.init()

@ray.remote
class Foo(object):
    def bar(self, list_arg, float_arg, bool_arg):
        return 0

actors = [Foo.remote() for _ in range(20)]
for _ in range(100000):
    for actor in actors:
        args = (list(range(20)), 0.0, True)
        actor.bar.remote(*args)
Thanks for sharing the script! The script ran alright for me without using much memory (on MacOS). What platform are you using?
When we put objects in the object store, the object store creates memory-mapped files to store the objects. Each memory-mapped file stores multiple objects and so can be much bigger than the individual objects. It's possible that htop is counting the full size of the memory-mapped files which is why it's using a lot of memory even though the objects here are pretty small.
I'm not totally sure how htop does bookkeeping when multiple processes are using shared memory, so one possibility is that there is some double counting.
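One way to check for double counting on Linux is to compare a process's resident set size (Rss), which counts shared pages in full, with its proportional set size (Pss), which splits shared pages among the processes mapping them. The sketch below sums both fields from `/proc/<pid>/smaps`. The helper names are hypothetical and the sample text is illustrative, not a real measurement.

```python
import os


def sum_smaps_fields(smaps_text, fields=("Rss", "Pss")):
    """Sum the given per-mapping fields (in kB) over an smaps dump."""
    totals = {name: 0 for name in fields}
    for line in smaps_text.splitlines():
        key, _, rest = line.partition(":")
        if key in totals:
            totals[key] += int(rest.split()[0])
    return totals


def memory_summary(pid="self"):
    """Return Rss/Pss totals for a process, or None if smaps is unavailable."""
    path = "/proc/{}/smaps".format(pid)
    if not os.path.exists(path):  # for example on macOS
        return None
    with open(path) as f:
        return sum_smaps_fields(f.read())


# Illustrative input: one 4096 kB shared mapping attached by two processes,
# so each process is charged half of it in Pss.
sample = "Size: 4096 kB\nRss: 4096 kB\nPss: 2048 kB\n"
print(sum_smaps_fields(sample))  # {'Rss': 4096, 'Pss': 2048}
```

If the Rss totals of the workers and the object store add up to far more than their Pss totals, the per-process numbers are mostly the same shared memory counted several times.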
No problem! I'm running it one more time to see if the same problems happen. The reason I originally posted the issue was that when I first ran the training script, my computer eventually slowed down to the point that I thought my keyboard and mouse were broken. I tested it again using htop to view memory usage, and it seemed to get close to full right as things became unresponsive that time (I thought maybe it was thrashing). I'm running Ubuntu 16.04 at work, but I'll try it on my MacBook at home as well.
This time at around the 50,000th iteration of the for loop, my computer became slower to respond (htop showed about 7.3GB memory usage, and 273MB swap). I'll post again once I've run it on my laptop.
Alright, it seemed to work fine on my laptop running Mac OS. The script finished without any problems. I'm not sure why there would be such a large difference between the Mac and Ubuntu platforms, though.
That's good to hear :) Maybe it was starting to swap on your other machine, although that's a bit strange because the objects you were using were pretty small, so it shouldn't have approached anything near 8GB (also, starting the object store with less memory should have fixed that).
Can the script be simplified even further? For example, does the problem still happen if you replace args = (list(range(20)), 0.0, True) with just args = 1 (and change the function signature appropriately)? Does the problem happen more quickly or take longer to happen if you make the args bigger or smaller?
@ray.remote
def bar(args):
    return 0
Alright, tried both of those. When the argument to the function was just arg = 1, the memory usage was similar to when more arguments were passed. Remote functions on their own seem not to cause any great amount of memory consumption.
@robertnishihara Alright, I'm currently running the script on my laptop, but this time running Ubuntu 16.04. There seems to be similar memory consumption to what I was seeing at work, so I think it might be an Ubuntu-specific problem.
@robertnishihara Has this been resolved/looked into? I understand that there are other pressing issues to deal with.
I haven't been able to reproduce it. I tried on Ubuntu 16.04 but didn't see the problem. I'd like to get it resolved. Have you been able to reproduce it on EC2? If so, it'd be easy to share an AMI or something so I could see exactly what you're seeing.
@robertnishihara I also came across the memory usage problem with RL environments. Here is my test code:
#! /usr/bin/env python3
import ray
import gym
import numpy as np

@ray.remote
class RayEnvironment(object):
    def __init__(self, env):
        self.env = env
        state = self.env.reset()
        self.shape = state.shape

    def step(self, action):
        if self.done:
            return [np.zeros(self.shape), 0.0, True]
        else:
            state, reward, done, info = self.env.step(action)
            self.done = done
            return [state, reward, done]

    def reset(self):
        self.done = False
        return self.env.reset()

def run_episode(envs):
    terminates = [False for _ in range(len(envs))]
    terminates_idxs = [0 for _ in range(len(envs))]
    states = [env.reset.remote() for env in envs]
    states = ray.get(states)
    while not all(terminates):
        next_step = [env.step.remote([0.]) for i, env in enumerate(envs)]
        next_step = ray.get(next_step)
        states = [batch[0] for batch in next_step]
        rewards = [batch[1] for batch in next_step]
        dones = [batch[2] for batch in next_step]
        for i, d in enumerate(dones):
            if d:
                terminates[i] = True
            else:
                terminates_idxs[i] += 1

if __name__ == '__main__':
    ray.init()
    env_name = 'Pendulum-v0'
    envs = [gym.make(env_name) for _ in range(8)]
    envs = [RayEnvironment.remote(envs[i]) for i in range(2)]
    for i in range(100000):
        trajectories = run_episode(envs)
        print(i)
The code ran on my laptop, which has 12 GB of memory and runs Ubuntu 16.04. The problem is that the program uses up all of the memory, although I think 12 GB should be enough for it.
Looking forward to your reply :)
Interesting, if you open up top, which process is using up all the memory? It could be related to https://groups.google.com/forum/#!topic/ray-dev/bmF33z9HtKA.
Here is my monitor information:
Was a workaround ever found for this? I run into the same bug running similar code using the GymEnvironment Actor class given in the tutorial at http://ray.readthedocs.io/en/latest/actors.html :
import os
import ray
import gym
import numpy as np
import numpy.random as npr

ray.init()
ENV_NAME='Pong-v0'
NUM_ENVS=10

@ray.remote
class GymEnvironment(object):
    def __init__(self, name):
        self.env = gym.make(name)
    def step(self, action):
        return self.env.step(action)
    def reset(self):
        return self.env.reset()

os.environ['OMP_NUM_THREADS'] = '1'
envs = [GymEnvironment.remote(ENV_NAME) for i in range(NUM_ENVS)]
states = [envs[i].reset.remote() for i in range(NUM_ENVS)]
env = gym.make(ENV_NAME)
while True:
    states = [ray.get(states[i]) for i in range(NUM_ENVS)]
    envs_alive = [True]*NUM_ENVS
    while True:
        step_ret = [ray.get(envs[i].step.remote(npr.randint(env.action_space.n))) if envs_alive[i] else None
                    for i in range(NUM_ENVS)]
        for i in range(NUM_ENVS):
            if envs_alive[i]:
                states[i] = step_ret[i][0]
                envs_alive[i] = envs_alive[i] and not step_ret[i][2]
        if not np.any(envs_alive):
            break
    states = [envs[i].reset.remote() for i in range(NUM_ENVS)]
The redis-server process grows in memory usage until there is none left, crashing the program with the following traceback:
Traceback (most recent call last):
  File "debug.py", line 38, in <module>
    for i in range(NUM_ENVS):
  File "debug.py", line 38, in <listcomp>
    for i in range(NUM_ENVS):
  File "/.../ray/python/ray/worker.py", line 2315, in get
    value = worker.get_object([object_ids])[0]
  File "/.../ray/python/ray/worker.py", line 483, in get_object
    i + ray._config.worker_fetch_request_size())])
  File "plasma.pyx", line 558, in pyarrow.plasma.PlasmaClient.fetch
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Broken pipe
Ray occasionally prints:
There is not enough space to create this object, so evicting X objects to free up Y bytes.
But the memory usage still slowly increases.
I tried it on two servers, one running Ubuntu 14.04 and another running Ubuntu 17.04. They were also using Python 3 (Anaconda 5.1.0) and had ray version 0.3.1 installed using both 'pip install ray' and 'pip install git+https://github.com/ray-project/ray.git#subdirectory=python'.
As of #1824, you can flush memory used by the redis servers by doing
ray.experimental.flush_redis_unsafe()
ray.experimental.flush_task_and_object_metadata_unsafe()
Please try this as a workaround for now for the issue where Redis is using too much memory.
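In a long-running loop, one possible way to apply this workaround is to call the flush functions every so many iterations rather than on every step. The helper below is a hypothetical sketch, and `PeriodicFlusher` is not part of Ray. It uses a stand-in callback so that it runs without Ray installed.

```python
class PeriodicFlusher:
    """Call a set of flush callbacks once every `every` ticks."""

    def __init__(self, flush_fns, every):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.flush_fns = list(flush_fns)
        self.every = every
        self.ticks = 0

    def tick(self):
        self.ticks += 1
        if self.ticks % self.every == 0:
            for fn in self.flush_fns:
                fn()
            return True
        return False


# Stand-in callback so the sketch runs without Ray installed.
calls = []
flusher = PeriodicFlusher([lambda: calls.append("flush")], every=1000)
for _ in range(2500):
    flusher.tick()
print(len(calls))  # 2
```

With Ray, the callbacks would be `ray.experimental.flush_redis_unsafe` and `ray.experimental.flush_task_and_object_metadata_unsafe` from the comment above. Their names suggest that flushing is not safe in every situation, so the interval is something to tune with care.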
Closing for now. Subsequent problems can be raised in new issues.
I've been attempting to use Ray for some reinforcement learning applications. When I have run multiple agents (implemented as ray Actors so that they train in parallel), the memory used by the driver quickly reaches nearly the memory of the computer I'm using (8GB). I suspect that this is due to the object store serializing the arguments of all of my calls to the agents' train functions, and I was wondering whether there was a way to remove objects if they will not be read again, or choose not to keep certain function arguments after they have been read once.
Thanks for any help.