ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Memory Usage #733

Closed: gray-m closed this issue 6 years ago

gray-m commented 7 years ago

I've been attempting to use Ray for some reinforcement learning applications. When I run multiple agents (implemented as Ray actors so that they train in parallel), the memory used by the driver quickly approaches the total memory of the machine I'm using (8GB). I suspect this is because the object store serializes the arguments of all of my calls to the agents' train functions, and I was wondering whether there is a way to remove objects that will not be read again, or to avoid keeping certain function arguments after they have been read once.

Thanks for any help.

robertnishihara commented 7 years ago

What happens when you run the application longer? Does it run out of memory and crash? Or does the memory usage plateau?

The object store does not evict objects until it fills up. So it will use more and more memory until it gets full, at which point it will start evicting old objects. This may be what you're seeing.

One thing that isn't currently exposed through the API, but could be tried, is starting the object store with less memory. As a quick hack, you could modify this line https://github.com/ray-project/ray/blob/master/python/ray/services.py#L629 to pass in a different amount of memory.
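
As a rough sketch of the kind of change (the actual expression at that line may differ depending on your version, and the variable name below is illustrative, not the real one):

# Hypothetical edit in python/ray/services.py, where the plasma store's
# capacity is computed. The variable name here is an illustration.
import psutil

# Cap the object store at 10% of system memory so eviction kicks in sooner.
plasma_store_memory = int(0.1 * psutil.virtual_memory().total)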

Let me know if that has the desired effect.

There currently isn't a way to evict objects earlier, though it would be nice to provide a more aggressive eviction policy or something along those lines. We actually experimented with implementing reference counting in the past to figure out when an object will no longer be used, but it ended up adding more overhead than we wanted. Something like this could probably be done in the future.

robertnishihara commented 7 years ago

Also, if you have a simpler example that I can run to reproduce the problem, feel free to share it, and I can take a look.

gray-m commented 7 years ago

Thanks for the quick response!

Unfortunately, that doesn't seem to work. I changed it to int(0.01 * psutil.virtual_memory().total), and while it does appear to evict objects when it runs out of room ("There is not enough space to create this object, so evicting 15129 objects to free up 16490610 bytes."), the memory usage I see in htop still increases steadily.

Here's a simple example that produces similar behavior to my program:

import ray
ray.init()

@ray.remote
class Foo(object):

  def bar(self, list_arg, float_arg, bool_arg):
    return 0

actors = [Foo.remote() for _ in range(20)]

# Submit many small tasks; the results are never retrieved with ray.get.
for _ in range(100000):
  for actor in actors:
    args = (list(range(20)), 0.0, True)
    actor.bar.remote(*args)

robertnishihara commented 7 years ago

Thanks for sharing the script! The script ran alright for me without using much memory (on MacOS). What platform are you using?

When we put objects in the object store, the object store creates memory-mapped files to store them. Each memory-mapped file holds multiple objects and so can be much bigger than the individual objects. It's possible that htop is counting the full size of the memory-mapped files, which would explain why it reports a lot of memory usage even though the objects here are pretty small.

I'm not totally sure how htop does bookkeeping when multiple processes are using shared memory, so one possibility is that there is some double counting.
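
One way to check is to compare each process's RSS (which counts shared memory-mapped pages once per mapping process) with its USS (memory unique to the process). A rough sketch using psutil (the process-name filter is just a heuristic; this is my own diagnostic, not something Ray provides):

import psutil

# Summing RSS across Ray processes can double count the shared object store;
# USS excludes shared pages, so it gives a better per-process figure.
for proc in psutil.process_iter():
    try:
        name = proc.name()
        if 'ray' in name or 'plasma' in name or 'python' in name:
            mem = proc.memory_full_info()  # reads /proc/<pid>/smaps on Linux
            print(proc.pid, name,
                  'rss=%dMB uss=%dMB' % (mem.rss >> 20, mem.uss >> 20))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass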

gray-m commented 7 years ago

No problem! I'm running it one more time to see if the same problem happens. The reason I originally posted the issue was that when I first ran the training script, my computer eventually slowed down to the point that I thought my keyboard and mouse were broken. I tested it again using htop to view memory usage, and memory seemed to get close to full right as things became unresponsive (I thought maybe it was thrashing). I'm running Ubuntu 16.04 at work, but I'll try it on my Macbook at home as well.

gray-m commented 7 years ago

This time at around the 50,000th iteration of the for loop, my computer became slower to respond (htop showed about 7.3GB memory usage, and 273MB swap). I'll post again once I've run it on my laptop.

gray-m commented 7 years ago

Alright, it seemed to work fine on my laptop running Mac OS. The script finished without any problems. I'm not sure why there would be such a large difference between the Mac and Ubuntu platforms, though.

robertnishihara commented 7 years ago

That's good to hear :) Maybe it was starting to swap on your other machine, although that's a bit strange because the objects you were using were pretty small, so it shouldn't have approached anything near 8GB (also, starting the object store with less memory should have fixed that).

Can the script be simplified even further? For example, does the problem still happen if you do the following?

  1. Replace args = (list(range(20)), 0.0, True) with just args = 1 (and change the function signature appropriately). Does the problem happen more quickly or take longer to happen if you make the args bigger or smaller?
  2. Remove the actors and just use a remote function (a complete script for this variant is sketched below):
    @ray.remote
    def bar(args):
        return 0
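
For reference, a complete script for variant 2 might look like this (the iteration count is just a placeholder):

import ray

ray.init()

@ray.remote
def bar(args):
    return 0

# Same submission pattern as the actor version, with a plain remote function.
for _ in range(100000):
    bar.remote(1)
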
gray-m commented 7 years ago

Alright, I tried both of those. When the argument to the function was just args = 1, the memory usage was similar to when more arguments were passed. Remote functions on their own don't seem to cause any significant memory consumption.

gray-m commented 7 years ago

@robertnishihara Alright, I'm currently running the script on my laptop, but this time running Ubuntu 16.04. There seems to be similar memory consumption to what I was seeing at work, so I think it might be an Ubuntu-specific problem.

gray-m commented 6 years ago

@robertnishihara Has this been resolved/looked into? I understand that there are other pressing issues to deal with.

robertnishihara commented 6 years ago

I haven't been able to reproduce it. I tried on Ubuntu 16.04 but didn't see the problem. I'd like to get it resolved. Have you been able to reproduce it on EC2? If so, it'd be easy to share an AMI or something so I could see exactly what you're seeing.

20chase commented 6 years ago

@robertnishihara

I also came across the memory usage problem with RL environments. Here is my test code:


#! /usr/bin/env python3
import ray
import gym

import numpy as np

@ray.remote
class RayEnvironment(object):
    def __init__(self, env):
        self.env = env
        self.done = False  # ensure step() works even before reset() is called
        state = self.env.reset()
        self.shape = state.shape

    def step(self, action):
        if self.done:
            return [np.zeros(self.shape), 0.0, True]
        else:
            state, reward, done, info = self.env.step(action)
            self.done = done
            return [state, reward, done]

    def reset(self):
        self.done = False
        return self.env.reset()

def run_episode(envs):
    terminates = [False for _ in range(len(envs))]
    terminates_idxs = [0 for _ in range(len(envs))]

    states = [env.reset.remote() for env in envs]
    states = ray.get(states)
    while not all(terminates):
        next_step = [env.step.remote([0.]) for env in envs]
        next_step = ray.get(next_step)

        states = [batch[0] for batch in next_step]
        rewards = [batch[1] for batch in next_step]
        dones = [batch[2] for batch in next_step]

        for i, d in enumerate(dones):
            if d:
                terminates[i] = True
            else:
                terminates_idxs[i] += 1

if __name__ == '__main__':
    ray.init()
    env_name = 'Pendulum-v0'

    envs = [gym.make(env_name) for _ in range(8)]
    envs = [RayEnvironment.remote(env) for env in envs]

    for i in range(100000):
        run_episode(envs)  # run_episode collects rollouts but returns nothing
        print(i)

The code runs on my laptop, which has 12 GB of memory, under Ubuntu 16.04. The problem is that it gradually uses up all of the memory, even though 12 GB should be more than enough for this program.

Looking forward to your reply :)

robertnishihara commented 6 years ago

Interesting. If you open up top, which process is using up all the memory? It could be related to https://groups.google.com/forum/#!topic/ray-dev/bmF33z9HtKA.

20chase commented 6 years ago

Here is my monitor information:

[screenshot: top output showing memory usage]

eparisotto commented 6 years ago

Was a workaround ever found for this? I ran into the same bug running similar code using the GymEnvironment actor class given in the tutorial at http://ray.readthedocs.io/en/latest/actors.html:


import os
import ray
import gym

import numpy        as np
import numpy.random as npr

ray.init()

ENV_NAME='Pong-v0'
NUM_ENVS=10

@ray.remote
class GymEnvironment(object):
    def __init__(self, name):
        self.env = gym.make(name)

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        return self.env.reset()

os.environ['OMP_NUM_THREADS'] = '1'

envs = [GymEnvironment.remote(ENV_NAME) for i in range(NUM_ENVS)]
states = [envs[i].reset.remote() for i in range(NUM_ENVS)]

env = gym.make(ENV_NAME)

# Run random-action episodes indefinitely, resetting the envs each time.
while True:
    states = [ray.get(states[i]) for i in range(NUM_ENVS)]

    envs_alive = [True]*NUM_ENVS
    while True:
        step_ret = [ray.get(envs[i].step.remote(npr.randint(env.action_space.n))) if envs_alive[i] else None
                    for i in range(NUM_ENVS)]
        for i in range(NUM_ENVS):
            if envs_alive[i]:
                states[i] = step_ret[i][0]
                envs_alive[i] = envs_alive[i] and not step_ret[i][2]
        if not np.any(envs_alive):
            break
    states = [envs[i].reset.remote() for i in range(NUM_ENVS)]

The redis-server process grows in memory usage until there is none left, crashing the program with the following traceback:

Traceback (most recent call last):
  File "debug.py", line 38, in <module>
    for i in range(NUM_ENVS):
  File "debug.py", line 38, in <listcomp>
    for i in range(NUM_ENVS):
  File "/.../ray/python/ray/worker.py", line 2315, in get
    value = worker.get_object([object_ids])[0]
  File "/.../ray/python/ray/worker.py", line 483, in get_object
    i + ray._config.worker_fetch_request_size())])
  File "plasma.pyx", line 558, in pyarrow.plasma.PlasmaClient.fetch
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Broken pipe

Ray occasionally prints:

There is not enough space to create this object, so evicting X objects to free up Y bytes.

But the memory usage still slowly increases.

I tried it on two servers, one running Ubuntu 14.04 and the other Ubuntu 17.04. Both were using Python 3 (Anaconda 5.1.0) and had Ray 0.3.1 installed, via both 'pip install ray' and 'pip install git+https://github.com/ray-project/ray.git#subdirectory=python'.
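
To double-check that it is Redis that grows, one can also query the server directly with redis-py (the host and port below are placeholders; use the address that ray.init() prints):

import redis

# Placeholders: substitute the Redis address printed by ray.init().
r = redis.StrictRedis(host='127.0.0.1', port=6379)
print(r.info('memory')['used_memory_human'])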

robertnishihara commented 6 years ago

As of #1824, you can flush the memory used by the Redis servers by doing:

ray.experimental.flush_redis_unsafe()
ray.experimental.flush_task_and_object_metadata_unsafe()

#1824 hasn't been merged yet, but you can use the function definitions from the PR. This is unsafe in the sense that it should only be done if you don't need any of the previous task/object metadata.

Please try this as a workaround for now for the issue where Redis is using too much memory.
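
For example, in a long-running loop you could flush periodically (a sketch; the interval is arbitrary):

import ray

ray.init()

for i in range(100000):
    # ... submit tasks / step environments here ...
    if i % 1000 == 0:
        # Only safe if no earlier task/object metadata is still needed.
        ray.experimental.flush_redis_unsafe()
        ray.experimental.flush_task_and_object_metadata_unsafe()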

robertnishihara commented 6 years ago

Closing for now. Subsequent problems can be raised in new issues.