ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Memory usage growing infinitely using QMIX #8763

Closed Svalorzen closed 3 years ago

Svalorzen commented 4 years ago

What is the problem?

I am training QMIX with a custom 6-agent environment, and memory usage just seems to grow without bound over time. The problem might be related to #3884, but I am not sure.

The custom environment is a wrapper around a dynamic C++ library built with Boost.Python. I can share it if needed. I have tried to limit the memory of Ray in the init() call, but it doesn't seem to have any effect. Memory usage grows slowly over time, reaching ~64GB after 1 hour.

Ray version: 0.8.0
Python version: 3.7.4
OS: CentOS 7.7.1908 (cluster)
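
For reference, a minimal sketch of capping both the worker heap and the object store in Ray 0.8.x (the values here are only illustrative; the actual script below only sets memory):

import ray

# Sketch only: cap the heap available to workers/tasks and the plasma object store.
# Both limits existed as ray.init() keyword arguments in the 0.8.x releases.
ray.init(
    memory=32 * 1024**3,               # worker heap limit (bytes)
    object_store_memory=16 * 1024**3,  # object store limit (bytes)
)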

Reproduction

This is the Python script I am using (the first two imports are the custom libraries). Ray reports, after initialization:

Starting Ray with 29.79 GiB memory available for workers and up to 18.63 GiB for objects.

But total memory usage exceeds 64GB, which makes the training crash because the machine doesn't have more.

import AIToolbox
import FFPython
import gym
from gym.spaces import Discrete, MultiDiscrete, Tuple, Dict
import ray
import numpy as np
from ray import tune
from ray.rllib.agents import qmix
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.rllib.agents.qmix.qmix_policy import ENV_STATE
from ray.tune.registry import register_env

class FF(MultiAgentEnv):
    def __init__(self, width, height, reach, maxFire, seed):
        super(FF, self).__init__()

        import AIToolbox
        self.model = FFPython.FFPython(width, height, reach, maxFire, seed)
        self.agents = width * height

    def step(self, action):
        a = [0] * self.agents
        for i in range(self.agents):
            a[i] = int(action[i])

        self.model.step(a)
        s = np.array(self.model.getState())
        r = np.array(self.model.getReward())
        ss = {}
        rr = {}
        for i in range(self.agents):
            ss[i] = { "obs": s, ENV_STATE: s }
            rr[i] = r[i]

        return ss, rr, {"__all__":False}, {}

    def reset(self):
        self.model.reset()

        s = np.array(self.model.getState())
        ss = {}
        for i in range(self.agents):
            ss[i] = { "obs": s, ENV_STATE: s }

        return ss

    def render(self):
        pass
    def close(self):
        pass

def env_creator(env_config):
    print("env_creator...")
    agents = env_config["width"] * env_config["height"]
    grouping = {
        "group_1" : list(range(agents))
    }
    ospace_one = Dict({
        "obs": MultiDiscrete([env_config["maxFire"]] * agents),
        ENV_STATE: MultiDiscrete([env_config["maxFire"]] * agents)
    })
    print(ospace_one["obs"].sample())
    ospace = Tuple([ospace_one] * agents)
    aspace = Tuple([Discrete(4)] * agents)

    return FF(
        env_config["width"],
        env_config["height"],
        env_config["reach"],
        env_config["maxFire"],
        env_config["seed"],
    ).with_agent_groups(grouping, ospace, aspace)

register_env('FF-v0', env_creator)

if __name__ == "__main__":
    config = qmix.DEFAULT_CONFIG.copy()
    config["gamma"] = 0.95

    config["env"] = 'FF-v0'
    config["env_config"] = {
        "width": 3,
        "height": 2,
        "reach": 1,
        "maxFire": 3,
        "seed": 0,
    }
    config["num_workers"] = 6

    ray.init(memory=32000000000)
    result = tune.run(
        "QMIX",
        stop = {
            "timesteps_total": 1000000,
            "episodes_total": 20,
            "time_total_s" : 3600 * 3 - 300
        },
        config = config,
        local_dir = "/scratch/brussel/102/vsc10219/ray_test_1"
    )
ericl commented 4 years ago

Can you reproduce this with a toy env? We can't debug scripts that aren't self-contained.

HanbumKo commented 4 years ago

Hi, I'm getting the same error as this issue. Any updates here?

HanbumKo commented 4 years ago

(screenshot attached)

Svalorzen commented 4 years ago

I think I resolved the problem: in my case the episode length was way too high, so Ray was trying to keep an enormous number of experiences in memory, which ate memory without bound. Not sure if this is the same as what you are seeing.
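
If it helps, a rough sketch of capping things in the trainer config (assuming the standard horizon and buffer_size keys of that RLlib version; the numbers are illustrative, not the ones I used):

config = qmix.DEFAULT_CONFIG.copy()
config["horizon"] = 200       # hard-cap episode length so rollouts can't grow unbounded
config["buffer_size"] = 5000  # shrink the replay buffer (number of stored samples)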

HanbumKo commented 4 years ago

Oh OK, mine is 2000 per episode. Do you happen to remember the length you used?

duburcqa commented 4 years ago

I have the same issue using DDPG. I use 50 workers and a replay buffer of size 100000. It is consuming more than 60GB after 50M iterations, and it has been increasing linearly since the beginning. I'm using release 0.8.6.
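
For scale, a back-of-the-envelope estimate of what a full replay buffer of that size might hold (the observation and action sizes here are hypothetical, not from my setup):

# A DDPG replay buffer stores roughly (obs, action, reward, next_obs, done) per entry.
buffer_size = 100000
obs_floats = 100   # assumed observation dimension
act_floats = 10    # assumed action dimension
bytes_per_entry = (2 * obs_floats + act_floats + 2) * 4  # float32
print(buffer_size * bytes_per_entry / 1024**3)  # ~0.08 GB, far from 60GB

So the buffer contents alone don't seem to explain the growth.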

anguyenbus commented 4 years ago

I am using SAC:

from ray import tune
from ray.rllib.agents import sac

analysis = tune.run(
    sac.SACTrainer,
    config={
        "env": "RsmAtt",
        "num_gpus": 1,
        "num_workers": 0,
        "use_pytorch": True,
        "framework": "torch",
        "buffer_size": int(1e4),
        "rollout_fragment_length": 100,
    },
    stop={"training_iteration": 200},
)

and memory keeps increasing by ~300MB each iteration.
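
One way to narrow it down (a sketch, not from my run; it assumes the same custom "RsmAtt" env, drives the trainer directly instead of through tune, and psutil only tracks the driver process, not the workers or the object store):

import psutil
import ray
from ray.rllib.agents import sac

ray.init()
trainer = sac.SACTrainer(config={"env": "RsmAtt", "framework": "torch",
                                 "buffer_size": int(1e4)})
proc = psutil.Process()
for i in range(20):
    trainer.train()
    # Print the driver-process RSS after every iteration to see the per-iteration growth.
    print(i, proc.memory_info().rss / 1024**2, "MB")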

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 3 years ago

Hi again! The issue will be closed because there has been no activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

barbetcalls commented 3 years ago

I have the same issue using DDPG. I use 50 workers and a replay buffer of size 100000. It is consuming more than 60GB after 50M iterations, and it has been increasing linearly since the beginning. I'm using release 0.8.6.

Could you find the reason for the increasing memory issue?

duburcqa commented 3 years ago

No, I still have this issue...