ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Actor] Possible extra memory consumption #37291

Open zhaolazy opened 1 year ago

zhaolazy commented 1 year ago

What happened + What you expected to happen

When using Ray actors, I have noticed that Ray holds on to memory that is never released. For example, a NumPy array in my script takes approximately 3.8 GB, but profiling shows that the actor process consistently occupies about twice that amount. This can easily lead to out-of-memory (OOM) errors in high-concurrency scenarios. Is there a good way to release this memory?

(Actor2 pid=24216) Line #    Mem usage    Increment  Occurrences   Line Contents
(Actor2 pid=24216) =============================================================
(Actor2 pid=24216)     34    125.1 MiB    125.1 MiB           1       @profile
(Actor2 pid=24216)     35                                             def get(self):
(Actor2 pid=24216)     36    125.9 MiB      0.8 MiB           1           arrays = self._driver.gen()
(Actor2 pid=24216)     37                                                 # array = copy.deepcopy(array)
(Actor2 pid=24216)     38    125.9 MiB      0.0 MiB           8           arrays = [array for array in arrays]
(Actor2 pid=24216)     39   7754.7 MiB   7628.8 MiB           1           ret = np.vstack(arrays)
(Actor2 pid=24216)     40   7754.7 MiB      0.0 MiB           1           del arrays
(Actor2 pid=24216)     41   7754.7 MiB      0.0 MiB           1           print(ret.nbytes/1024/1024)
(Actor2 pid=24216)     42   7754.7 MiB      0.0 MiB           1           return ret

(Actor2 pid=29293) Line #    Mem usage    Increment  Occurrences   Line Contents
(Actor2 pid=29293) =============================================================
  ##### HOW TO RELEASE THIS 3.9GB? ####
(Actor2 pid=29293)     34   3941.8 MiB   3941.8 MiB           1       @profile   
(Actor2 pid=29293)     35                                             def get(self):
(Actor2 pid=29293)     36   3941.8 MiB      0.0 MiB           1           arrays = self._driver.gen()
(Actor2 pid=29293)     37                                                 # array = copy.deepcopy(array)
(Actor2 pid=29293)     38   3941.8 MiB      0.0 MiB           8           arrays = [array for array in arrays]
(Actor2 pid=29293)     39   7756.5 MiB   3814.7 MiB           1           ret = np.vstack(arrays)
(Actor2 pid=29293)     40   7756.5 MiB      0.0 MiB           1           del arrays
(Actor2 pid=29293)     41   7756.5 MiB      0.0 MiB           1           print(ret.nbytes/1024/1024)
(Actor2 pid=29293)     42   7756.5 MiB      0.0 MiB           1           return ret
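As a sanity check on the figures in the profile, the expected sizes can be worked out from the reproduction script below (5 `Actor1` instances, each returning `np.random.rand(100000000)`). This is only arithmetic on the script's own numbers; the constant names are illustrative.

```python
# Sizes implied by the reproduction script: 5 actors, each returning
# 100,000,000 float64 values (8 bytes per value).
NUM_ACTORS = 5
ELEMENTS_PER_ARRAY = 100_000_000
BYTES_PER_FLOAT64 = 8
MIB = 1024 * 1024

per_array_mib = ELEMENTS_PER_ARRAY * BYTES_PER_FLOAT64 / MIB
stacked_mib = NUM_ACTORS * per_array_mib

print(f"one array:      {per_array_mib:.1f} MiB")    # 762.9 MiB
print(f"stacked result: {stacked_mib:.1f} MiB")      # 3814.7 MiB
print(f"twice stacked:  {2 * stacked_mib:.1f} MiB")  # 7629.4 MiB
```

The stacked result matches the 3814.7 MiB increment in the second profile, and twice that is close to the 7628.8 MiB increment in the first one, which is consistent with the process holding roughly two copies' worth of the data at the `np.vstack` line.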

Versions / Dependencies

Ray 2.4, Python 3.8, Linux 20.04

Reproduction script

import ray
import numpy as np
import time
from memory_profiler import profile

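class Driver:
    # Plain (non-actor) helper that owns five Actor1 handles.
    def __init__(self) -> None:
        self._actors = [
            Actor1.remote()
            for _ in range(5)
        ]

    def gen(self):
        # Fetch one array from every Actor1; blocks until all are ready.
        return ray.get([
            actor.gen.remote()
            for actor in self._actors
        ])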

@ray.remote
class Actor1:
    def gen(self):
        self._x = np.random.rand(100000000)
        return self._x

@ray.remote
class Actor2:
    def __init__(self, driver: Driver):
        self._driver = driver

    @profile
    def get(self):
        arrays = self._driver.gen()
        arrays = [array for array in arrays]
        ret = np.vstack(arrays)
        del arrays
        print(ret.nbytes/1024/1024)
        return ret

    @profile
    def x(self):
        time.sleep(10)
        return 1+1

if __name__ == "__main__":
    configs = {
        "memory_monitor_refresh_ms": 0,
        "memory_usage_threshold": 1,
        "free_objects_period_milliseconds": 0,
    }
    ray.init(_system_config=configs)

    driver = Driver()
    a2_ref = Actor2.remote(driver)

    while True:
        ray.get(a2_ref.get.remote())
        ray.get(a2_ref.x.remote())
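One way to narrow this down is to check whether the retained memory is private heap or shared pages. memory_profiler typically reports resident set size, which also counts shared-memory pages the process has touched. Assuming (not verified for this case) that part of the retained ~3.9 GB is shared memory mapped from Ray's object store, a breakdown like the following would show it. The helper name `rss_breakdown_mib` is hypothetical; it uses only the standard library and the Linux `/proc/<pid>/statm` file.

```python
import os


def rss_breakdown_mib(pid="self"):
    """Split a process's resident memory into shared and private parts.

    Reads /proc/<pid>/statm (Linux only). Returns None if the file
    cannot be read, e.g. on a platform without /proc.
    """
    try:
        with open(f"/proc/{pid}/statm") as f:
            fields = f.read().split()
    except OSError:
        return None

    page_size = os.sysconf("SC_PAGE_SIZE")
    mib = 1024 * 1024
    resident = int(fields[1]) * page_size  # total resident pages
    shared = int(fields[2]) * page_size    # resident file-backed / shared-memory pages
    return {
        "rss": resident / mib,
        "shared": shared / mib,
        "private": (resident - shared) / mib,
    }


if __name__ == "__main__":
    print(rss_breakdown_mib())
```

Calling this at the start and end of `Actor2.get` (or passing the actor's pid from another shell) would show how much of the reported usage is in the `shared` column rather than `private`. If most of it is shared, the number says more about how the process maps shared memory than about a leak in the actor's own heap.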

Issue Severity

None

zhaolazy commented 1 year ago

Could anybody please help me? 😢