ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Thread leak when submitting actor tasks to actors, eventually hitting the system limit #33957

Closed · rkooo567 closed this issue 1 year ago

rkooo567 commented 1 year ago

What happened + What you expected to happen

I have an actor that keeps track of file locations. Exactly one instance of it is running, and it is an async actor. However, as training runs and more and more remote calls are made to the actor, the thread count used by the actor process keeps increasing until it eventually hits the system limit. I can increase the system limit, but I would like to understand what is happening.

paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        1176
paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        2785
paperspace@pscjoxbd3:~$ cat /proc/4163053/status | grep Threads
Threads:        4949

This was over a period of about three minutes. Does each .remote call to an actor method create a new thread?
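
For completeness, the same check can be run from inside the actor process itself; a minimal sketch, assuming Linux's /proc filesystem (read_thread_count is a hypothetical helper, not part of the repro below):

import re

def read_thread_count() -> int:
    # /proc/self/status carries a "Threads:" line on Linux.
    with open("/proc/self/status") as f:
        match = re.search(r"^Threads:\s+(\d+)", f.read(), re.MULTILINE)
    return int(match.group(1)) if match else -1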
# %%
import ray
ray.init()

# %%
import time

try:
    # Clean up any leftover actor from a previous run.
    a = ray.get_actor("testactor")
    ray.kill(a)
except ValueError:  # no actor named "testactor" exists yet
    pass

# %%
@ray.remote(resources={"ishead": 1})
class TestActor:
    def __init__(self):
        self.counter = 0

    async def increment(self):
        self.counter += 1
        return self.counter

    async def get_count(self):
        return self.counter

@ray.remote(num_cpus=0.1)
def call_actor(a):
    # Fire-and-forget: the returned ObjectRef is deliberately discarded.
    a.increment.remote()

@ray.remote(num_cpus=0.1)
def spawn_task(a):
    # Submit actor calls at roughly 200 calls/second; SPREAD scatters
    # the call_actor tasks across the cluster, so the actor receives
    # requests from many distinct workers.
    for _ in range(10000000):
        call_actor.options(scheduling_strategy="SPREAD").remote(a)
        time.sleep(0.005)
    print("done")

ta = TestActor.options(name="testactor").remote()
# Launch 30 concurrent spawners.
for _ in range(30):
    spawn_task.options(scheduling_strategy="SPREAD").remote(ta)

# %%
ray.get(ta.get_count.remote())

Versions / Dependencies

master

Reproduction script

n/a

Issue Severity

High: It blocks me from completing my task.

Wordyka commented 1 year ago

It's possible that the issue you're seeing is related to the way Ray handles the lifecycle of actors and the management of actor state.

When a Ray actor is created, Ray spawns a new process to run the actor's methods. This process will remain active for the lifetime of the actor, even if the actor is idle and not currently processing any requests. Each actor process has a limited number of threads that it can use to handle incoming requests, and if the actor receives more requests than it can handle with its existing threads, it will spawn additional threads to handle the load.

However, if the actor receives a large number of requests over time, and if those requests are slow to complete, it's possible that the actor's thread count can grow to a very large number, leading to the issue you're seeing. One possible cause of slow requests could be if the actor's state is very large, and if each remote method call needs to retrieve or modify a significant portion of that state.

To diagnose the issue further, you may want to use the ray state command to inspect the memory usage of the actor and its associated actor process. You can also use the ray timeline command to inspect the timing of the actor's method calls, to see if there are any slow calls that may be contributing to the thread-count growth.
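
As an aside, the ray state wording above may refer to the Ray state API; on recent releases it can be queried from Python instead (a sketch, assuming ray.util.state is available on your Ray version):

from ray.util.state import list_actors

# Print each actor's ID, lifecycle state, and host process PID so the
# leaking process can be identified.
for actor in list_actors():
    print(actor.actor_id, actor.state, actor.pid)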

To address the issue, you may want to consider breaking up the actor's state into smaller chunks, or using a different storage mechanism such as a distributed database or a message queue. You could also consider batching remote calls to the actor to reduce the overall number of requests it receives. Finally, you could consider implementing some kind of timeout or rate-limiting mechanism on the remote method calls to prevent them from overwhelming the actor's thread pool.
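
For example, a minimal sketch of the batching idea, assuming a hypothetical increment_by method (not part of the reporter's actor): many local increments collapse into a single remote call, so the actor sees far fewer requests.

import ray

@ray.remote
class BatchedActor:
    def __init__(self):
        self.counter = 0

    async def increment_by(self, n: int) -> int:
        # One request applies a whole batch of increments.
        self.counter += n
        return self.counter

@ray.remote(num_cpus=0.1)
def call_actor_batched(a, batch_size: int = 100):
    # A single remote call instead of batch_size separate calls.
    a.increment_by.remote(batch_size)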

rynewang commented 1 year ago

According to https://docs.ray.io/en/latest/ray-core/actors/concurrency_group_api.html#defining-concurrency-groups, by default each actor can't spawn more than 1000 threads? It should be a bug if we end up with 4000+ threads in one process.
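
For reference, a minimal sketch of the concurrency-group API from the linked docs, which caps the actor's thread pool per group up front:

import ray

@ray.remote(concurrency_groups={"writes": 2, "reads": 4})
class GroupedActor:
    def __init__(self):
        self.counter = 0

    @ray.method(concurrency_group="writes")
    async def increment(self):
        self.counter += 1
        return self.counter

    @ray.method(concurrency_group="reads")
    async def get_count(self):
        return self.counter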

rynewang commented 1 year ago

The cause is like this:

When we have a lot of workers sending actor task requests to the actor process, we maintain a per-sender (keyed by worker ID) task queue, and each queue spawns a thread. These queues are never GC'd, so we end up with an excessive number of queues and threads.

https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L230
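
Not Ray's actual C++ code, but a Python sketch of the failure mode described above: a per-sender queue map that only ever grows, with one thread per queue and nothing that removes an entry when its sender goes away.

import queue
import threading

class ActorTaskReceiver:
    def __init__(self):
        self.queues = {}  # sender worker ID -> task queue

    def submit(self, sender_worker_id, task):
        q = self.queues.get(sender_worker_id)
        if q is None:
            # First task from this sender: create a queue plus a
            # dedicated drain thread. Neither is ever cleaned up, so N
            # distinct senders cost N threads for the process lifetime.
            q = queue.Queue()
            self.queues[sender_worker_id] = q
            threading.Thread(target=self._drain, args=(q,), daemon=True).start()
        q.put(task)

    def _drain(self, q):
        while True:
            q.get()()  # run the next queued task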