[Core] Nested remote function slow start up time

sitaowang1998 commented 2 months ago

What happened + What you expected to happen

Symptom

Ray has significant overhead when running nested remote function for first time.

In the chart above, blue, green and purple bar stands for the time used by the actual functions, orange and red is the time spent by Ray in between. For the first batch of tasks, Ray takes a lot of time to start each nested remote function. For later tasks, such overhead is more rare.

Analysis

Since there are more tasks than CPUs, the top layer function consumes all CPUs, and Ray have to start new worker process for second and third layer functions. The logs show that most of the time is spent between the worker starting a new worker process and the new worker process initializing CoreWorker and connect itself to raylet. However, Ray spend much less time creating worker processes for the first layer functions.

Versions / Dependencies

Ray 2.9.3 Python 3.8.0 OS: Ubuntu 18.04.1 LTS

Reproduction script

import ray
import time
from datetime import datetime

import json

@ray.remote
def sleep_iter(iter: int):
    start_time = datetime.now()
    time.sleep(0.3)
    end_time = datetime.now()

    iter = iter - 1
    if iter > 0:
        times = ray.get(sleep_iter.remote(iter))
        return [start_time.timestamp() * 1000, end_time.timestamp() * 1000] + times

    return [start_time.timestamp() * 1000, end_time.timestamp() * 1000]

num_tasks = 100 # larger than number of cpus in the cluster
tasks = [sleep_iter.remote(3) for _ in range(num_tasks)]
ray.get(tasks)

Issue Severity

Low: It annoys or frustrates me.

jjyao commented 2 months ago

Hi @sitaowang1998, maybe you are hitting worker_maximum_startup_concurrency?

sitaowang1998 commented 2 months ago

There is no warning about exceeding the worker_maximum_startup_concurrency in the log.

ray-project / ray