ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Serve IDLE processes using 90% of CPU time #46964

Closed: DanielChern closed this 1 week ago

DanielChern commented 1 month ago

Hey Ray team. We noticed strange behavior with ray::IDLE threads: when our Ray applications use ~80% of CPU time, 90% of that time is attributed to a task called ray::IDLE. We captured it using perf sched record -- sleep 10 && perf sched latency -s runtime (screenshot attached).

Also, you can see that each Ray Serve process has several IDLE processes, some of them running for hours (screenshot attached).

Is this expected and just a misleading thread name, or is there something wrong with our setup?

Versions / Dependencies

Node: GCP C3-standard-22 (22 vCPU, 88 GB RAM)
Ray: ray-ml:2.24.0-py310-gpu
Ray Operator: v1.1.0
OS: Linux 6.1.85

Reproduction script

import numpy as np
import PIL.Image as Image
from nbcm.embeddings.clip import ClipEncoder
from ray import serve
from typing import List

@serve.deployment
class ClipServe:
    def __init__(self):
        self._model = ClipEncoder()
        # ... some internal logic (elided; presumably also sets up self.metric_logger) ...

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=1)
    async def process(self, images: List[np.ndarray]) -> List[List[float]]:
        self.metric_logger.log_batch_size(len(images))
        image_list = [Image.fromarray(img) for img in images]
        embeddings = self._model.process(image_list)
        return [embedding.flatten().astype(float).tolist() for embedding in embeddings]

app = ClipServe.bind()

Ray start command (3 instances per node):

ulimit -n 65536; ray start \
  --num-cpus=7 \
  --memory=25000000000 \
  --resources="{\"custom-node-cpu\": 1}" \
  --address=clip-model-rayservice-raycluster-rnspv-head-svc.clip-model.svc.cluster.local:6379 \
  --metrics-export-port=8080 \
  --block \
  --dashboard-agent-listen-port=52365

Issue Severity

Medium: It is a significant difficulty but I can work around it.

rkooo567 commented 1 month ago

Normally, when a new worker process is created, it is named ray::IDLE. Whenever an actor or task is scheduled onto it, it is renamed to "ray::<actor or task name>".
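
As an illustration, a minimal sketch that makes this renaming visible with psutil (the Dummy actor, the sleep, and the expected titles are assumptions about typical behavior, not guaranteed output):

import time

import psutil
import ray

def ray_worker_titles():
    # Ray rewrites each worker's process title (visible in ps and
    # /proc/<pid>/cmdline), so we can read it back via psutil.
    titles = []
    for p in psutil.process_iter(["cmdline"]):
        cmd = " ".join(p.info["cmdline"] or [])
        if cmd.startswith("ray::"):
            titles.append(cmd)
    return titles

ray.init(num_cpus=2)
time.sleep(2)                           # give prestarted workers time to come up
print("before:", ray_worker_titles())  # typically shows ray::IDLE entries

@ray.remote
class Dummy:  # hypothetical actor, just to trigger the rename
    def ping(self):
        return "pong"

actor = Dummy.remote()
ray.get(actor.ping.remote())
print("after:", ray_worker_titles())   # one worker should now appear as ray::Dummy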

Did you observe 90% CPU live, or is this a trace captured over some duration? If the latter, it is possible that the tool captures the name ray::IDLE and something else is scheduled on that process afterwards.

DanielChern commented 1 month ago

First of all, thanks for the explanation. This perf trace was captured over 10 seconds of CPU time, but we ran the test many times on different nodes and saw the same result every time.

rkooo567 commented 1 month ago

Is it possible to capture a flamegraph when it uses 90% of the CPUs (maybe using py-spy)?
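
For reference, a minimal sketch of driving py-spy from Python; the pid is a placeholder you would replace with the target ray::ServeReplica pid, and --subprocesses is included because child processes were a concern later in the thread:

import subprocess

pid = 12345  # placeholder: pid of the target ray::ServeReplica process

subprocess.run(
    [
        "py-spy", "record",
        "--pid", str(pid),
        "--duration", "10",        # sample for 10 seconds
        "--subprocesses",          # also profile child processes
        "--output", "serve_replica.svg",
    ],
    check=True,
)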

DanielChern commented 1 month ago

Sure. Here are 10 seconds of py-spy profiles of the ray::ServeReplica processes (you can see that our business logic uses only 28%). Not sure this will reveal anything, since this is just the parent process and I am not sure the profiler counts child threads; I will try to profile the OS threads later if that would help (screenshot attached).

rkooo567 commented 1 month ago

So this is not a ray::IDLE process, right?

I guess it may be because ray::IDLE processes are started as soon as the cluster starts, so they show a long CPU execution time when you aggregate thread run times (so I feel the result is expected).
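
One way to check this hypothesis: a minimal sketch, assuming psutil is available on the node, that compares each ray::IDLE-titled process's cumulative CPU time against its age:

import time

import psutil

for p in psutil.process_iter(["cmdline", "cpu_times", "create_time"]):
    cmd = " ".join(p.info["cmdline"] or [])
    if not cmd.startswith("ray::IDLE"):
        continue
    cpu = p.info["cpu_times"]
    if cpu is None:  # attribute may be unavailable without permissions
        continue
    age = time.time() - p.info["create_time"]
    # A large user+system total on a very old process is consistent with the
    # "long-lived worker accumulates CPU time" explanation above.
    print(f"pid={p.pid} age={age:.0f}s user={cpu.user:.1f}s system={cpu.system:.1f}s")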

DanielChern commented 1 month ago

Upon further investigation, it appears the issue might simply be incorrect naming of the Linux threads. I took a profile of another actor, which downloads images from the internet and then converts/resizes them with the PIL library. In the Linux profile I noticed that ray::IDLE has an internal frame called ImagingResampleHorizontal_8bpc, which is part of the PIL library. While this is only 8% of the profile, it still does not make sense that ray::IDLE would call that function; it looks like part of our actor's business logic. Does that sound reasonable to you? If so, the issue can be closed.

(screenshot attached)

DanielChern commented 1 month ago

OK, so I managed to profile one of the IDLE processes, and it does run our business logic. But that accounts for only 20% of the run time (green box); I cannot tell what the other 80% of the time is (blue box) (screenshot attached).

rkooo567 commented 1 month ago

You can probably use the --native flag to see what is running there. But it does seem like just a naming issue to me.
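
Concretely, a minimal sketch of the earlier py-spy invocation with --native added (the pid is again a placeholder); note that py-spy does not allow combining --native with --subprocesses:

import subprocess

pid = 12345  # placeholder: pid of the worker being investigated

subprocess.run(
    [
        "py-spy", "record",
        "--pid", str(pid),
        "--duration", "10",
        "--native",  # sample C/C++ frames too, e.g. PIL's ImagingResample* routines
        "--output", "worker_native.svg",
    ],
    check=True,
)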

DanielChern commented 1 month ago

OK. In that case, do you want to close the issue, or leave it open so you can take care of the naming part?

anyscalesam commented 2 weeks ago

I think the issue here would be in the perf sched profiler itself, no? So the change would have to be in the Linux profiler itself?

anyscalesam commented 1 week ago

Closing; please reopen if this conclusion is incorrect.