
Core: daemonset feature request #25334

Open sasha-s opened 2 years ago

sasha-s commented 2 years ago

Description

I'd like to have something similar to a daemonset. I want to run a copy of an actor on each node (including nodes started by the autoscaler), and I want to be able to get an actor handle for the representative of my daemonset on a given node.

Use case

This can be useful for doing one-time per-node initialization, as well as for handling shared resources.
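
For concreteness, a rough sketch of the closest approximation available today: enumerate the current nodes and pin a named, detached actor to each one. The class and names are illustrative, and nothing in this sketch reacts to nodes that the autoscaler adds later, which is exactly the gap this request is about.

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote(num_cpus=0)
class NodeDaemon:  # illustrative per-node actor
    def node_id(self):
        return ray.get_runtime_context().get_node_id()

ray.init(address="auto")

# Start (or reuse) one daemon on every node that is currently alive.
daemons = {}
for node in ray.nodes():
    if not node["Alive"]:
        continue
    nid = node["NodeID"]
    daemons[nid] = NodeDaemon.options(
        name=f"node-daemon-{nid}",  # stable, node-scoped name
        lifetime="detached",
        get_if_exists=True,  # makes re-running this loop idempotent
        scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=nid, soft=False),
    ).remote()

# Later: fetch the representative for a given node by its name.
handle = ray.get_actor(f"node-daemon-{nid}")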

@rkooo567

DmitriGekhtman commented 2 years ago

cc @clarng @wuisawesome @simon-mo. It's not completely trivial to get Ray node downscaling to work with an actor daemonset; see this issue about downscaling with Serve: https://github.com/ray-project/ray/issues/25414

Some ideas to avoid this problem:

  1. Only allow daemons that don't use Ray resources (i.e. num_cpus=0 actors).
  2. Modify interfaces such that the autoscaler recognizes "daemon" resource bundles which do not block downscaling.

wuisawesome commented 2 years ago

I'm in favor of (2). @scv119 thoughts?

DmitriGekhtman commented 2 years ago

2': same as 2, but with the idle calculation moved into the GCS.

In principle, we could use 1. as a starting point, but you'd need such daemons to be light in terms of actual resource usage.

tardieu commented 2 years ago

+1

We have the same need in the context of multi-model serving. We would like to run a singleton “controller” actor on each Ray node. This actor maintains a catalog of available models on the node, metrics, some information about remote nodes, etc.

There should be one actor instance per node. Actor instances can be preempted and garbage collected on scale-down, and new actor instances should be allocated on scale-up. The resources required by the actor instance should be reserved on each node (prior to node creation for fresh nodes) so that the actor instance can always be started (or restarted after a crash). This is akin to the resource reservations of placement groups, but accounting for scale-up and scale-down events: the per-node reservation is the same for all current and future nodes, and the total reservation at any point in time is proportional to the number of live nodes.
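
For comparison, a STRICT_SPREAD placement group sized to the current node count is roughly the closest existing construct. A sketch (the class and bundle sizes are illustrative), with the caveat that it only reserves resources on nodes that exist when the group is created and does not follow scale-up or scale-down:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

@ray.remote(num_cpus=1, max_restarts=-1)
class NodeController:  # illustrative per-node controller
    pass

ray.init(address="auto")

# Reserve one bundle per currently-alive node, at most one bundle per node.
alive = [n for n in ray.nodes() if n["Alive"]]
pg = placement_group([{"CPU": 1}] * len(alive), strategy="STRICT_SPREAD")
ray.get(pg.ready())

controllers = [
    NodeController.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(len(alive))
]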

The controller actors should be individually addressable with stable addresses (names) that remain the same if an instance has to be restarted after a crash. This stable address is tied to the node. We do not need the actor instances to be numbered from 1 to n where n is the number of live nodes at this point in time.

In our specific use case, since the actor instance primarily maintains information about the node it is running on, it is OK to lose the in-memory state of this actor on scale-down. However, we would like to be notified, ideally via a system notification sent to one or all of the remaining instances.

DmitriGekhtman commented 2 years ago

There is some practical urgency coming from https://github.com/ray-project/ray/issues/25414.

Ray Serve uses 0.01 of the individual node resource to schedule proxy actors on every node in the cluster:

resources_available {
  key: "node:10.244.0.9"
  value: 0.99
}
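
The request side of such a reservation looks roughly like the following sketch (the proxy class and the IP-based resource name are illustrative; Serve's actual proxy scheduling is internal to Serve):

import ray

@ray.remote
class Proxy:  # stand-in for a Serve proxy actor
    pass

ray.init(address="auto")

# Claiming 0.01 of the node-specific resource pins the actor to that node;
# the remaining 0.99 is what shows up in resources_available above.
proxy = Proxy.options(num_cpus=0, resources={"node:10.244.0.9": 0.01}).remote()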

The issue is that this blocks downscaling.

Referring to the discussion above:

Some ideas to avoid this problem:

  1. Only allow daemons that don't use Ray resources (i.e. num_cpus=0 actors).
  2. Modify interfaces such that the autoscaler recognizes "daemon" resource bundles which do not block downscaling.

The quickest thing to do is to ignore individual node resources in the downscaling considerations.

That would implement 1. using 2.

scv119 commented 2 years ago

So it looks like we have two requirements.

I have a few follow-up questions on the requirements:

  • For daemon actors, do we see the value of scheduling them individually?
  • Do you want to combine the singleton-per-node invariant with other placement constraints (i.e. use it together with placement groups)?
  • Do you want to control the lifetime of those actors differently?

DmitriGekhtman commented 2 years ago

Adding context: we've solved the current Serve-specific problem by using the node affinity scheduling strategy. This method enables basic support for daemon-like actors that don't use resources (and thus don't block downscaling).
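
A minimal sketch of that workaround (the actor class and node choice are illustrative): a zero-resource actor pinned with a hard node affinity constraint, which, per the above, holds no resources that would block downscaling.

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote(num_cpus=0)  # holds no logical resources
class PerNodeProxy:  # placeholder for a Serve-proxy-like daemon
    pass

ray.init(address="auto")
target_node_id = next(n["NodeID"] for n in ray.nodes() if n["Alive"])

proxy = PerNodeProxy.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=target_node_id, soft=False)
).remote()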

DmitriGekhtman commented 2 years ago

  • For daemon actors, do we see the value of scheduling them individually?

What do you mean?

  • Do you want to combine the singleton-per-node invariant with other placement constraints (i.e. use it together with placement groups)?

I imagine many use cases would require both daemons and placement groups, so these things should work well together. The hard part is managing infeasibility, which, if I understand correctly, is not yet solved for placement groups. (Or is it? What's the behavior for placement groups that don't fit?)

  • Do you want to control the lifetime of those actors differently?

There would have to be some auto-restart mechanism for these. For example, if you ray.kill one of these actors, it should be re-created. To get rid of the daemonset, you would call an API that deletes all of the daemon actors and prevents new ones from being scheduled on new Ray nodes.

scv119 commented 2 years ago

  • For daemon actors, do we see the value of scheduling them individually?
  • What do you mean?

Do we want to schedule a daemon actor for only a single node, so that the actor won't block downscaling?

wuisawesome commented 2 years ago

A daemonset implies two things:

  1. The actor can be safely ignored for downscaling purposes.
  2. A replica of the actor is automatically started on all nodes.

In k8s you typically don't interact directly with a pod in a daemonset via k8s APIs; you just talk to some local port/resource and trust that the daemonset pod is configured to be on that port/resource.

I'm not sure if Ray users use this pattern, or if we want to encourage it. (Do people ever want to do this via STRICT_PACK placement groups today?)

tardieu commented 2 years ago
  • Do you want to combine the singleton-per-node invariant with other placement constraints (i.e. use it together with placement groups)?

Yes. Ideally there should be several orthogonal controls.

The placement specification should be more general than one per node and include ways to filter nodes (k8s uses labels) and to specify multiplicity (say, 2 per node to tolerate the failure of an actor instance).

The filtering could be based on custom resources (only place an actor instance on a node with resource X), assuming enough control over the distribution of custom resources (going down the path toward heterogeneous Ray clusters).
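
For illustration, filtering on a custom resource already works at the level of a single actor (the resource name "X" is just an example); the missing piece is the daemonset-style management on top of it:

# On the nodes that should host the daemon, advertise the custom resource,
# e.g. in the node's start command or cluster config:
#   ray start --resources='{"X": 1}'

import ray

@ray.remote(resources={"X": 1})  # only schedulable on nodes advertising "X"
class FilteredDaemon:
    pass

ray.init(address="auto")
daemon = FilteredDaemon.remote()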

tardieu commented 2 years ago
  • Do you want to control the lifetime of those actors differently?

Yes, in the sense that the lifetime should be managed at the group level and not for individual actors.

tardieu commented 2 years ago

For the singleton-per-node use case, it makes sense to access a specific actor by combining an identifier for the daemonset and one for the node, basically a "port" + an "ip". Beyond one actor per node, we need something else, probably a node-local index, which is much easier to maintain than a global index.
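
A hypothetical naming scheme along those lines, using existing named-actor lookup ("model-controller" and the name format are invented for illustration):

import ray

ray.init(address="auto")

daemonset_name = "model-controller"  # the "port" in the analogy
node_id = ray.get_runtime_context().get_node_id()  # the "ip"

# Look up the node-local representative by its stable, node-scoped name.
controller = ray.get_actor(f"{daemonset_name}:{node_id}")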

tardieu commented 2 years ago
  • Do we want to schedule a daemon actor for only a single node, so that the actor won't block downscaling?

The node filtering mechanism I suggested above would cover this scenario when using the existing "node:X.X.X.X" resources.

tardieu commented 2 years ago

  • In k8s you typically don't interact directly with a pod in a daemonset via k8s APIs; you just talk to some local port/resource and trust that the daemonset pod is configured to be on that port/resource.

The failure/health semantics would be important here. If a daemonset is defined but the matching local actor does not exist yet, what happens when invoking the actor synchronously or asynchronously? If the actor fails, what happens to pending invocations, and what happens to invocations while the actor is being reconstructed?

The specifics are important; the existence of a specification is even more important, so that application code can decide which retries are needed and when (vs. relying on retries built into the Ray runtime).

In the serving use case I described earlier, we set max_restarts=-1 on the actor instances in the "daemonset" but do not set max_task_retries: we want the actor to be recreated on failure, but we want pending tasks to fail fast when the actor instance goes down. I can envision scenarios where the instances of a daemonset would instead be configured with both max_restarts=-1 and max_task_retries=-1.
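
Concretely, the two configurations described above differ only in the task-retry setting (class names are illustrative):

import ray

# Recreated after a crash, but in-flight calls fail fast with RayActorError,
# leaving retries to the application.
@ray.remote(max_restarts=-1)
class FailFastController:
    pass

# Recreated after a crash, with pending calls transparently retried by Ray.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class RetryingController:
    pass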