ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Feature] [Serve] Support Sticky Sessions for Stateful Workflows Deployed via Ray Serve #20107

Open klwuibm opened 2 years ago

klwuibm commented 2 years ago


Description

As a data scientist, I would like to use Ray Serve to deploy stateful Ray workflows so that I can take advantage of the auto-scaling of Ray Serve deployments. When a stateful workflow is deployed, the workflow's state may change as incoming requests are processed by the workflow. Hence, incoming requests with the same key/ID should be routed to the same Ray Serve replica that executes the workflow. In other words, the routing of incoming requests to Ray Serve replicas should be sticky and based on a key/ID.

However, the current routing strategy of Ray Serve for incoming requests to a replica is round-robin, i.e., an incoming request can be routed to any Serve replica. 

Use case

As an example, assume a Ray Workflow has been deployed via Ray Serve for a fraudulent credit card transaction detection application. Each workflow is specific to a card number and holds state that changes as new transactions are continuously processed. New transactions are sent to the deployment to flag potentially fraudulent activity, and each transaction should be routed to the Serve replica running the workflow for its card number. This type of application requires low latency and auto-scaling, as incoming requests can be bursty.

Success metric: implementing sticky sessions should reduce response latency and increase concurrent workflow throughput.

Related issues

One potential way to implement this feature is to associate a key/ID with each request and use a hash function to route the request to a Ray Serve replica. Since there tend to be many more distinct keys/IDs among incoming requests than there are Ray Serve replicas, load balancing across the replicas also needs to be addressed.
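For illustration, a minimal sketch of the key-hashing idea (the `replica_handles` list, the `pick_replica` helper, and where the routing would live are all hypothetical, not existing Serve APIs):

```python
import zlib

def pick_replica(key: str, replica_handles: list):
    """Route a request key/ID to one of N replicas by hashing.

    Deterministic for a fixed replica count, so every request with the
    same key lands on the same replica (sticky), while distinct keys
    spread roughly evenly across replicas for load balancing.
    """
    idx = zlib.crc32(key.encode("utf-8")) % len(replica_handles)
    return replica_handles[idx]

# Hypothetical usage inside a router:
#   handle = pick_replica(card_number, replica_handles)
#   result_ref = handle.process_transaction.remote(transaction)
```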

Additionally, since a Ray Serve replica is likely to process requests for workflows with different keys/IDs, some of the workflow states may be evicted from main memory and spilled to disk. As a result, those workflow states would have to be loaded back from disk when they are needed to process a new incoming request.
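To make the spill/reload cost concrete, here is an illustrative, Serve-agnostic sketch of a bounded in-memory state cache that evicts least-recently-used workflow states to disk and reloads them on a miss (all names here are hypothetical):

```python
import os
import pickle
from collections import OrderedDict

class SpillingStateCache:
    """Keep at most `capacity` workflow states in memory; spill the rest to disk."""

    def __init__(self, capacity: int, spill_dir: str = "/tmp/workflow_states"):
        self.capacity = capacity
        self.spill_dir = spill_dir
        self.cache = OrderedDict()  # key -> state, in LRU order
        os.makedirs(spill_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def get(self, key: str):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        # Cache miss: reload the state from disk (the latency this issue worries about).
        with open(self._path(key), "rb") as f:
            state = pickle.load(f)
        self.put(key, state)
        return state

    def put(self, key: str, state):
        self.cache[key] = state
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            old_key, old_state = self.cache.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_state, f)  # spill the least-recently-used state
```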

Are you willing to submit a PR?

edoakes commented 2 years ago

Hey @klwuibm this use case looks really exciting and thanks for the detailed feature request!

This is something we've thought about in the past, but we haven't had any users with strong requirements for it yet. I think getting a basic version of this off the ground would be straightforward, but a "complete" solution would be a decent chunk of work. Specifically, handling scaling the number of replicas up or down would require some further thought, because we'd need a way to repartition keys; we could no longer simply hash and shard across the replicas. Handling rolling upgrades also requires further thought.
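For reference, one common way to reduce the repartitioning problem (mentioned only as background, not as a committed design) is consistent hashing, where adding or removing a replica only remaps the keys that hashed next to it on the ring; a rough sketch:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to replica names. Adding or removing a replica only moves
    the keys that hashed near it on the ring, not the whole keyspace."""

    def __init__(self, replicas=(), vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, replica_name)
        for r in replicas:
            self.add_replica(r)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_replica(self, name: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{name}#{i}"), name))

    def remove_replica(self, name: str):
        self.ring = [(h, n) for h, n in self.ring if n != name]

    def get_replica(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

# e.g. ring = ConsistentHashRing(["replica-0", "replica-1", "replica-2"])
#      ring.get_replica("card-4242")  # same replica until membership changes
```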

For a basic version that doesn't handle scaling or upgrading though, I think all we'd need to do is the following:

How urgent is this for your use case? We are pretty resource-constrained right now so I'm not sure when we'd be able to prioritize this feature. We would be happy to help with a design & shepherd a contribution if you are willing/able to work on it!

ericl commented 2 years ago

@klwuibm how about using the virtual actor feature of workflows for this? That's the intended solution for this class of use cases. Workflows guarantees transactional state updates on virtual workflow actors, so any Serve replica can talk to the virtual actor. Also, virtual actors can launch long-running sub-workflows, so you aren't limited to the actor API: https://docs.ray.io/en/master/workflows/actors.html#long-lived-sub-workflows
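For concreteness, a rough sketch of that suggestion: a Serve deployment that resolves a per-card virtual actor by ID. The call signatures (`@workflow.virtual_actor`, `get_or_create`, `.run_async()`) follow the alpha virtual actor docs linked above and may differ across Ray versions, and the fraud-scoring names are made up:

```python
from ray import serve, workflow

# One virtual actor per card number. Its state is keyed by the actor ID in
# workflow storage, so any Serve replica resolving the same ID sees the same,
# transactionally updated state.
@workflow.virtual_actor
class CardAccount:
    def __init__(self):
        self.history = []

    def score_txn(self, txn):
        self.history.append(txn)
        # ... run fraud logic over self.history ...
        return {"flagged": False}

@serve.deployment
class FraudEndpoint:
    async def __call__(self, request):
        txn = await request.json()
        # get_or_create is keyed on the card number, so it does not matter
        # which Serve replica handles this request.
        account = CardAccount.get_or_create(txn["card_number"])
        return await account.score_txn.run_async(txn)
```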

klwuibm commented 2 years ago

@edoakes It is not urgent urgent; I would say it is probably medium priority for now. @yuanchi2807 and I are creating some example use cases to experiment with and better understand the interplay of Serve and Workflow virtual actors, and the performance implications.

Yes, we could look into this feature and work with you guys.

@ericl Thanks for the pointers. I will play with it.

yuanchi2807 commented 2 years ago

Conceptually, is a replica mapped to a deployed instance one-to-one, or does a replica manage multiple deployed instances, i.e., one-to-many?

edoakes commented 2 years ago

One deployment => multiple physical replicas

yuanchi2807 commented 2 years ago

One deployment => multiple physical replicas

Thanks. Confirming one replica => one Ray actor and it is registered with a local node router as in https://github.com/ray-project/ray/blob/master/python/ray/serve/replica.py#L182

edoakes commented 2 years ago

Yes, one replica maps to one Ray actor. There is no per-node router; communication is directly actor<->actor.

klwuibm commented 2 years ago

Based on today's discussions with @jiaodong and @iycheng, we think that to meet the requirements of our use cases, instead of using replicas we should create multiple deployments, each with a unique ID. Requests with the same ID will be processed by the same deployment. In other words, the sticky sessions will be at the deployment level, not at the replica level. Also, in order to have a meaningful performance advantage, we need the virtual actors in our workflow to have caching capability, i.e., virtual actor states should live in the plasma store instead of on disk. Until Workflows has such a feature, we should probably table this requirement for the time being. FYI: @yuanchi2807 @edoakes
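A rough sketch of that per-ID deployment direction, using the Serve API that was current at the time (`@serve.deployment`, `.options(name=...).deploy()`, `serve.get_deployment(...)`); the naming scheme and the `get_or_create_handle` helper are hypothetical, and this ignores cleanup of idle deployments:

```python
from ray import serve

@serve.deployment
class CardWorkflow:
    """One Serve deployment per card number; it owns that card's workflow state."""

    def __init__(self, card_id: str):
        self.card_id = card_id
        self.state = {}

    async def __call__(self, request):
        txn = await request.json()
        # ... update self.state and score the transaction ...
        return {"card_id": self.card_id, "flagged": False}

def get_or_create_handle(card_id: str):
    name = f"card-workflow-{card_id}"  # hypothetical per-ID naming scheme
    try:
        return serve.get_deployment(name).get_handle()
    except KeyError:
        # First request for this ID: deploy a dedicated deployment for it.
        CardWorkflow.options(name=name).deploy(card_id)
        return serve.get_deployment(name).get_handle()
```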