ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Serve often requires Ray restart to clean state #23043

Closed worldveil closed 2 years ago

worldveil commented 2 years ago

In debugging a script like this:

import ray
from fastapi import FastAPI
from ray import serve
from fastapi.responses import JSONResponse

# from nlp import DistilBERTSentimentModel, SentimentTwitterRobertaModel

app = FastAPI()
ray.init(ignore_reinit_error=True)  # address="auto"
serve.start(detached=True)

@serve.deployment(route_prefix="/predict")
@serve.ingress(app)
class MyFastAPIDeployment:
    def __init__(self):
        from nlp import DistilBERTSentimentModel, SentimentTwitterRobertaModel
        self.model_twitter_roberta = SentimentTwitterRobertaModel()
        self.model_distilBERT = DistilBERTSentimentModel()

    @app.get("/")
    def root(self, txt: str):
        # get results from twitter model
        twitter_model_results = self.model_twitter_roberta.predict(txt)
        twitter_label = max(twitter_model_results)[1]

        # get results from distilbert model
        distilBERT_model_results = self.model_distilBERT.predict(txt)
        bert_label = max(distilBERT_model_results)[1]

        # combine and return
        content = {"twitter": twitter_label, "distilBERT": bert_label}
        return JSONResponse(content=content)

MyFastAPIDeployment.deploy()

(I moved the import into the actor's __init__, so if the import fails, it fails when the replica is constructed.)

I found debugging this pretty painful. I had to run this line over and over to test:

ray stop && ray start --head && sleep 3 && python ray_serve_script.py

because often running

serve.shutdown()

wouldn't clean up processes, and the IPython terminal would keep filling up with error messages that clobbered stdout. Without restarting Ray, I would see the autoscaler failing to schedule the deployments, like this:

(scheduler +4m19s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(ServeController pid=25995) 2022-03-09 18:14:54,676     WARNING deployment_state.py:1117 -- Deployment 'MyFastAPIDeployment' has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {'CPU': 1}, resources available: {}. component=serve deployment=MyFastAPIDeployment
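
The ServeController warning above already states what each replica needs and what the cluster reports as free. As a rough sketch, the two resource dictionaries can be pulled out of such a line and compared. The helper names `parse_replica_resources` and `missing_resources` are hypothetical, and the pattern assumes the message wording seen in this one log line, which may differ across Ray versions.

```python
import ast
import re

# Pattern for the two resource dicts in the ServeController warning quoted above.
# Assumption: the wording matches this one log line; other Ray versions may differ.
_RESOURCES = re.compile(
    r"Resources required for each replica: (\{.*?\}), "
    r"resources available: (\{.*?\})"
)


def parse_replica_resources(log_line):
    """Return (required, available) dicts, or None if the line does not match."""
    match = _RESOURCES.search(log_line)
    if match is None:
        return None
    required = ast.literal_eval(match.group(1))
    available = ast.literal_eval(match.group(2))
    return required, available


def missing_resources(required, available):
    """Return the shortfall for each resource that one replica cannot get."""
    return {
        name: amount - available.get(name, 0)
        for name, amount in required.items()
        if available.get(name, 0) < amount
    }


line = (
    "Deployment 'MyFastAPIDeployment' has 1 replicas that have taken more than 30s "
    "to be scheduled. Resources required for each replica: {'CPU': 1}, "
    "resources available: {}. component=serve deployment=MyFastAPIDeployment"
)
required, available = parse_replica_resources(line)
print(missing_resources(required, available))  # {'CPU': 1}
```

An empty "resources available" dictionary, as in the log above, means the replica is short by its full CPU request, which fits the scheduler warning that all cluster resources may be claimed by existing actors.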

I'm not sure what the root cause is, but Serve shouldn't require a full restart of Ray or the head node to debug locally.
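
For reference, the restart workaround from earlier in this comment can be sketched as a small Python helper. It runs the same commands as the shell one-liner (`ray stop`, `ray start --head`, a 3-second pause, then the script) and stops at the first failing step, as `&&` does. The function name `restart_and_run` and its `run` and `sleep` parameters are hypothetical, added only so the steps can be swapped out. This only automates the workaround and does not address the underlying cleanup problem.

```python
import subprocess
import time

# Same steps as the shell one-liner in this comment.
RESTART_STEPS = [
    ["ray", "stop"],
    ["ray", "start", "--head"],
]


def restart_and_run(script, steps=RESTART_STEPS, settle_seconds=3,
                    run=subprocess.run, sleep=time.sleep):
    """Run each step in order, stopping at the first failure like `&&` does.

    Returns True only if every step and the script itself exit with code 0.
    """
    for command in steps:
        if run(command).returncode != 0:
            return False
    sleep(settle_seconds)  # mirrors `sleep 3` in the original command
    return run(["python", script]).returncode == 0
```

Calling `restart_and_run("ray_serve_script.py")` would reproduce the original command sequence, assuming the `ray` and `python` executables are on the PATH.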

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!