Issue on @serve.deployment class with FastAPI deployment and module imports

davidberenstein1957 commented 3 years ago

These instructions do not seem to work for ray[serve] or ray[default] @1.3.0. Also, ray==2.0.0.dev0 has problems.

davidberenstein1957 commented 3 years ago

Ohh sorry, I see, you seem to be working on updates currently. Please send a request if you are looking for more context.

edoakes commented 3 years ago

Hey @davidberenstein1957 you're right that we're in the midst of some big updates! Could you say more about the problems you ran into on the nightly wheels (2.0.0.dev0)?

davidberenstein1957 commented 3 years ago

So, I was trying to serve my FastAPI application following /serve/tutorials/web-server-integration.html on 2.0.0.dev0. I really love the new interface compared to the old one. However, for FastAPI specifically, the requests are already parsed by the FastAPI decorators, so request.body() doesn't need to be there. Similarly, the asynchronous FastAPI endpoints don't need to be awaited anymore. Also, I seem to be required to use ray.get(response) to obtain the response from the Ray Serve handle.

Furthermore, the FastAPI app.on_event('startup') hook doesn't log or report errors, so if anything goes wrong it is very difficult to debug, and people might blame Ray for this flaw. I think you could add a try-except statement to the example to avoid this.
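To make the try-except suggestion concrete, here is a minimal sketch using only the standard library. The decorator name `log_startup_errors` and the logger name are made up for illustration, and the FastAPI wiring is shown only as a comment because it assumes an existing `app` object.

```python
import asyncio
import functools
import logging

logger = logging.getLogger("startup")


def log_startup_errors(func):
    """Wrap an async startup hook so failures are logged before propagating."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Exception:
            # logger.exception records the message plus the full traceback.
            logger.exception("Startup hook %s failed", func.__name__)
            raise

    return wrapper


# Intended use (assumes a FastAPI `app` object exists):
#
#     @app.on_event("startup")
#     @log_startup_errors
#     async def startup():
#         ...  # e.g. connect to Ray and deploy the models


@log_startup_errors
async def failing_startup():
    # Stand-in for a startup hook that breaks while loading a model.
    raise RuntimeError("model failed to load")


try:
    asyncio.run(failing_startup())
except RuntimeError as exc:
    caught = str(exc)
```

Re-raising after logging keeps the failure visible to the server process while still leaving a traceback in the application log.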

davidberenstein1957 commented 3 years ago

I am currently experiencing something similar to the following issues: https://github.com/ray-project/ray/issues/8419 https://github.com/ray-project/ray/issues/3116

However, I am applying @serve.deployment to a class which is rather complex and inherits from other classes, so it is inconvenient to have to create a setup.py file as suggested in 3116. Adding the import within the Ray remote function (8419) is not an option either, because I use @serve.deployment on a class that inherits from another class.

davidberenstein1957 commented 3 years ago

Also, when using the main approach suggested here, it does connect, but I get the error No module named 'transformers', i.e. it is unable to find the transformers package.

edoakes commented 3 years ago

This is really great feedback @davidberenstein1957 :) do you think you could provide a short code sample of exactly what you're trying to do? Off the top of my head, one thing you could do is make the deployment class a pretty simple wrapper of the actual underlying class that you're serving:

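from ray import serve

@serve.deployment
class Wrapper:
    def __init__(self, *args):
        # Import inside __init__ so the module is only needed when the replica is constructed.
        from my_module import MyActualImplementationClass
        self._wrapped = MyActualImplementationClass(*args)

    def handle_request(self):
        # Delegate to the wrapped implementation.
        return self._wrapped.handle_request()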
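The same deferred-import idea can be tried without Ray at all. The sketch below is only an illustration: `LazyWrapper` and its `call` method are hypothetical names, and `collections.Counter` stands in for the real implementation class.

```python
import importlib


class LazyWrapper:
    """Hypothetical helper: resolve the wrapped class by name at construction time."""

    def __init__(self, module_name, class_name, *args, **kwargs):
        # The import runs where the instance is constructed,
        # not where this wrapper class is defined.
        module = importlib.import_module(module_name)
        self._wrapped = getattr(module, class_name)(*args, **kwargs)

    def call(self, method_name, *args, **kwargs):
        # Forward a method call to the wrapped object.
        return getattr(self._wrapped, method_name)(*args, **kwargs)


# collections.Counter is used here purely as a stand-in implementation class.
wrapper = LazyWrapper("collections", "Counter", "abca")
most_common = wrapper.call("most_common", 1)
```

Note that the module still has to be importable in the process that constructs the instance; deferring the import only moves where that requirement applies.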

davidberenstein1957 commented 3 years ago

Awesome, I will try this tomorrow during less ungodly work hours. But here you can find my cluster config (ai-dev_cluster.yaml, mostly the same as example_cluster.yaml), which is deployed within my Kubernetes cluster in a namespace. Within that same cluster and namespace, I am trying to deploy a FastAPI application that offloads the heavy stuff to transformers deployed with Ray Serve (main.py + requirements.txt). For convenience, I have now replaced my SentimentAnalyzers with the standard GPT2 class from the FastAPI deployment example on your website. During application startup, the application fails when calling GPT2.deploy() on line 73. I get the error No module named 'transformers'. Running a local Ray cluster via ray start --head or via from ray.cluster_utils import Cluster doesn't cause any issues; within the Kubernetes cluster, however, it does. requirements.txt main.py.txt ai-dev_cluster.yaml.txt

davidberenstein1957 commented 3 years ago

By the way, I am using tiangolo/uvicorn-gunicorn-fastapi:python3.8-slim-2020-12-19 as Docker image to deploy my API to Kubernetes.

davidberenstein1957 commented 3 years ago

(pid=672, ip=10.240.0.185)   File "python/ray/_raylet.pyx", line 500, in ray._raylet.execute_task
(pid=672, ip=10.240.0.185)   File "python/ray/_raylet.pyx", line 447, in ray._raylet.execute_task.function_executor
(pid=672, ip=10.240.0.185)   File "python/ray/_raylet.pyx", line 1657, in ray._raylet.CoreWorker.run_async_func_in_event_loop
(pid=672, ip=10.240.0.185)   File "/home/ray/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 432, in result
(pid=672, ip=10.240.0.185)     return self.__get_result()
(pid=672, ip=10.240.0.185)   File "/home/ray/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
(pid=672, ip=10.240.0.185)     raise self._exception
(pid=672, ip=10.240.0.185)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/backend_worker.py", line 71, in __init__
(pid=672, ip=10.240.0.185)     await sync_to_async(_callable.__init__)(*init_args)
(pid=672, ip=10.240.0.185)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/async_compat.py", line 29, in wrapper
(pid=672, ip=10.240.0.185)     return func(*args, **kwargs)
(pid=672, ip=10.240.0.185)   File "./main.py", line 52, in __init__
(pid=672, ip=10.240.0.185) ModuleNotFoundError: No module named 'functional'

davidberenstein1957 commented 3 years ago

I feel that the issues might have something to do with https://docs.ray.io/en/master/cluster/commands.html#synchronizing-files-from-the-cluster-ray-rsync-up-down; however, I would expect this to happen automatically when serving a specific application. Also, calling os.system('pip install transformers') within the wrapper fixed the ModuleNotFoundError: No module named 'transformers', but I don't think this is the intended way of fixing this issue?
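Before installing anything at runtime, it can help to check which modules an interpreter can actually see. The following is a small standard-library sketch; `missing_modules` is a made-up helper name, and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "transformers" and "functional" are the modules named in the errors above;
# "json" is included only as a module that should always be present.
report = missing_modules(["json", "transformers", "functional"])
print("Missing on this interpreter:", report)
```

Run locally, this only describes the local environment. Presumably the same function could be executed as a Ray task to inspect a worker's environment instead, but that wiring is not shown here.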

davidberenstein1957 commented 3 years ago

Within https://github.com/ray-project/ray/tree/master/python/ray/autoscaler I can see some examples of .yml pipeline files that actually include setup_commands and file sharing interfaces, but ideally I want to upload files from my separate FastAPI pods.

edoakes commented 3 years ago

@davidberenstein1957 we're currently working on improving support for specifying dependencies for a given Serve deployment, but right now the best practice would be to build all of the requirements into the Docker image you use for the Ray cluster. That way the packages are available to all of the worker processes on your cluster. Would that work for you?

davidberenstein1957 commented 3 years ago

I could make this work for me; however, this does not fix my relative import issues, which means I would also have to add my entire application to the Ray Docker image. Ideally, something like the CommandRunner would be great, where I would first call ray.init() and afterwards could execute commands on the cluster like ray rsync-up ./functional ./functional and ray exec pip install transformers. But I will also check whether I can initialize the class, add it to shared storage via ray.put(), and retrieve it in the wrapper using ray.get().

edoakes commented 3 years ago

We are currently working on better supporting dynamic environments (RFC here), but for now one thing you could do is use the named conda env support.

The workflow here would be to install a conda env on the cluster with a specific name using ray exec or some other means, then in your Serve deployment you specify that env as the one for the actors to run in:

@serve.deployment(ray_actor_options={"runtime_env": {"conda": "my_conda_env_name"}})
class Deployment:
    ...

edoakes commented 3 years ago

Actually one option for installing the env would be to do it using Ray tasks!

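import subprocess
import tempfile

@ray.remote
def install_env(env_yaml):
    with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as f:
        f.write(env_yaml)  # write env_yaml to a temp file
    subprocess.check_output(["conda", "env", "create", "-f", f.name])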

You could schedule this to run on every node doing something like this:

refs = []
for node in ray.nodes():
    node_id = node["NodeManagerAddress"]
    node_resource = f"node:{node_id}"
    refs.append(install_env.options(resources={node_resource: 0.001}).remote(env_yaml))

ray.get(refs)
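One detail worth considering in the loop above is skipping nodes that are no longer alive. Here is a minimal sketch of just the label-building step, written so it can be tried without a cluster. It assumes each entry carries an `Alive` flag next to `NodeManagerAddress`, as the entries returned by ray.nodes() appear to; the helper name and the sample addresses are made up.

```python
def node_resource_labels(nodes):
    """Build 'node:<ip>' resource labels for the alive nodes in a ray.nodes()-style list."""
    return [
        f"node:{node['NodeManagerAddress']}"
        for node in nodes
        if node.get("Alive", False)
    ]


# Hand-written stand-in for the output of ray.nodes(); the addresses are made up.
sample_nodes = [
    {"NodeManagerAddress": "10.0.0.1", "Alive": True},
    {"NodeManagerAddress": "10.0.0.2", "Alive": False},
    {"NodeManagerAddress": "10.0.0.3", "Alive": True},
]
labels = node_resource_labels(sample_nodes)
```

Each label could then be passed as the key of the `resources` dict in the `.options(...)` call shown above.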

davidberenstein1957 commented 3 years ago

I just had everything up and running with the separate Docker image approach, but after 15 minutes I got the following error message: ray.util.client.dataclient - INFO - Server disconnected from data channel. Also, the cluster doesn't seem to autoscale when it needs additional resources, even though the config allows it to scale to more workers.

edoakes commented 3 years ago

@AmeerHajAli is the above the same issue that we recently addressed by adding gRPC keepalives to the Ray client?

Also, @davidberenstein1957 could you share the logs from the Serve controller (it should print some messages saying that deployments are pending startup) and the autoscaler logs? That should help diagnose the issue!

davidberenstein1957 commented 3 years ago

Hello, thanks for the great support by the way! I hope this is enough context.

My set-up. kubernetes cluster -> ray operator cluster. Note, the operator was not able to use my custom docker image so it is running an image without the installation of transformers and pytorch. ray.cluster.yml.txt ray.operator.yml.txt

4 ML FastAPI Microservices from complex to simple (Spacy, Classification, Wordembeddings, Sentiment). I am working my way to ray integration from the simplest service starting with the Sentiment service, which offers some transfer-learned huggingface Transformers. I tried deploying 3 Dutch versions and 1 English version via the following FastAPI set-up using 0.5 CPU per deployment. deployment.py.txt

The process starts working and my dashboard shows how the models are loaded on the head node and redistributed over the worker nodes. However, the head node clogs up and ends up using too many resources, and then the FastAPI connection dies due to the scaling not working. Also, when deploying fewer models, the models end up remaining on the head node, which seems a bit weird to me. I would expect them to be moved to the less crucial worker nodes. Screenshot_2021-05-13 Ray Dashboard head_fail_logs.txt operator_logs.txt

davidberenstein1957 commented 3 years ago

@edoakes, @AmeerHajAli I found some insights into the issue, but it might be a difficult fix from your side, i.e., it seems like a Kubernetes and/or config issue on my side.

So, our kubernetes cluster is using X CPU and 2 GPU. Within my pipeline.yml for the deployment of the Ray cluster, and within my pipeline.yml for the deployment of my sentiment Microservice, I did specifically assign GPU resources via taint. So, when deploying the BERT transformer models with "@serve.deployment(num_replicas=2)", they were still initialized on the GPUs within our cluster, meaning that the Ray autoscaler seemed to be in a 'mismatch' with the actual GPU/CPU-usage, resulting in an error. Or could this be fixed by using the rayproject/ray:nightly-gpu image? It does not give an error after deploying the models with "@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.25})".
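For a rough sense of the numbers in that last decorator, here is a back-of-the-envelope sketch. It assumes that Ray only counts GPU usage for replicas that declare num_gpus and that the cluster exposes the 2 GPUs mentioned above; `replicas_that_fit` is a made-up helper.

```python
from fractions import Fraction


def replicas_that_fit(total_gpus, gpus_per_replica):
    """Number of replicas that fit when each one reserves a fractional GPU."""
    # Fractions avoid floating-point surprises with values such as 0.1.
    return int(Fraction(str(total_gpus)) // Fraction(str(gpus_per_replica)))


# num_replicas=2 with num_gpus=0.25 each reserves half a GPU in total.
reserved = 2 * Fraction("0.25")
# With 2 GPUs available, up to 8 such replicas would fit.
capacity = replicas_that_fit(2, 0.25)
```

Under those assumptions, declaring the fractional GPU gives the scheduler something to account for, which may be why the mismatch disappears.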

AmeerHajAli commented 3 years ago

cc @DmitriGekhtman @ijrsvt

davidberenstein1957 commented 3 years ago

I am getting an autoscaler error within my app.py file when resources are limited, even though the config allows scaling. autoscaler-error-log.txt

DmitriGekhtman commented 3 years ago

Thanks for the logs. That's interesting... evidently, the autoscaling process got a SIGTERM signal.

DmitriGekhtman commented 3 years ago

Or is it actually the main operator process that got cut off?

I missed some of the above context, but is it correct that this is happening when the Ray head gets overloaded?

When this happens, does the operator deployment indicate that there were restarts (kubectl -n <namespace> get deployment ray-operator)? If not, could you share the operator pod's logs (kubectl -n <namespace> logs ray-operator-xxxx)?

My interpretation is that the Ray head pod is getting overloaded (don't know why that is happening). Then perhaps the head gets killed by Kubernetes leading to an (expected) autoscaler failure and then an (unexpected) operator failure.

davidberenstein1957 commented 3 years ago

I tested 3 set-ups with different images for the operator pods and workers/heads (nightly and release==1.3.0). When connecting, I used the same Ray version for the API as I did for the worker/head. Also, I used the release notation for the Serve deployment and not the @serve.deployment() one.

Custom Docker image as suggested by @edoakes: DockerfileCPU.txt

Deployment YAML: ray.cluster.yml.txt ray.dashboard.yml.txt ray.operator.yml.txt

API startup: main.py.txt

Operator pod with nightly, worker/head with release: dashboard and head node services are available, but it won't scale after adding too many resources. This also doesn't change when changing the cluster.yml. api-logs.txt head-logs.txt operator-logs.txt
Operator pod with release, worker/head with release: only the head node service is available, and it won't connect. operator-logs.txt api-logs.txt (the head node didn't have any logs).
Operator pod with nightly, worker/head with nightly: dashboard and head node services are available, but it won't scale after adding too many resources. This also doesn't change when changing the cluster.yml. api-logs.txt head-logs.txt operator-logs.txt

davidberenstein1957 commented 3 years ago

The Ray head does get overloaded when using the @serve.deployment decorator, by the way, but I have only found time to test with the old notation because I also wanted to test the release images.

davidberenstein1957 commented 3 years ago

Also, when using the @serve.deployment decorator along with the nightly image for the operator, head and workers, the deployment failures get in an infinite loop resulting in the logs shown underneath. With the serve.create_backend() and serve.create_endpoint() approach, this does not happen.

2021-05-28 05:23:06,394    INFO backend_state.py:773 -- Adding 1 replicas to backend 'DutchSentiment'.
2021-05-28 05:23:06,625    ERROR controller.py:121 -- Exception updating backend state: Failed to look up actor with name 'HSzhnk:SERVE_CONTROLLER_ACTOR:DutchSentiment#gCitXs'. You are either trying to look up a named actor you didn't create, the named actor died, or the actor hasn't been created because named actor creation is asynchronous.
2021-05-28 05:23:06,833    WARNING backend_state.py:864 -- Replica DutchSentiment#gCitXs of backend DutchSentiment failed health check, stopping it.
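When the loop produces a lot of output, filtering the controller log down to warnings and errors makes the repeating pattern easier to spot. The sketch below assumes the "timestamp, level, file:line, --, message" layout seen in the lines above; the regex and the `problems` helper are made up for illustration.

```python
import re

# Matches lines of the form: "<date> <time>,<ms>  LEVEL file.py:123 -- message".
LOG_LINE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<source>\S+:\d+) -- (?P<message>.*)"
)


def problems(lines):
    """Return (level, source, message) for every WARNING or ERROR line."""
    found = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match["level"] in {"WARNING", "ERROR"}:
            found.append((match["level"], match["source"], match["message"]))
    return found


sample = [
    "2021-05-28 05:23:06,394    INFO backend_state.py:773 -- Adding 1 replicas to backend 'DutchSentiment'.",
    "2021-05-28 05:23:06,833    WARNING backend_state.py:864 -- Replica DutchSentiment#gCitXs of backend DutchSentiment failed health check, stopping it.",
]
flagged = problems(sample)
```

Because the pattern is applied with `search`, it also tolerates extra characters in front of the timestamp.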

edoakes commented 1 year ago

Stale, reopen if still an issue

ray-project / ray

Issue on @serve.deployment class with FastAPI deployment and module imports #15632