Repeated Failure and Restarting of Worker Deployment

njgerner commented 1 year ago

Description:

I've been working with the Aviary codebase, successfully deploying the stack and connecting to it following the instructions in the readme. However, I've run into an issue during the model deployment phase.

Specifically, when deploying a model, I'm experiencing consistent failure and restarting of the worker deployment. Upon investigation, it appears that this could be related to the worker downloading the anyscale/aviary:latest Docker image, which is based on the rayproject/ray-ml:nightly-gpu image. This image is quite large at 20GB, which seems excessive, and I suspect it's causing a substantial delay during download. This may be exacerbated by AWS IP throttling.

The readme documentation doesn't seem to cover this scenario. I have some experience deploying various Ray infra stack items but don't have a ton of insight into how timeouts are handled at different parts of the stack. Is this a known issue? If so, I would appreciate any advice or guidance for resolving it.

Looking forward to your input and suggestions on how to proceed.

Related to:

Model deployment process from head server
Docker images: anyscale/aviary:latest and rayproject/ray-ml:nightly-gpu
AWS IP throttling (I think???)

Steps to reproduce:

Clone and install aviary repo as per the readme instructions.
Follow steps for deploying the stack and connecting to it.
Deploy a model.

Expected behavior:

The model deployment should work without the worker deployment failing and restarting repeatedly.

Actual behavior:

The worker deployment fails and restarts repeatedly. This seems to occur when downloading the Docker image anyscale/aviary:latest.

More Info:

AWS: us-east-1
Aviary version: 0.0.1
Ray version: 2.4.0

Yard1 commented 1 year ago

Hey @njgerner, thanks for the report! Let me try to reproduce and get back to you. In the mean time, can you see if adding pg_timeout_s: [replace with a very large integer] to scaling_config in model yaml files helps?

Yard1 commented 1 year ago

Also, slimming down the aviary image is something that's on our to-do list.

PicoCreator commented 1 year ago

Would like to add, that I encounter the same behaviour - adding pg_timeout_s to the models does not seem to help, as it seems like the instances themself are timing out (before the models). And being stuck in pending phase

If there is a param i can add to aviary-cluster.yml to increase the timeout, that would be great.

Yard1 commented 1 year ago

Got it, thanks. Will spend some time on this today.

Yard1 commented 1 year ago

This should be now fixed in 54d0ebbe7f175d6601fdf05c43ca9c0f17bd612a . The docker image was also reduced in size (though it is still rather large). Please reopen the issue if it persists!

ray-project / ray-llm

Repeated Failure and Restarting of Worker Deployment #5