Closed njgerner closed 1 year ago
Hey @njgerner, thanks for the report! Let me try to reproduce and get back to you. In the mean time, can you see if adding pg_timeout_s: [replace with a very large integer]
to scaling_config
in model yaml files helps?
Also, slimming down the aviary image is something that's on our to-do list.
Would like to add, that I encounter the same behaviour - adding pg_timeout_s to the models does not seem to help, as it seems like the instances themself are timing out (before the models). And being stuck in pending phase
If there is a param i can add to aviary-cluster.yml to increase the timeout, that would be great.
Got it, thanks. Will spend some time on this today.
This should be now fixed in 54d0ebbe7f175d6601fdf05c43ca9c0f17bd612a . The docker image was also reduced in size (though it is still rather large). Please reopen the issue if it persists!
Description:
I've been working with the Aviary codebase, successfully deploying the stack and connecting to it following the instructions in the readme. However, I've run into an issue during the model deployment phase.
Specifically, when deploying a model, I'm experiencing consistent failure and restarting of the worker deployment. Upon investigation, it appears that this could be related to the worker downloading the
anyscale/aviary:latest
Docker image, which is based on therayproject/ray-ml:nightly-gpu
image. This image is quite large at 20GB, which seems excessive, and I suspect it's causing a substantial delay during download. This may be exacerbated by AWS IP throttling.The readme documentation doesn't seem to cover this scenario. I have some experience deploying various Ray infra stack items but don't have a ton of insight into how timeouts are handled at different parts of the stack. Is this a known issue? If so, I would appreciate any advice or guidance for resolving it.
Looking forward to your input and suggestions on how to proceed.
Related to:
anyscale/aviary:latest
andrayproject/ray-ml:nightly-gpu
Steps to reproduce:
Expected behavior:
The model deployment should work without the worker deployment failing and restarting repeatedly.
Actual behavior:
The worker deployment fails and restarts repeatedly. This seems to occur when downloading the Docker image
anyscale/aviary:latest
.More Info:
us-east-1
0.0.1
2.4.0