ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0
1.2k stars 87 forks

[BUG] workers do not launch on g5.12xlarges for the latest image 0.5.0. #125

Open JGSweets opened 5 months ago

JGSweets commented 5 months ago

I'm stuck in a repeated deployment loop when using the image anyscale/ray-llm:latest on a g5.12xlarge instance. The workers never connect back to the head node, which leads me to believe something fails when the Docker image is deployed. I didn't notice any error logs reported on the head node during deployment.

This causes the workers to be deployed and shut down in a loop. Possibly due to the CUDA updates, but I'm not 100% sure.

anyscale/ray-llm:0.4.0 launches as expected with no configuration changes.
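Since 0.4.0 is the last image that deploys cleanly, one interim workaround is to pin it in the cluster config's `docker` section. A sketch only; the keys follow the standard Ray cluster-launcher schema, and the container name is an assumed example:

```yaml
# Ray cluster-launcher config (fragment): pin the last known-good image.
docker:
    image: "anyscale/ray-llm:0.4.0"   # 0.5.0+ workers never become healthy on g5.12xlarge
    container_name: "ray_container"   # assumed name, adjust to your setup
```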

sihanwang41 commented 5 months ago

Hi, please provide repro steps if possible, so that our team can take a look!

JGSweets commented 5 months ago
  1. Update the config to match the requirements of my AWS environment:
    • security groups
    • region
    • updated `gpu_worker_g5` to include CPU and GPU values
  2. Deploy via `ray up`.
  3. Attach with `ray attach`.
  4. Run `rayllm run --model models/continuous_batching/amazon--LightGPT.yaml`.
    • the deployment loops continuously.

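For reference, the steps above amount to roughly the following command sequence (`ray up`/`ray attach` are the standard Ray cluster-launcher commands; the cluster config filename is an assumed placeholder for the edited config):

```shell
# Launch the cluster from the edited config (SGs, region, gpu_worker_g5 changes),
# then attach to the head node.
ray up my-cluster.yaml
ray attach my-cluster.yaml

# On the head node: serve the model. With ray-llm 0.5.0 the g5.12xlarge workers
# cycle between DEPLOYING and shutdown instead of becoming healthy.
rayllm run --model models/continuous_batching/amazon--LightGPT.yaml
```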
JGSweets commented 5 months ago

I don't believe the AMI has the drivers installed for CUDA 12. Could that be the issue?
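That theory is easy to check: `nvidia-smi` reports the installed driver version, and CUDA 12 containers require (per NVIDIA's compatibility tables) at least driver 525.60.13 on Linux, while older Deep Learning AMIs often ship 470/515-series drivers. A sketch with an assumed helper (`ver_ge` is not part of ray-llm) and a sample driver value:

```shell
# ver_ge A B — true if version A >= version B (uses GNU sort -V)
ver_ge() { [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]; }

# On a real node, get the value with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
driver="470.199.02"   # sample value from an older AMI

if ver_ge "$driver" "525.60.13"; then
  echo "driver supports CUDA 12 containers"
else
  echo "driver too old for CUDA 12; pin ray-llm 0.4.0 (CUDA 11) or upgrade the AMI"
fi
```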

JGSweets commented 5 months ago

@sihanwang41 any update on investigating this issue?

JGSweets commented 5 months ago

FWIW, ray-llm is not deployable in its current state with images >= 0.5.0. This is not limited to g5.12xlarges.

SamComber commented 4 months ago

+1 on this. I'm having to use 0.4.0; with 0.5.0 the deployment gets stuck in a DEPLOYING loop. @JGSweets thanks for your comment, it got me up and running.