JGSweets opened 5 months ago
Hi, please provide repro steps if possible, so that our team can help take a look!
The `gpu_worker_g5` node type was configured to include CPU and GPU values, and the model was launched with `rayllm run --model models/continuous_batching/amazon--LightGPT.yaml`.
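For reference, a node-type entry of this shape in a Ray cluster launcher YAML is one way to set explicit CPU and GPU values. This is a sketch only: the `gpu_worker_g5` name comes from this thread, while the instance type and resource counts are assumptions based on a g5.12xlarge (48 vCPUs, 4 A10G GPUs).

```yaml
# Sketch of a cluster-launcher node type with explicit resource values.
# Field names follow the Ray cluster launcher YAML schema; the numbers
# are assumptions for a g5.12xlarge (48 vCPUs, 4 x A10G GPUs).
available_node_types:
  gpu_worker_g5:
    node_config:
      InstanceType: g5.12xlarge
    resources:
      CPU: 48
      GPU: 4
    min_workers: 0
    max_workers: 1
```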
I don't believe the AMI has the drivers installed for CUDA 12. Could that be the issue?
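One way to sanity-check the driver hypothesis is to compare the AMI's reported driver version (from `nvidia-smi`) against the minimum that CUDA 12 requires. A minimal sketch: the 525.60.13 threshold comes from NVIDIA's published CUDA compatibility table, and the version strings in the examples are hypothetical placeholders.

```python
# Compare an NVIDIA driver version string against the minimum CUDA 12 needs.
# CUDA 12.0 requires a Linux driver >= 525.60.13 per NVIDIA's compatibility table.
def driver_supports_cuda12(driver_version: str, minimum: str = "525.60.13") -> bool:
    """Return True if driver_version >= minimum, comparing components numerically."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(driver_version) >= parse(minimum)

# Hypothetical example versions, not taken from this thread:
print(driver_supports_cuda12("470.182.03"))  # a CUDA-11-era driver -> False
print(driver_supports_cuda12("535.104.05"))  # a CUDA-12-capable driver -> True
```

If the check fails for the driver the AMI ships, a CUDA 12 container image would not be able to initialize the GPUs, which is consistent with workers dying silently.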
@sihanwang41 any update on investigating this issue?
FWIW, ray-llm is not deployable in its current state on images >= 0.5.0. This is not limited to g5.12xlarge instances.
+1 on this. I'm having to use 0.4.0; with 0.5.0 the deployment gets stuck in a DEPLOYING loop. @JGSweets, thanks for your comment, it got me up and running.
I'm stuck in a repeated deployment loop when using the image `anyscale/ray-llm:latest` on a g5.12xlarge instance. The worker never connects back to the head node, which leads me to believe the Docker image fails during deployment, but I didn't notice any error logs reported to the head node. The cluster repeatedly deploys and shuts down workers. Possibly due to the CUDA updates, but I'm not 100% sure. `anyscale/ray-llm:0.4.0` launches as expected with no configuration changes.
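Given the reports above, a workaround is to pin the known-good tag instead of `latest` in the cluster config's `docker` section. A sketch only; field names follow the Ray cluster launcher YAML schema, and the container name is a hypothetical choice.

```yaml
# Pin the known-good image instead of the moving :latest tag.
docker:
  image: "anyscale/ray-llm:0.4.0"
  container_name: "ray_llm"   # hypothetical name
  run_options:
    - --gpus all   # expose the instance's GPUs to the container
```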