golemsentience opened this issue 2 months ago:
I'm using vLLM as the serving engine and running inference through Ray Serve. Here is the sample script I started from: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
I then wrap it as a Ray Serve deployment like this:
```python
@serve.deployment  # (num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMPredictDeployment:
    def __init__(self, **kwargs):
        ...
```
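The rest of the wrapper looks roughly like the sketch below. The model name, route, and sampling defaults are placeholders, and the engine calls follow my reading of vLLM's `AsyncLLMEngine` API, so treat it as approximate rather than a drop-in:

```python
from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMPredictDeployment:
    def __init__(self, **kwargs):
        # kwargs are forwarded to the engine, e.g. model="facebook/opt-125m"
        args = AsyncEngineArgs(**kwargs)
        self.engine = AsyncLLMEngine.from_engine_args(args)

    @app.post("/generate")
    async def generate(self, prompt: str, max_tokens: int = 128) -> dict:
        sampling_params = SamplingParams(max_tokens=max_tokens)
        request_id = random_uuid()
        final_output = None
        # The async engine yields partial RequestOutputs; keep the last one.
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            final_output = output
        return {"text": [o.text for o in final_output.outputs]}


deployment = VLLMPredictDeployment.bind(model="facebook/opt-125m")
```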
What does `ray status` say?
Has anyone had any success serving LLMs through the 0.5.0 Docker image?
I followed these steps:
```bash
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
    -e HF_HOME=/tmp/data -v $cache_dir:/home/user/data \
    anyscale/ray-llm:0.5.0 bash
```
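One thing I'm second-guessing here: `HF_HOME` points at `/tmp/data` while the host cache is mounted at `/home/user/data`, so the mounted cache probably isn't being used for model weights. If the intent is to reuse the host cache, a variant like this might be closer (the paths are just my guess at the intent):

```bash
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
    -e HF_HOME=/home/user/data \
    -v "$cache_dir":/home/user/data \
    anyscale/ray-llm:0.5.0 bash
```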
I reconfigured the model .yaml to use `accelerator_type_T4`.
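I don't have the exact diff in front of me, but the part of the model config I touched is roughly the fragment below. The key names are from memory of the ray-llm model schema and the surrounding values are placeholders, so double-check against the shipped amazon--LightGPT.yaml:

```yaml
# fragment of the model .yaml; only the worker resource request was changed
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  resources_per_worker:
    accelerator_type_T4: 0.01
```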
```bash
ray start --head --dashboard-host=0.0.0.0 --num-cpus 48 --num-gpus 4 \
    --resources='{"accelerator_type_T4": 4}'
serve run ~/serve_configs/amazon--LightGPT.yaml
```
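For anyone retracing this, one way to confirm the custom resource actually registered before running `serve run` (plain Ray API, nothing ray-llm specific):

```python
import ray

ray.init(address="auto")  # attach to the head node started above
# Expect to see "accelerator_type_T4": 4.0 alongside the CPU/GPU entries
# if the ray start flag took effect.
print(ray.cluster_resources())
```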
It runs, but I get this warning:

"Deployment 'VLLMDeployment: amazon--LightGPT' in application 'ray-llm' has 2 replicas that have taken more than 30s to initialize. This may be caused by a slow init or reconfigure method."

From here, nothing happens. I've let it run for up to a couple of hours and it just hangs at this point.
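About all I can think to do at that point is poke at the Serve state and the logs directly (standard Ray/Serve commands, nothing specific to this image):

```bash
# deployment and replica state as Serve sees it
serve status
ray status

# controller and replica logs live under the session directory inside the container
ls /tmp/ray/session_latest/logs/serve/
```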
Any success working around these issues?