golemsentience opened this issue 2 months ago:
I'm using vLLM as the serving engine and running inference through Ray Serve. Here is the sample script I started from: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
I then wrap it as a Ray Serve deployment like this:
```python
@serve.deployment  # (num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMPredictDeployment:
    def __init__(self, **kwargs):
        ...
```
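The rest of the wrapper looks roughly like the sketch below. The model name, route, and sampling defaults are placeholders, and the engine calls follow my reading of vLLM's `AsyncLLMEngine` API, so treat it as approximate rather than a drop-in:

```python
from fastapi import FastAPI
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMPredictDeployment:
    def __init__(self, **kwargs):
        # kwargs are forwarded to the engine, e.g. model="facebook/opt-125m"
        args = AsyncEngineArgs(**kwargs)
        self.engine = AsyncLLMEngine.from_engine_args(args)

    @app.post("/generate")
    async def generate(self, prompt: str, max_tokens: int = 128) -> dict:
        sampling_params = SamplingParams(max_tokens=max_tokens)
        request_id = random_uuid()
        final_output = None
        # The async engine yields partial RequestOutputs; keep the last one.
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            final_output = output
        return {"text": [o.text for o in final_output.outputs]}


deployment = VLLMPredictDeployment.bind(model="facebook/opt-125m")
```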
What does `ray status` say?
Has anyone had any success serving LLMs through the 0.5.0 Docker image?
I followed these steps:
```bash
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
    -e HF_HOME=/tmp/data -v $cache_dir:/home/user/data \
    anyscale/ray-llm:0.5.0 bash
```
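One thing I'm second-guessing here: `HF_HOME` points at `/tmp/data` while the host cache is mounted at `/home/user/data`, so the mounted cache probably isn't being used for model weights. If the intent is to reuse the host cache, a variant like this might be closer (the paths are just my guess at the intent):

```bash
cache_dir=${XDG_CACHE_HOME:-$HOME/.cache}
docker run -it --gpus all --shm-size 1g -p 8000:8000 \
    -e HF_HOME=/home/user/data \
    -v "$cache_dir":/home/user/data \
    anyscale/ray-llm:0.5.0 bash
```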
I reconfigured the model .yaml to use `accelerator_type_T4`.
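I don't have the exact diff in front of me, but the part of the model config I touched is roughly the fragment below. The key names are from memory of the ray-llm model schema and the surrounding values are placeholders, so double-check against the shipped amazon--LightGPT.yaml:

```yaml
# fragment of the model .yaml; only the worker resource request was changed
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  resources_per_worker:
    accelerator_type_T4: 0.01
```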
```bash
ray start --head --dashboard-host=0.0.0.0 --num-cpus 48 --num-gpus 4 \
    --resources='{"accelerator_type_T4": 4}'
serve run ~/serve_configs/amazon--LightGPT.yaml
```
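For anyone retracing this, one way to confirm the custom resource actually registered before running `serve run` (plain Ray API, nothing ray-llm specific):

```python
import ray

ray.init(address="auto")  # attach to the head node started above
# Expect to see "accelerator_type_T4": 4.0 alongside the CPU/GPU entries
# if the ray start flag took effect.
print(ray.cluster_resources())
```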
It runs, but I get this warning:

"Deployment 'VLLMDeployment: amazon--LightGPT' in application 'ray-llm' has 2 replicas that have taken more than 30s to initialize. This may be caused by a slow init or reconfigure method."

From here, nothing happens. I've let it run for up to a couple of hours and it just hangs at this point.
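About all I can think to do at that point is poke at the Serve state and the logs directly (standard Ray/Serve commands, nothing specific to this image):

```bash
# deployment and replica state as Serve sees it
serve status
ray status

# controller and replica logs live under the session directory inside the container
ls /tmp/ray/session_latest/logs/serve/
```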
Any success working around these issues?