vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to run VLLM on multiple tpu hosts V4-32 #8582

Open sparsh35 opened 2 months ago

sparsh35 commented 2 months ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

There is an example for offline inference on TPUs, but it does not utilize all 4 hosts of a v4-32. If I run the code on all hosts, Ray only detects each host's own TPU resources. The environment is correct and it works for a single host, but I don't know how to let vLLM detect and use all 4 hosts. I would like to do that for bigger models.
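
A rough sketch of the manual multi-host setup being attempted (IP addresses are placeholders, not a verified configuration):

```bash
# On host 0 of the v4-32 slice (head node):
ray start --head --port=6379

# On hosts 1-3 (workers), pointing at the head node's IP:
ray start --address=10.0.0.1:6379

# Then, on the head node, check that all 4 hosts' resources are visible
# before launching the offline-inference example:
ray status
```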


youkaichao commented 2 months ago

See https://docs.vllm.ai/en/stable/serving/distributed_serving.html#multi-node-inference-and-serving; you need to set up a Ray cluster first.
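
For reference, a sketch of how the helper script from that page is typically invoked, following its own usage string (the image name, IP, and cache path are placeholders):

```bash
# On the chosen head node:
bash run_cluster.sh <docker_image> 10.0.0.1 --head /root/.cache/huggingface

# On every other node in the slice:
bash run_cluster.sh <docker_image> 10.0.0.1 --worker /root/.cache/huggingface
```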

sparsh35 commented 2 months ago

I'm getting this error, @youkaichao: `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`. I am trying the docker route, installing with docker; my earlier attempt was handling the Ray server directly with a placement group. I think this script is not configured for TPU. And thanks again for your help.

sparsh35 commented 2 months ago

The Docker vllm-tpu image is running (screenshot omitted). So I changed the run_cluster.sh config as follows: the change was removing the `--gpus all` flag from the docker command and adding TPU resources to the `ray start` command.

```bash
#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first four arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3"  # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the docker command
ADDITIONAL_ARGS="$@"

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to clean up on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block --num-cpus=220 --resources='{\"tpu\": 4}'"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user-specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    ${ADDITIONAL_ARGS} \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
```

Then `ray status` shows 16 TPUs, but pipeline parallel 4 with tensor parallel 4 won't work, and I can't use tensor parallel 16 because the model has 28 attention heads, which is not divisible by that tensor parallel size. Here is `ray status` inside the docker image (screenshot omitted).
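
As a quick illustration of the head-count constraint (a sketch; vLLM requires the number of attention heads to be divisible by the tensor parallel size):

```bash
# Which tensor-parallel sizes are compatible with 28 attention heads?
for tp in 1 2 4 7 8 14 16 28; do
  if (( 28 % tp == 0 )); then
    echo "TP=${tp}: divides 28 heads evenly"
  else
    echo "TP=${tp}: rejected (28 % ${tp} != 0)"
  fi
done
```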

sparsh35 commented 2 months ago

So I have tried both methods in run_cluster.sh, adding them in the resources as well as deleting the resource file, but the issue persists. When trying to serve, it gives the following error (screenshot omitted).
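
A sketch of the kind of serve invocation this refers to; the model name and parallel sizes here are placeholders, not the ones from the original report:

```bash
# Hypothetical serve command on the head node once the Ray cluster is up.
vllm serve meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 4 \
    --distributed-executor-backend ray
```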

sparsh35 commented 2 months ago

Debugging with print statements shows it can't recognize the number of TPUs and fails with a placement group device assertion. Any help would be appreciated; it is urgent.
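
One way to check what is actually registered cluster-wide (a sketch; assumes Ray's default address resolution works inside the container):

```bash
# Cluster-wide view of resources as Ray reports them.
ray status

# Programmatic check: with the custom "tpu" resource from the modified
# run_cluster.sh above, this should report {"tpu": 16.0, ...} across 4 hosts.
python -c 'import ray; ray.init(address="auto"); print(ray.cluster_resources())'
```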

sparsh35 commented 2 months ago

I think it may be related to the TPU environment variables used for a pod in GKE or gcloud, like what is done in:

Relevant PR

This is needed for libtpu and the TPU driver to know which TPU chip is actually visible. On GKE these need to be set; otherwise the TPU driver will fail to initialize because the number of devices would differ from the number of visible worker addresses.
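
For illustration only, a sketch of the kind of per-host variables this refers to; the exact names and values depend on the libtpu version and topology, so treat these as assumptions rather than a verified recipe:

```bash
# Hypothetical values for host 0 of a 4-host v4-32 slice; GKE normally
# injects these, while on plain TPU VMs they may need to be set manually.
export TPU_WORKER_ID=0                                              # index of this host within the slice
export TPU_WORKER_HOSTNAMES="10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4"   # all hosts in the slice
```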

sparsh35 commented 2 months ago

Any ideas, @youkaichao?

avshalomman commented 4 weeks ago

@sparsh35 did you succeed eventually? I followed your leads and managed to run inference on a v4-16 with the following changes: