vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Unable to run distributed inference on ray with llama-65B, tensor_parallel_size > 1 #3196

Closed hxer7963 closed 4 months ago

hxer7963 commented 8 months ago

Issue Description:

When I tried to deploy the llama-hf-65B model on an 8-GPU machine, I followed the example in Distributed Inference and Serving (link) and wrote the following code:

from vllm import LLM
llm = LLM("/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/", trust_remote_code=True, tensor_parallel_size=4)

However, Ray raised an OOM exception, as shown in the attached image. Note that setting tensor_parallel_size=8 results in the same exception.

[Attached image: Ray OOM exception traceback]

Even when I replaced the model_dir with the llama-13B model, setting tensor_parallel_size=8 still triggers a Ray OOM exception.

When I set the model directory to llama-13B with tensor_parallel_size=4, the model sometimes loads and runs inference successfully. However, initializing the Ray environment and the paged-attention memory takes a considerable amount of time, and it is hard to tell whether the program is stuck.
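As a rough sanity check (back-of-envelope arithmetic with assumed numbers, not measurements from this machine): the fp16 weights of a 65B-parameter model alone take about 130 GB, so the per-GPU weight footprint under tensor parallelism can be estimated before even considering the KV cache or host-RAM pressure during loading:

```python
# Back-of-envelope sketch: fp16 weights only, ignoring activations
# and the KV cache. All numbers are illustrative assumptions.
def weight_gib_per_gpu(n_params: float, tensor_parallel_size: int,
                       bytes_per_param: int = 2) -> float:
    """Approximate per-GPU weight memory in GiB under tensor parallelism."""
    return n_params * bytes_per_param / tensor_parallel_size / 1024**3

print(round(weight_gib_per_gpu(65e9, 4), 1))  # ~30.3 GiB per GPU
print(round(weight_gib_per_gpu(65e9, 8), 1))  # ~15.1 GiB per GPU
```

This does not explain a Ray host-memory OOM by itself, but it shows that tensor_parallel_size=4 leaves little headroom on many GPUs, and that the full checkpoint still has to pass through host RAM while loading.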

Here is information about my local environment:

jony0113 commented 8 months ago

In my case, it hung for 40 minutes after updating to version 0.3.0; see #2959. You could try 0.2.7 to check whether it works.

hmellor commented 8 months ago

After running llm = LLM("/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/", trust_remote_code=True, tensor_parallel_size=4), what is the output of ray logs raylet.out -ip 192.168.129.36? (as suggested in the error in the image you uploaded)
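If the `ray logs` CLI is awkward to reach, the raylet log can also be read straight from disk. By default Ray writes per-session logs under `/tmp/ray/session_latest/logs` (an assumption: default temp dir, single local session); a minimal sketch to tail it:

```python
from pathlib import Path

# Sketch: tail the raylet log from Ray's default log directory.
# The default location (/tmp/ray/session_latest/logs) is an assumption;
# it changes if Ray was started with a custom --temp-dir.
def tail_raylet_log(n_lines: int = 50,
                    log_dir: str = "/tmp/ray/session_latest/logs") -> list[str]:
    """Return the last n_lines of raylet.out, or [] if it does not exist."""
    log_path = Path(log_dir) / "raylet.out"
    if not log_path.exists():
        return []
    return log_path.read_text(errors="replace").splitlines()[-n_lines:]
```

The OOM killer's reasoning (which worker it killed and why) usually appears near the end of that file.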

hxer7963 commented 8 months ago

> In my case, it hung for 40 minutes after updating to version 0.3.0; see #2959. You could try 0.2.7 to check whether it works.

I reinstalled vLLM 0.2.7, but Ray still hangs.

laishuzhong commented 7 months ago

> In my case, it hung for 40 minutes after updating to version 0.3.0; see #2959. You could try 0.2.7 to check whether it works.
>
> I reinstalled vLLM 0.2.7, but Ray still hangs.

I am seeing the same error.

jony0113 commented 7 months ago

> In my case, it hung for 40 minutes after updating to version 0.3.0; see #2959. You could try 0.2.7 to check whether it works.
>
> I reinstalled vLLM 0.2.7, but Ray still hangs.

I finally found out that, in my case, slow random disk access was what stalled model loading. You may want to check your disk I/O pressure; I hope this helps.
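To put a number on the disk suspicion, one can time a sequential read of a large file on the same volume as the checkpoint. A minimal sketch (the path you pass is up to you, e.g. one of the model's checkpoint shards):

```python
import time

# Sketch: measure sustained sequential read throughput of a file.
# If this is far below the disk's rated speed, loading a ~130 GB
# checkpoint will indeed look like a hang.
def read_throughput_mib_s(path: str, block_size: int = 1024 * 1024) -> float:
    """Sequentially read `path` and return throughput in MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / max(elapsed, 1e-9)
```

Note that the OS page cache inflates repeated runs; for a fair measurement, read a file that has not been touched recently, or drop caches first.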

DarkLight1337 commented 4 months ago

We have added documentation for this situation in #5430. Please take a look.