vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

vllm hangs when reinitializing ray #1058

Closed nelson-liu closed 4 months ago

nelson-liu commented 1 year ago

I'd like to be able to unload a vllm model and re-load it later, in the same script. However, the following (on 0.1.7) causes the script to hang (disclaimer: this isn't my particular workload, but a minimal reproducible example):

from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)

Results in:

2023-09-15 11:43:25,943 INFO worker.py:1621 -- Started a local Ray instance.
INFO 09-15 11:43:51 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-70b-chat-hf', tokenizer='meta-llama/Llama-2-70b-chat-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, download_dir='/scr/biggest/nfliu/cache/huggingface/', load_format=pt, tensor_parallel_size=2, seed=0)
INFO 09-15 11:43:51 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 09-15 11:45:58 llm_engine.py:199] # GPU blocks: 2561, # CPU blocks: 1638
Processed prompts: 100%|█████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.17s/it]
2023-09-15 11:46:28,348 INFO worker.py:1453 -- Calling ray.init() again after it has already been called.

Then it just hangs forever (I've been waiting 10 minutes with no sign of life). Checking the GPUs shows that the model is indeed unloaded from the GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   30C    P0              61W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   31C    P0              57W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I'm fairly sure this is related to Ray, since the hang doesn't happen if tensor parallelism is set to 1 (e.g., if you're running a smaller model). When I Ctrl+C out of the script after it hangs, the traceback shows it's stuck on ray.get(current_placement_group.ready(), timeout=1800) (https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63).

Is there any way to "reset" the ray state, such that it initializes from scratch the second time?

hsm1997 commented 1 year ago

Maybe you can try inserting os.system("ray stop --force") somewhere between the unload and the reload.
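
For reference, a minimal sketch of that suggestion applied to the reproduction script above; the del, gc.collect(), and ray.shutdown() calls are assumptions added for illustration, and later comments in this thread note that tearing Ray down this way is not always cleanly restartable:

import gc
import os

import ray
from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    outputs = llm.generate(prompts, sampling_params)
    # Assumption: drop the engine and disconnect this driver from Ray before
    # forcing the local cluster down, so the next LLM() starts from scratch.
    del llm
    gc.collect()
    if ray.is_initialized():
        ray.shutdown()
    os.system("ray stop --force")
    return outputs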

raihan0824 commented 1 year ago

same problem, any solution?

Fenkail commented 12 months ago

I encountered the same issue. It runs fine when I use tensor_parallel_size=1, but it hangs when I use tensor_parallel_size>1. I have tried reinstalling many times, but it didn't help.

The final solution for me was to modify the vllm/engine/ray_utils.py file and limit the number of CPUs used. After making this change, it works properly. The modified code is: ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True).

Note: I encountered hanging issues while using tensor_parallel_size>1 on a 128-core machine. However, running tensor_parallel_size>1 on a 96-core machine works normally.
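
A sketch of the same CPU cap without patching the installed package, assuming vLLM reuses an already-initialized Ray instance (which the "Calling ray.init() again" log above suggests); the specific numbers here are illustrative:

import ray
from vllm import LLM

# Start Ray explicitly with a CPU cap before constructing the LLM, so the
# engine attaches to this instance instead of initializing Ray with defaults.
ray.init(num_cpus=32, num_gpus=4, ignore_reinit_error=True)

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=2)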

yichenjm commented 11 months ago

@Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as the OP.

pvtoan commented 11 months ago

Hi @Fenkail, I already modified ray_utils.py as you suggested, but the problem is still there.

In fact, my PC has only two GPUs, so I'd like to know how you chose num_cpus and num_gpus to fix the problem.

Fenkail commented 11 months ago

@Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as the OP.

I just tried using 32 cores and it solved my problem. The specific number of CPU cores can be adjusted according to your needs. It was working fine on a machine with 96 cores, but I encountered issues on a 128-core machine, so I thought of limiting the CPU usage.

Fenkail commented 11 months ago

[screenshot: modified ray_utils.py]

Did you modify the ray_utils.py installed in the conda environment for vllm?

pvtoan commented 11 months ago

Yes, I did modify the ray_utils.py installed in my conda environment for vllm.

qizzzh commented 10 months ago

Hit the exact same issue when running vLLM in Ray serve.

qizzzh commented 10 months ago

In my case I have 4 GPUs and 3 Ray Serve deployments: two of them require 1 logical GPU each with tensor_parallelism=1, and another requires 2 logical GPUs with tensor_parallelism=2. It looks like vLLM gets stuck when handling the tensor_parallelism=2 deployment because there are not enough resources.

Resources
---------------------------------------------------------------
Usage:
 17.0/48.0 CPU
 4.0/4.0 GPU
 0B/104.83GiB memory
 44B/48.92GiB object_store_memory

Demands:
 {'GPU': 1.0} * 2 (PACK): 1+ pending placement groups
smallmocha commented 9 months ago

You should load the model outside the function so that it is only loaded once:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    trust_remote_code=True,
    load_format="pt")

def process_prompts(prompts):
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)

Dolfik1 commented 9 months ago

TLDR: Don't set the num_gpus value for vLLM, only set tensor_parallel_size.

I encountered the same problem, and here's what I found out:

  1. According to Ray's documentation, the framework itself will allocate the necessary GPUs (based on num_gpus) and set the CUDA_VISIBLE_DEVICES value. When I ran vLLM through Ray, Ray set CUDA_VISIBLE_DEVICES to 0,1,2,3 (I had num_gpus=4 specified); however, nvidia-smi showed that vLLM was using GPUs 4,5,6,7. So vLLM ignores the CUDA_VISIBLE_DEVICES value and chooses other devices. In my case I have 8 GPUs: I allocated 4 for vLLM and 2 for another model, leaving 2 free. But vLLM requested another 4 GPUs, and since Ray couldn't satisfy this request, vLLM started waiting for GPUs to free up. As soon as I removed the second model, which requested 2 GPUs, everything started working. Everything also worked when I allocated 2 GPUs to the vLLM model.

  2. If you try to run two identical applications using vLLM through Ray (in one Serve instance), everything breaks. The applications will not use different GPUs; instead they start loading data into the memory of the same GPUs while the other GPUs sit idle, which ultimately leads to OOM. I believe this is related to incorrect handling of the num_gpus value.

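The takeaway, sketched as a minimal Ray Serve deployment (the class name and engine arguments are illustrative assumptions; the relevant part is only that the replica itself reserves no GPUs and lets vLLM's placement group claim them):

from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Reserve no GPUs for the Serve replica itself; vLLM's own placement group
# requests the GPUs needed for tensor_parallel_size=2.
@serve.deployment(ray_actor_options={"num_gpus": 0})
class VLLMDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="meta-llama/Llama-2-70b-chat-hf",
            tensor_parallel_size=2,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

app = VLLMDeployment.bind()
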
hwaking commented 9 months ago

I just tried using 32 cores and it solved my problem.

paolovic commented 7 months ago

I am still having the problem. I want to deploy one model with tensor_parallel_size=2 (just 1 replica), one model with num_gpus=0.4 (with 2 replicas, so 0.8 GPUs in total), and one model with num_gpus=0.1 (with 1 replica). In total this would require 2.9 GPUs, which should be fine since I have 3 GPUs with sufficient VRAM available on this node alone.

ray status returns

Resources
---------------------------------------------------------------
Usage:
 10.0/24.0 CPU
 0.8999999999999999/4.0 GPU
 0B/200.20GiB memory
 44B/89.79GiB object_store_memory

Demands:
 {'CPU': 12.0}: 1+ pending tasks/actors

serve status returns

        message: 'Deployment ''vllmAPI'' in application ''ray vllm application''
          1 replicas that have taken more than 30s to be scheduled. This may be due
          to waiting for the cluster to auto-scale or for a runtime environment to
          be installed. Resources required for each replica: {"CPU": 12.0}, total
          resources available: {"CPU": 14.0}. Use `ray status` for more details.'

Edit: Problem solved. As advised above, for the model with tensor_parallel_size=2 I set num_gpus=0. For the others, as written above, one model has num_gpus=0.4 (2 replicas, so 0.8 GPUs in total) and one has num_gpus=0.1 (1 replica). At the same time, I also set CUDA_VISIBLE_DEVICES=0,1,2,3 for those others (since I have 4 GPUs), and then everything was able to spin up properly.

premsa commented 7 months ago

Fenkail's solution of setting the num_cpus parameter to the correct amount (i.e., 10 out of the 10 available in my case) solved my problem. A fix for SLURM jobs:

num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK'))
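
A small sketch of how that value might be applied, assuming Ray is started explicitly before the model is loaded; the fallback default of 1 is an illustrative assumption:

import os

import ray

# Use the CPU allocation SLURM granted to this task; fall back to 1 if unset.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)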

panxnan commented 6 months ago

I also fixed the problem by setting the Ray CPU count to 32:

ray start --head --num-cpus=32

It also works when I set the CPU count to 49 (since I have two physical CPUs, each with 48 cores).

shyringo commented 5 months ago

#1908 might be related, but in 'Offline Batched Inference' mode.

Vincent-Li-9701 commented 4 months ago

Hey folks, I had a similar issue; I'm running in offline inference mode. I was able to clear the resources with ray stop, but when I try to reload I get:

[2024-05-20 22:14:06,214 E 1068826 1069198] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate.

Does anyone know how to properly restart?

emirhanKural commented 4 months ago

Hi @DarkLight1337, is there any update on this bug? I also have the same problem when reloading a model in API inference.

First, when I run the API code, everything is fine and loading works.

If I try to directly reload a model, I get:

2024-06-05 16:32:47,026 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:32:47,035 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:32:47,035 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.

And nothing happens.

If I check ray status, shut down the Ray cluster, and reload a model, I get:

if ray.is_initialized():
    ray.shutdown()

new_model = AsyncLLMEngine.from_engine_args(engine_args, usage_context=UsageContext.API_SERVER)

2024-06-05 16:39:43,757 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:39:43,766 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:39:43,766 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.
INFO 06-05 16:39:44 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: .......

It seems to connect and start initializing the engine again, but the model never finishes loading and I get this error:

(RayWorkerWrapper pid=3562842) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.187.##.##:##..., 44163)

DarkLight1337 commented 4 months ago

I was just triaging the issues. I'm not that involved with the use of Ray in vLLM so I won't be of much assistance here.

DarkLight1337 commented 4 months ago

We have added documentation for this situation in #5430. Please take a look.