Closed nelson-liu closed 4 months ago
Maybe you can try inserting os.system("ray stop --force") somewhere between the unload and the reload.
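A minimal sketch of that suggestion, assuming the `ray` CLI is on your PATH. `stop_ray` and its `runner` parameter are hypothetical names used only for illustration; `subprocess.run` stands in for `os.system` so the exit code is easy to inspect. Note that a later comment in this thread reports a GCS connection failure after `ray stop`, so this may not work in every setup.

```python
import subprocess


def stop_ray(runner=subprocess.run):
    """Run `ray stop --force` and return its exit code.

    `runner` is injectable only so the helper can be exercised without Ray.
    """
    cmd = ["ray", "stop", "--force"]
    result = runner(cmd, check=False)
    return result.returncode


# Intended use (not executed here):
#   del llm        # unload the first model
#   stop_ray()     # stop leftover Ray processes
#   llm = LLM(...) # reload
```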
same problem, any solution?
I encountered the same issue. It runs fine when I use tensor_parallel_size=1, but it hangs when I use tensor_parallel_size>1. I have tried reinstalling many times but it didn't help.
The final solution for me was to modify the vllm/engine/ray_utils.py file and limit the number of CPUs used. After making this change, it works properly. The modified code is:
ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True)
Note: I encountered the hang with tensor_parallel_size>1 on a 128-core machine. However, running with tensor_parallel_size>1 on a 96-core machine works normally.
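The CPU cap described above could be sketched like this, assuming a fresh local Ray instance (no `address` is passed here). The limit of 32 is only the value that happened to work in this comment, and `capped_ray_kwargs` and `init_ray_capped` are hypothetical helper names, not part of vLLM or Ray.

```python
import os


def capped_ray_kwargs(num_gpus, cpu_limit=32):
    """Build keyword arguments for ray.init() with the CPU count capped."""
    available = os.cpu_count() or 1
    return {
        "num_cpus": min(available, cpu_limit),
        "num_gpus": num_gpus,
        "ignore_reinit_error": True,
    }


def init_ray_capped(num_gpus, cpu_limit=32):
    # Imported lazily so the helper above can be used without Ray installed.
    import ray

    return ray.init(**capped_ray_kwargs(num_gpus, cpu_limit))
```

On a machine with only two GPUs you would call `init_ray_capped(num_gpus=2)`; the cap never exceeds the number of cores the machine actually has.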
@Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as OP.
Hi @Fenkail, I already modified "ray_utils.py" as you suggested, but the problem is still there.
In fact, my PC has only two GPUs, so I'd like to know how you chose num_cpus and num_gpus to fix the problem.
@Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as OP.
I just tried using 32 cores and it solved my problem. The specific number of CPU cores can be adjusted according to your needs. It was working fine on a machine with 96 cores, but I encountered issues on a 128-core machine, so I thought of limiting the CPU usage.
ray_utils
Did you modify the ray_utils.py installed in the conda environment for vllm?
Yes, I did modify ray_utils.py, installed in my conda environment for vllm
Hit the exact same issue when running vLLM in Ray Serve.
In my case I have 4 GPUs and 3 Ray Serve deployments: 2 of them require 1 logical GPU each with tensor_parallelism=1, and the third requires 2 logical GPUs with tensor_parallelism=2. It looks like vLLM gets stuck when it tries to handle tensor_parallelism=2 because there are not enough resources.
Resources
---------------------------------------------------------------
Usage:
17.0/48.0 CPU
4.0/4.0 GPU
0B/104.83GiB memory
44B/48.92GiB object_store_memory
Demands:
{'GPU': 1.0} * 2 (PACK): 1+ pending placement groups
You should load the model outside the function so that it is only loaded once:
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=2, trust_remote_code=True, load_format="pt")

def process_prompts(prompts):
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)
TLDR: Don't set the num_gpus value for vLLM, only set tensor_parallel_size.
I encountered the same problem, and here's what I found out:
According to Ray's documentation, the framework itself will allocate the necessary GPUs (based on num_gpus) and set the CUDA_VISIBLE_DEVICES value. When I started running vLLM through Ray, I found that Ray sets the CUDA_VISIBLE_DEVICES value to 0,1,2,3 (I had num_gpus = 4 specified). However, when I called nvidia-smi, I found that vLLM uses 4,5,6,7. Therefore, vLLM ignores the CUDA_VISIBLE_DEVICES value and chooses other devices.
In my case, I have 8 GPUs, I allocated 4 for vLLM, and 2 for another model, leaving 2 free. But vLLM requested another 4 GPUs, and since Ray couldn't satisfy this request, vLLM started waiting for the GPUs to free up. As soon as I removed the second model, which requested 2 GPUs, everything started working. Everything also worked when I allocated 2 GPUs for the vLLM model.
If you try to run two identical applications using vLLM through Ray (in one serve), everything will break. The applications will not use different GPUs, but will start loading data into the memory of the same GPUs, while the other GPUs will be idle. Ultimately, this will lead to OOM. I believe this is related to the incorrect handling of the num_gpus value.
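The arithmetic behind that hang can be sketched as follows, assuming (as described in this comment) that vLLM requests `tensor_parallel_size` GPUs through its own placement group on top of whatever `num_gpus` the enclosing deployment already reserved. `gpu_demand` is a hypothetical helper for illustration, not part of Ray or vLLM.

```python
def gpu_demand(total_gpus, actor_num_gpus, tensor_parallel_size, other_gpus=0):
    """Return (requested, fits) for a single node.

    requested = GPUs reserved for the vLLM deployment via num_gpus
              + GPUs vLLM asks for itself (tensor_parallel_size)
              + GPUs reserved by other deployments.
    """
    requested = actor_num_gpus + tensor_parallel_size + other_gpus
    return requested, requested <= total_gpus


# 8 GPUs: 4 reserved via num_gpus, 4 more requested by vLLM, 2 for another model.
print(gpu_demand(8, actor_num_gpus=4, tensor_parallel_size=4, other_gpus=2))
# Same node, but without num_gpus on the vLLM deployment.
print(gpu_demand(8, actor_num_gpus=0, tensor_parallel_size=4, other_gpus=2))
```

The first call reports a demand of 10 GPUs on an 8-GPU node, which matches the pending request described above; the second reports 6 and fits.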
I just tried using 32 cores and it solved my problem.
I am still having the problem. I want to deploy one model with tensor_parallel_size=2 (just 1 replica), one model with num_gpus=0.4 (with 2 replicas, so 0.8 GPUs in total), and one model with num_gpus=0.1 (with 1 replica). In total, this would require 2.9 GPUs, which is OK since I have 3 GPUs, each with sufficient VRAM, available on this node alone.
ray status returns:
Resources
---------------------------------------------------------------
Usage:
10.0/24.0 CPU
0.8999999999999999/4.0 GPU
0B/200.20GiB memory
44B/89.79GiB object_store_memory
Demands:
{'CPU': 12.0}: 1+ pending tasks/actors
serve status returns:
message: 'Deployment ''vllmAPI'' in application ''ray vllm application''
1 replicas that have taken more than 30s to be scheduled. This may be due
to waiting for the cluster to auto-scale or for a runtime environment to
be installed. Resources required for each replica: {"CPU": 12.0}, total
resources available: {"CPU": 14.0}. Use `ray status` for more details.'
Edit: Problem solved. As advised above, I defined num_gpus=0 for the one model with tensor_parallel_size=2. For the others, as written above, I kept num_gpus=0.4 for one model (2 replicas, so 0.8 GPUs in total) and num_gpus=0.1 for the other (1 replica). At the same time, I also set CUDA_VISIBLE_DEVICES=0,1,2,3 for these others (since I have 4 GPUs), and then it was able to spin up properly.
Fenkail's solution of setting the 'num_cpus' parameter to the correct amount (i.e. 10 out of 10 available in my case) solved my problem. A fix for Slurm jobs:
num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK'))
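One caveat with the line above: outside a Slurm job, or when `--cpus-per-task` was not requested, `SLURM_CPUS_PER_TASK` may be unset, and `int(None)` raises a `TypeError`. A slightly more defensive sketch, where `slurm_cpu_count` is a hypothetical helper name and the fallback to `os.cpu_count()` is an assumption about what you want in that case:

```python
import os


def slurm_cpu_count(default=None):
    """Return SLURM_CPUS_PER_TASK as an int, or a fallback when it is unset."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    if value is None:
        # Not running under Slurm (or the variable was not exported).
        return default if default is not None else (os.cpu_count() or 1)
    return int(value)


num_cpus = slurm_cpu_count()
```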
I also fixed the problem by setting the number of Ray CPUs to 32:
ray start --head --num-cpus=32
It also works when I set the CPU count to 49 (I have two physical CPUs, each with 48 cores).
Hey folks, I had a similar issue. I'm running in offline inference mode. I was able to clear the resources with ray stop.
But when I try to reload, I get:
[2024-05-20 22:14:06,214 E 1068826 1069198] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate.
Does anyone know how to properly restart?
Hi @DarkLight1337, is there any update on this bug? I also have the same problem when reloading a model in API inference.
At first, when I run the API code, everything is fine and the model loads OK.
If I try to reload a model directly, I get:
2024-06-05 16:32:47,026 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:32:47,035 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:32:47,035 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.
And nothing happens.
If I check the Ray status, shut down the Ray cluster, and reload a model, I get:
if ray.is_initialized():
    ray.shutdown()
new_model = AsyncLLMEngine.from_engine_args(engine_args, usage_context=UsageContext.API_SERVER)
2024-06-05 16:39:43,757 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:39:43,766 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:39:43,766 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.
INFO 06-05 16:39:44 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: .......
It seems to connect and start loading the model again, but the load never completes and I get this error:
(RayWorkerWrapper pid=3562842) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.187.##.##:##..., 44163)
I was just triaging the issues. I'm not that involved with the use of Ray in vLLM so I won't be of much assistance here.
We have added documentation for this situation in #5430. Please take a look.
I'd like to be able to unload a vllm model and re-load it later, in the same script. However, the following (on 0.1.7) causes the script to hang (disclaimer: this isn't my particular workload, but a minimal reproducible example):
Results in:
Then, it just hangs forever (been waiting 10 minutes, with no sign of life). Checking the GPUs shows that the model is indeed unloaded from the GPUs.
I'm fairly sure this is related to ray, since this doesn't happen if tensor parallelism is set to 1 (e.g., if you're running a smaller model). When I ctrl+c out of the script after it hangs, it shows that it's stuck on
ray.get(current_placement_group.ready(), timeout=1800)
https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63
Is there any way to "reset" the ray state, such that it initializes from scratch the second time?