In my case, it hangs for about 40 minutes after updating to version 0.3.0; see #2959. You may try 0.2.7 to check whether it works.
After running `llm = LLM("/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/", trust_remote_code=True, tensor_parallel_size=4)`, what is the output of `ray logs raylet.out -ip 192.168.129.36`? (as suggested in the error message in the image you uploaded)
I reinstalled vLLM 0.2.7, but Ray still hangs and gets stuck.
I get the same error as you.
I finally found out that, in my case, the loading was stuck because of slow random disk access. You may want to check your I/O pressure; I hope this helps.
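For reference, here is a minimal sketch of one way to watch disk I/O while the model loads (it assumes `psutil` is installed; a tool such as `iostat -x 1` works just as well):

```python
import time
import psutil  # assumed available; not part of vLLM

# Sample system-wide disk I/O once per second while the model is loading.
# Low read throughput together with steadily growing read_time suggests the
# disk (random access speed) is the bottleneck rather than vLLM/Ray itself.
prev = psutil.disk_io_counters()
for _ in range(10):
    time.sleep(1)
    cur = psutil.disk_io_counters()
    read_mb_s = (cur.read_bytes - prev.read_bytes) / 1e6
    busy_ms = cur.read_time - prev.read_time
    print(f"read: {read_mb_s:7.1f} MB/s   time spent reading: +{busy_ms} ms")
    prev = cur
```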
We have added documentation for this situation in #5430. Please take a look.
Issue Description:
When I tried to deploy the llama-hf-65B model on an 8-GPU machine, I followed the example in Distributed Inference and Serving (link) and wrote the following code:
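Roughly the following (a minimal sketch reconstructed from the call quoted earlier in this thread; the prompt and sampling parameters are illustrative placeholders, not the original script):

```python
from vllm import LLM, SamplingParams

# Load LLaMA-65B across 4 GPUs with tensor parallelism; Ray is used as the
# distributed backend when tensor_parallel_size > 1.
llm = LLM(
    "/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/",
    trust_remote_code=True,
    tensor_parallel_size=4,
)

# Illustrative generation call.
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```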
However, Ray raised an OOM exception, as shown in the attached image. Note that setting `tensor_parallel_size=8` results in the same exception. Even when I replaced the model directory with the llama-13B model, setting `tensor_parallel_size=8` still triggers a Ray OOM exception.
When I set the model directory to llama-13B and `tensor_parallel_size=4`, the model sometimes loads and infers successfully. However, initializing the Ray environment and the paged-attention memory takes a considerable amount of time, and it is hard to tell whether the program is actually stuck.

Here is information about my local environment: