vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Tensor parallel question on multiple GPUs and API deployment issue #610

Closed: BaileyWei closed this issue 3 months ago

BaileyWei commented 11 months ago

I tried running `NCCL_IGNORE_DISABLED_P2P=1 CUDA_VISIBLE_DEVICES=1,5,6,7 python vllm_inference.py --model_name model_hub/llama2-7B-chat-hf --tp_size 4` and found that the memory usage on the GPUs looked like this: [screenshot]. This is just a 7B model, so did I apply the TP strategy incorrectly?
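A minimal script matching that command would look roughly like the sketch below; the actual vllm_inference.py is not shown in this thread, so the flag names and the generation call are assumptions based on the command above.

```python
# Hypothetical sketch of vllm_inference.py (not the poster's actual script).
import argparse

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, required=True)
parser.add_argument("--tp_size", type=int, default=1)
args = parser.parse_args()

# tensor_parallel_size shards the weights across that many GPUs, so with
# --tp_size 4 a 7B model should take roughly a quarter of its full footprint
# per GPU. On top of that, vLLM preallocates KV-cache blocks up to
# gpu_memory_utilization (0.9 by default), which is why each GPU can still
# look nearly full in nvidia-smi.
llm = LLM(model=args.model_name, tensor_parallel_size=args.tp_size)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```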

Also, I'm confused because inference actually works on a single GPU, but when I try to start an API using the LLM engine, it always gets stuck at the step that reports the number of GPU and CPU blocks, like this: [screenshot]. That's why I tried tensor parallelism for the API deployment.
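For reference, starting an API on top of vLLM's engine boils down to something like the sketch below; the model path and parallel size are copied from the command above, and the rest is an assumption about how the engine is created, not the actual serving code from the screenshot.

```python
# Rough sketch of creating vLLM's async engine for an API server
# (illustration only).
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="model_hub/llama2-7B-chat-hf",
    tensor_parallel_size=4,
)

# The "# GPU blocks: ..., # CPU blocks: ..." log line is printed while the
# engine profiles memory and allocates the KV cache. A hang around this point
# with tensor parallelism often means the distributed workers never finished
# initializing (e.g. an NCCL/P2P problem), not that the model ran out of memory.
engine = AsyncLLMEngine.from_engine_args(engine_args)
```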

nootums commented 11 months ago

Hey did you find a fix? I'm facing the same issue.

[screenshot of VRAM usage]

I'm trying to load llama-2-7b-chat-hf on 4 GPUs, and the VRAM usage is stuck at this level.

BaileyWei commented 11 months ago

> Hey did you find a fix? I'm facing the same issue. [screenshot] I'm trying to load llama-2-7b-chat-hf on 4 GPUs, and the VRAM usage is stuck at this level.

Not yet. Did you try to load the model using the LLM engine? Can I have a look at your command and inference code?

nootums commented 11 months ago

> > Hey did you find a fix? I'm facing the same issue. [screenshot] I'm trying to load llama-2-7b-chat-hf on 4 GPUs, and the VRAM usage is stuck at this level.
>
> Not yet. Did you try to load the model using the LLM engine? Can I have a look at your command and inference code?

[screenshot of the launch command]

This is how I'm loading the model. I downloaded the model from Hugging Face, and all the model files, config, and tokenizer are present in the directory passed in the command.

zhuohan123 commented 11 months ago

@BaileyWei Hey, thank you for reporting this. Is it possible for you to locate which line the code gets stuck on? You can try inserting prints in the code to see where it hangs.
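One low-effort way to do that, sketched below under the assumption that the entry point looks like the vllm_inference.py sketch above, is to register Python's faulthandler so the hung process can be made to print a traceback for every thread.

```python
# Debugging sketch (hypothetical): dump a traceback from a hung vLLM process.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process print the current stack of
# every thread to stderr, showing the exact line initialization is blocked on.
faulthandler.register(signal.SIGUSR1)

from vllm import LLM  # noqa: E402  (imported after registering the handler)

llm = LLM(model="model_hub/llama2-7B-chat-hf", tensor_parallel_size=4)
```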

batindfa commented 10 months ago

@zhuohan123 @nootums @BaileyWei @tmm1 @zxdvd Hi, I get an error when running inference with Llama-70b-chat on 16 GPUs, all in one machine: https://github.com/vllm-project/vllm/issues/930. Do you have any ideas? Thanks!

hmellor commented 3 months ago

Closing this issue as stale as there has been no discussion in the past 3 months.

If you are still experiencing the issue you describe, feel free to re-open this issue.

man2machine commented 2 months ago

I'm hitting a similar issue: the engine hangs on initialization when using tensor parallelism across multiple GPUs, even though the GPUs are available.
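A quick way to narrow this down (a sketch, with the GPU count as an assumption) is to check whether a bare NCCL all-reduce across the same GPUs completes outside of vLLM; if this also hangs, the problem is in the NCCL/P2P setup rather than in vLLM itself.

```python
# nccl_check.py: minimal NCCL sanity check, independent of vLLM.
# Run with, e.g.: torchrun --nproc_per_node=4 nccl_check.py
import torch
import torch.distributed as dist

# torchrun provides the rendezvous environment variables.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# A single all-reduce should finish in seconds; a hang here points at the
# NCCL/P2P configuration (driver, topology, NCCL env vars), not at vLLM.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print(f"rank {rank}: all_reduce OK, value = {x.item()}")

dist.destroy_process_group()
```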