Closed BaileyWei closed 3 months ago
Hey did you find a fix? I'm facing the same issue.
I'm trying to load llama-2-7b-chat-hf on 4 GPUs and VRAM usage is stuck at this.
Not yet. Did you try loading the model with the LLM engine? Can I take a look at your command and inference code?
This is how I'm loading the model. I downloaded the model from Hugging Face, and all the model files, config, and tokenizer are present in the directory passed in the command.
@BaileyWei Hey, thank you for reporting this. Is it possible for you to locate which line the code gets stuck on? You can insert prints in the code to see where it hangs.
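Instead of sprinkling prints, Python's built-in `faulthandler` can dump every thread's current stack, which quickly shows where a hung process is blocked. This is a generic debugging sketch, not vLLM-specific:

```python
import faulthandler
import signal
import sys

# Dump the current stack of every thread to stderr right now.
faulthandler.dump_traceback(file=sys.stderr)

# Or arm a signal handler so a hung worker can be inspected from
# outside without restarting it:
#   kill -USR1 <pid>   ->  dumps all thread stacks to stderr
faulthandler.register(signal.SIGUSR1, file=sys.stderr)
```

Putting `faulthandler.register(...)` near the top of the inference script, then sending `SIGUSR1` once it hangs, pinpoints the exact line each worker is stuck on.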
@zhuohan123 @nootums @BaileyWei @tmm1 @zxdvd Hi. I hit an error when running inference with Llama-70b-chat using 16 GPUs, all on one machine: https://github.com/vllm-project/vllm/issues/930 Do you have any ideas? Thanks!
Closing this issue as stale as there has been no discussion in the past 3 months.
If you are still experiencing the issue you describe, feel free to re-open this issue.
I'm seeing a similar issue: the engine hangs on initialization when using tensor parallelism across multiple GPUs, even though GPUs are available.
I tried this
NCCL_IGNORE_DISABLED_P2P=1 CUDA_VISIBLE_DEVICES=1,5,6,7 python vllm_inference.py --model_name model_hub/llama2-7B-chat-hf --tp_size 4
and found that the memory usage on the GPUs was like this. This is just a 7B model, so did I apply the TP strategy incorrectly? Also, I'm confused because inference actually works on a single GPU, but when I try to start an API using the LLM engine, it always gets stuck at the step that reports the GPU and CPU block counts, which is why I tried tensor parallelism for the API deployment.
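Since `NCCL_IGNORE_DISABLED_P2P=1` is already being forced, the hang may be a GPU interconnect issue: tensor-parallel initialization stalls are often NCCL peer-to-peer problems between the specific GPUs selected. Two diagnostics worth running before changing the code (a sketch, assuming a standard CUDA install; the script name and flags are from the command above):

```shell
# Show the pairwise interconnect between GPUs (NVLink / PIX / SYS).
# GPUs connected only via SYS often cannot do P2P, which can stall TP init.
nvidia-smi topo -m

# Re-run with verbose NCCL logging to see which step initialization stalls on.
NCCL_DEBUG=INFO NCCL_IGNORE_DISABLED_P2P=1 CUDA_VISIBLE_DEVICES=1,5,6,7 \
    python vllm_inference.py --model_name model_hub/llama2-7B-chat-hf --tp_size 4
```

If the topology matrix shows the chosen GPUs (1,5,6,7) span different PCIe root complexes, trying a set of GPUs on the same switch is a cheap experiment.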