Open aabbccddwasd opened 2 weeks ago
vLLM has to have enough free memory for the entire KV cache, which for Qwen2 defaults to a 32k context. Try setting max-model-len to something lower. Also, set gpu_memory_utilization to 0.98; the value is the fraction of total GPU memory vLLM is allowed to use. The model alone needs about 18 GB per GPU, so vLLM sees that the KV cache would need more than the roughly 2 GB that remains and throws an error.
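As a rough back-of-the-envelope check of that arithmetic (a sketch only; the layer/head counts below are assumed Qwen2-72B config values, not taken from this thread):

# Estimate the per-GPU KV cache a 32k context needs, assuming Qwen2-72B uses
# 80 layers, 8 KV heads, head_dim 128 (GQA), an fp16 cache, and
# tensor_parallel_size=4. Numbers are illustrative only.
layers, kv_heads, head_dim, tp = 80, 8, 128, 4
bytes_per_token = 2 * layers * kv_heads * head_dim * 2 // tp   # K and V, 2 bytes each
kv_32k_gib = 32768 * bytes_per_token / 1024**3                  # ~2.5 GiB per GPU

weights_gib = 18                                                # per GPU, from the log
for util in (0.90, 0.98):
    budget_gib = 22 * util - weights_gib                        # 22 GB 2080 Ti cards
    print(f"util={util}: {budget_gib:.2f} GiB left for a {kv_32k_gib:.2f} GiB cache")
# At the default 0.9 the full-length cache does not fit; lowering max_model_len
# (or raising gpu_memory_utilization) closes the gap.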
OK, it looks like max-model-len is working, but how can I do this with LLM()? It doesn't have a parameter called "max_model_len", and "max_seq_len_to_capture" doesn't work.
You should be able to set max_model_len for LLM.
OK, I succeeded. I found the problem: max_model_len is passed through **kwargs, so it is not shown in PyCharm.
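Roughly why the IDE cannot list it (a simplified sketch of the signature, not vLLM's actual source):

# Simplified illustration: options such as max_model_len are not named
# parameters of LLM.__init__; they arrive only through **kwargs, so PyCharm's
# static autocomplete never lists them even though vLLM accepts them at runtime.
class LLM:
    def __init__(self, model: str, tensor_parallel_size: int = 1,
                 gpu_memory_utilization: float = 0.9, enforce_eager: bool = False,
                 **kwargs):
        self.engine_kwargs = kwargs  # in real vLLM these feed the engine arguments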
I also noticed a problem: when I use python -m vllm.entrypoints.openai.api_server, I can set max_model_len to 8700, but the maximum max_model_len I can set in LLM() is 8200, even though the other parameters are the same. This is my code:
from vllm import LLM

llm = LLM(model="../obj013-qwen/Qwen2-72B-Instruct-GPTQ-Int8",
          tensor_parallel_size=4,
          gpu_memory_utilization=1,
          enforce_eager=True,
          max_model_len=8200)  # anything higher than 8200 fails here
and this is the command:
python -m vllm.entrypoints.openai.api_server --model ./Qwen2-72B-Instruct-GPTQ-Int8 --tensor-parallel-size=4 --gpu-memory-utilization 1 --max-model-len 8700 --enforce-eager
Your current environment
🐛 Describe the bug
I set gpu_memory_utilization to 0.1, but before loading the weights vLLM already consumes 18.7 GB of VRAM, close to 22 GB * 0.9 = 19.8 GB. Then, when the weights were loaded, the 22 GB 2080 Ti went OOM. I also tried
python -m vllm.entrypoints.openai.api_server --model ../obj013-qwen/Qwen2-72B-Instruct-GPTQ-Int8 --tensor-parallel-size=4 --gpu-memory-utilization 0.1
but it didn't work either. Maybe it is not related to gpu_memory_utilization, but when I load an AWQ model and change gpu_memory_utilization, the VRAM actually consumed doesn't change, so I guessed the problem is caused by gpu_memory_utilization.
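(A hypothetical way to check that guess, not something run for this report: construct the engine with a small gpu_memory_utilization and see whether device-wide usage actually changes.)

import torch
from vllm import LLM

# Hypothetical check: try different gpu_memory_utilization values and print
# per-GPU usage right after the engine comes up.
llm = LLM(model="../obj013-qwen/Qwen2-72B-Instruct-GPTQ-Int8",
          tensor_parallel_size=4,
          gpu_memory_utilization=0.5,
          enforce_eager=True,
          max_model_len=4096)
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 1024**3:.1f} GiB in use")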
First it output this, then the rest of the output is an OOM error:
WARNING 06-30 18:37:15 config.py:217] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-06-30 18:37:15,586 INFO worker.py:1586 -- Connecting to existing Ray cluster at address: 192.168.3.123:6379...
2024-06-30 18:37:15,600 INFO worker.py:1771 -- Connected to Ray cluster.
INFO 06-30 18:37:15 config.py:623] Defaulting to use mp for distributed inference
WARNING 06-30 18:37:15 config.py:437] Possibly too large swap space. 64.00 GiB out of the 125.75 GiB total CPU memory is allocated for the swap space.
INFO 06-30 18:37:15 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='../obj013-qwen/Qwen2-72B-Instruct-GPTQ-Int8', speculative_config=None, tokenizer='../obj013-qwen/Qwen2-72B-Instruct-GPTQ-Int8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=../obj013-qwen/Qwen2-72B-Instruct-GPTQ-Int8)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-30 18:37:16 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-30 18:37:16 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:18 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:18 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:18 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:19 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:20 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:20 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:20 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:20 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:20 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:20 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:20 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:20 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-30 18:37:20 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-30 18:37:20 pynccl.py:63] vLLM is using nccl==2.20.5
Traceback (most recent call last):
  File "/home/aabbccddwasd/.conda/envs/obj013-env-vllm/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_15339bca'
Traceback (most recent call last):
  File "/home/aabbccddwasd/.conda/envs/obj013-env-vllm/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_15339bca'
Traceback (most recent call last):
  File "/home/aabbccddwasd/.conda/envs/obj013-env-vllm/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_15339bca'
WARNING 06-30 18:37:20 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=332366) WARNING 06-30 18:37:20 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=332365) WARNING 06-30 18:37:20 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=332367) WARNING 06-30 18:37:20 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 06-30 18:37:20 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-30 18:37:20 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:20 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:20 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:20 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:20 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:20 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:20 selector.py:51] Using XFormers backend.
INFO 06-30 18:37:30 model_runner.py:160] Loading model weights took 17.9828 GB
(VllmWorkerProcess pid=332367) INFO 06-30 18:37:30 model_runner.py:160] Loading model weights took 17.9828 GB
(VllmWorkerProcess pid=332365) INFO 06-30 18:37:30 model_runner.py:160] Loading model weights took 17.9828 GB
(VllmWorkerProcess pid=332366) INFO 06-30 18:37:30 model_runner.py:160] Loading model weights took 17.9828 GB
Before loading the weights it had already consumed 18.7 GB of VRAM ↓
Please help me with this problem, thanks.