Reminder
[X] I have read the README and searched the existing issues.
System Info
vLLM version: 0.6.2
LLaMA-Factory: latest version
Reproduction
CUDA_VISIBLE_DEVICES=1 API_PORT=8000 llamafactory-cli api --model_name_or_path /hdd/pingchuan/Qwen2.5-7B-Instruct --adapter_name_or_path /hdd/pingchuan/lora/sft --template qwen --finetuning_type lora --infer_backend vllm
I used this script to run vLLM inference on GPU 1, but the inference actually ran on GPU 0:
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.97 GiB. GPU 0 has a total capacity of 79.25 GiB of which 334.06 MiB is free. Process 627536 has 0 bytes memory in use. Process 627564 has 0 bytes memory in use. Process 308335 has 41.09 GiB memory in use. Including non-PyTorch memory, this process has 32.68 GiB memory in use. Of the allocated memory 32.16 GiB is allocated by PyTorch, and 24.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
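One way to narrow this down (a diagnostic sketch of my own, assuming PyTorch is importable in the same environment; not part of the original log): check what CUDA sees before vLLM initializes. Note that PyTorch renumbers the visible devices from 0, so with CUDA_VISIBLE_DEVICES=1 the single exposed card is reported as cuda:0 / "GPU 0" in tracebacks.

```python
import os
import torch

# The mask must be set before the first CUDA call in the process.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # expect "1"
print(torch.cuda.device_count())               # expect 1 if the mask is honored
print(torch.cuda.get_device_name(0))           # under the mask, index 0 is physical GPU 1
```

If device_count() prints more than 1 here, the environment variable is being overridden somewhere before vLLM starts.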
Expected behavior
No response
Others
No response