Closed LIUKAI0815 closed 1 month ago
vllm 0.6.0
Does this segmentation fault occur when disabling tensor parallel?
cc @Isotr0py since it may be related to GGUF loading
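For reference, "disabling tensor parallel" just means constructing the LLM with tensor_parallel_size=1. A minimal single-GPU sketch (reusing the reporter's paths; everything else here is assumed, not the reporter's exact script):

# Minimal single-GPU repro sketch, not the reporter's actual code.
# tensor_parallel_size=1 keeps everything on one GPU, which isolates whether the
# crash comes from the multi-GPU path or from the GGUF kernels themselves.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf",
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    tensor_parallel_size=1,  # tensor parallelism disabled
    gpu_memory_utilization=0.8,
    enforce_eager=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)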
Does this segmentation fault occur when disabling tensor parallel?
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 13.75 MiB is free. Process 2509972 has 23.63 GiB memory in use. Of the allocated memory 23.05 GiB is allocated by PyTorch, and 145.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
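(For completeness, the allocator hint from that message can be set as sketched below, though with only ~145 MiB reserved-but-unallocated it is unlikely to help here.)

# Sketch of the setting suggested by the error message itself; it must be set
# before the first CUDA allocation, i.e. before constructing the LLM.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"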
Looks like the model is too big to load inside 1 GPU. Is there a smaller version that is easier to test with?
Mistral-Large-Instruct-2407-IQ1_M.gguf (1-bit) is already the smallest.
It seems the model has been loaded onto the GPU successfully:
INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB
Perhaps it's related to problematic model forwarding due to the GGUF config extraction instead (maybe caused by a kernel call such as rotary_embeddings or page_attention).
Oh, it's because the GGUF kernel we ported is out of date and doesn't include the IQ1_M implementation. I will add it back.
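To double-check which quantization types a GGUF file actually contains (and therefore whether IQ1_M kernel support is needed), a quick sketch using the gguf Python package (an assumption; this is not part of the original report):

# Sketch: list the quantization types used by the tensors in the GGUF file.
# Assumes the `gguf` package from the llama.cpp project is installed (pip install gguf).
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf")
print(Counter(t.tensor_type.name for t in reader.tensors))
# Expect IQ1_M entries alongside F32/F16 for norms and embeddings.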
Your current environment
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"  # 4090*4

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import time
import uvicorn
from fastapi import FastAPI, Body
from pydantic import BaseModel
import asyncio

apps = FastAPI()

path = "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
sampling_params = SamplingParams(temperature=1.0, repetition_penalty=1.0, max_tokens=512)

# Create an LLM.
llm = LLM(
    model=path,
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    tensor_parallel_size=4,
    enforce_eager=True,
    disable_custom_all_reduce=True,
)
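The excerpt stops after constructing the LLM; a hypothetical generate call using the sampling_params defined above (not the reporter's actual FastAPI endpoint) would look like the sketch below. Note that, per the log that follows, the crash appears to happen earlier, during LLM(...) initialization right after weight loading, before any request is served.

# Hypothetical usage of the objects defined above, for illustration only.
outputs = llm.generate(["Hello, how are you?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)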
🐛 Describe the bug
WARNING 09-10 15:07:35 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-10 15:07:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fed8f24fee0>, local_subscribe_port=38065, remote_subscribe_port=None)
INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB
/root/miniconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Segmentation fault (core dumped)