vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Segmentation fault (core dumped) #8321

Closed · LIUKAI0815 closed this issue 1 month ago

LIUKAI0815 commented 1 month ago

Your current environment

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"  # 4x RTX 4090

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import time
import uvicorn
from fastapi import FastAPI, Body
from pydantic import BaseModel
import asyncio

apps = FastAPI()

path = "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
sampling_params = SamplingParams(temperature=1.0, repetition_penalty=1.0, max_tokens=512)

# Create an LLM.
llm = LLM(
    model=path,
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    tensor_parallel_size=4,
    enforce_eager=True,
    disable_custom_all_reduce=True,
)
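
For context, the script above stops right after constructing the engine. A minimal sketch of how it might be exercised through FastAPI follows; the /generate route, request schema, and port are illustrative assumptions, not part of the original report:

# Illustrative only: route name, request schema, and port are assumptions.
class GenerateRequest(BaseModel):
    prompt: str

@apps.post("/generate")
def generate(req: GenerateRequest):
    # One synchronous generation using the sampling params defined above.
    outputs = llm.generate([req.prompt], sampling_params)
    return {"text": outputs[0].outputs[0].text}

if __name__ == "__main__":
    uvicorn.run(apps, host="0.0.0.0", port=8000)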

🐛 Describe the bug

WARNING 09-10 15:07:35 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-10 15:07:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fed8f24fee0>, local_subscribe_port=38065, remote_subscribe_port=None)
INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB
/root/miniconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
段错误 (核心已转储)  [Segmentation fault (core dumped)]


LIUKAI0815 commented 1 month ago

vllm 0.6.0

DarkLight1337 commented 1 month ago

Does this segmentation fault occur when disabling tensor parallel?
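
For reference, a minimal sketch of the same setup without tensor parallelism (GPU index, paths, and memory settings are copied from the report above; whether the model actually fits on a single 24 GiB card is not guaranteed):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"  # expose a single RTX 4090

from vllm import LLM

# Same GGUF checkpoint, but with tensor_parallel_size=1 (the default) to rule out TP-related crashes.
llm = LLM(
    model="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf",
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    enforce_eager=True,
)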

DarkLight1337 commented 1 month ago

cc @Isotr0py since it may be related to GGUF loading

LIUKAI0815 commented 1 month ago

Does this segmentation fault occur when disabling tensor parallel?

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 13.75 MiB is free. Process 2509972 has 23.63 GiB memory in use. Of the allocated memory 23.05 GiB is allocated by PyTorch, and 145.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
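
For reference, the message above suggests trying expandable segments; a minimal sketch of setting it before the first CUDA allocation (this only mitigates fragmentation and cannot help if the weights plus KV cache genuinely exceed a single 24 GiB card):

import os

# Must be set before torch/vLLM make their first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM  # imported only after the allocator config is in place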

DarkLight1337 commented 1 month ago

Looks like the model is too big to load on a single GPU. Is there a smaller version that is easier to test with?
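
To isolate the IQ1_M loading path from memory pressure, a sketch of testing with a smaller GGUF checkpoint (the path below is a placeholder, not a file from the original report; depending on the vLLM version a separate HF tokenizer directory may still be needed, as in the original script):

from vllm import LLM, SamplingParams

# Placeholder path: any small checkpoint quantized with the same IQ1_M scheme would exercise the same kernels.
small_gguf = "/workspace/model/llm/some-small-model-IQ1_M.gguf"

llm = LLM(model=small_gguf, tensor_parallel_size=1, enforce_eager=True)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)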

LIUKAI0815 commented 1 month ago

Mistral-Large-Instruct-2407-IQ1_M.gguf is the 1-bit quant; it's already the smallest version available.

Isotr0py commented 1 month ago

It seems that the model has been loaded onto the GPUs successfully:

INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB

Perhaps it's instead related to a problematic model forward pass caused by the GGUF config extraction (maybe triggered by a kernel call such as rotary_embeddings or page_attention).

Isotr0py commented 1 month ago

Oh, it's because the GGUF kernel we ported is out of date and doesn't include the IQ1_M implementation. I will add it back.
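
For anyone hitting this before the fix lands, one way to confirm which quantization types a GGUF file actually contains is the gguf Python package published by the llama.cpp project (a sketch; attribute names may vary slightly across package versions):

# pip install gguf
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader(
    "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
)

# Count tensors per quantization type; any IQ1_M entries require IQ1_M support in vLLM's ported GGUF kernels.
type_counts = Counter(t.tensor_type.name for t in reader.tensors)
print(type_counts)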