vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Compiling FSM index high memory && subprocess OOM #7332

Open wciq1208 opened 3 months ago

wciq1208 commented 3 months ago

Your current environment

The output of `python collect_env.py`

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
Stepping: 4
BogoMIPS: 4599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke md_clear spec_ctrl intel_stibp arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 64 MiB (16 instances)
L3 cache: 64 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS (kernel), IBPB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] intel-extension-for-transformers==1.4.2
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.12.1
[pip3] pyzmq==26.0.3
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.0
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.3
[pip3] triton==3.0.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] intel-extension-for-transformers 1.4.2 pypi_0 pypi
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py311h5eee18b_1
[conda] mkl_fft 1.3.8 py311h5eee18b_0
[conda] mkl_random 1.2.4 py311hdb19cb5_0
[conda] numpy 1.26.4 py311h08b1b3b_0
[conda] numpy-base 1.26.4 py311hf175353_0
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] optree 0.12.1 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pyzmq 26.0.3 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchaudio 2.4.0 py311_cu121 pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.43.3 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     PHB   0-15          0              N/A
GPU1  PHB   X     0-15          0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

server

python -m vllm.entrypoints.openai.api_server --model /model/Qwen1.5-14B-Chat-GPTQ-Int4 --quantization gptq --max-model-len 12720

client

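from typing import List

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, conint

class ClassificationItem(BaseModel):
    # Field titles in English: "category name" and "risk level"
    name: str = Field(max_length=20, title="分类名")
    risk_level: conint(ge=0, lt=8) = Field(title="风险等级")

class ClassificationSet(BaseModel):
    # Title in English: "list of category names"; at least 100 items are required
    classification_list: List[ClassificationItem] = Field(min_items=100, title="分类名的列表")

openai_client = OpenAI(
    base_url="http://192.168.91.25:8000/v1",
    api_key="EMPTY",
)
# Wrap the OpenAI client so that `response_model` is sent as a tool schema
client = instructor.from_openai(openai_client)

# Prompt in English: "You are a data security operations expert. We are a company
# in the legal industry, a law firm. We answer clients' legal inquiries and defend
# clients in court. Our firm holds many confidential types of files and documents.
# Please list these 'types' for me, giving only each type name and its risk level,
# and output nothing other than JSON."
resp = client.chat.completions.create(
    model="/model/Qwen1.5-14B-Chat-GPTQ-Int4",
    messages=[{"role": "user",
               "content": "你是一名数据安全运营专家,我是一个法律行业的公司,是一家律师事务所,我们公司负责响应客户的法律咨询、帮客户在法庭上辩护,我们公司里有很多机密类型的文件或者文档,请你为我列举一下这些'类型',只需要给出类型名和该类型对应的风险等级,不需要输出json以外的内容"}],
    response_model=ClassificationSet,
)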

payload

curl http://192.168.91.25:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "messages": [
        {
            "role": "user",
            "content": "你是一名数据安全运营专家,我是一个法律行业的公司,是一家律师事务所,我们公司负责响应客户的法律咨询、帮客户在法庭上辩护,我们公司里有很多机密类型的文件或者文档,请你为我列举一下这些'类型',只需要给出类型名和该类型对应的风险等级,不需要输出json以外的内容"
        }
    ],
    "model": "/model/Qwen1.5-14B-Chat-GPTQ-Int4",
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "ClassificationSet",
                "description": "Correctly extracted `ClassificationSet` with all the required parameters with correct types",
                "parameters": {
                    "$defs": {
                        "ClassificationItem": {
                            "properties": {
                                "name": {
                                    "maxLength": 20,
                                    "title": "分类名",
                                    "type": "string"
                                },
                                "risk_level": {
                                    "exclusiveMaximum": 8,
                                    "minimum": 0,
                                    "title": "风险等级",
                                    "type": "integer"
                                }
                            },
                            "required": [
                                "name",
                                "risk_level"
                            ],
                            "title": "ClassificationItem",
                            "type": "object"
                        }
                    },
                    "properties": {
                        "classification_list": {
                            "items": {
                                "$ref": "#/$defs/ClassificationItem"
                            },
                            "minItems": 100,
                            "title": "分类名的列表",
                            "type": "array"
                        }
                    },
                    "required": [
                        "classification_list"
                    ],
                    "type": "object"
                }
            }
        }
    ],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "ClassificationSet"
        }
    }
}'
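A possible client-side mitigation, not confirmed by anyone in this thread: the schema above asks for `minItems: 100`, and my assumption is that a large minimum forces the compiled automaton to repeat the item pattern many times, which is what makes the index so large. If that holds, lowering `minItems` in the schema that goes to the server and checking the item count on the client should shrink the compile step. The helpers `cap_min_items` and `has_enough_items` are hypothetical names made up for this sketch, and the schema is a trimmed copy of the payload.

```python
import copy


def cap_min_items(schema, cap):
    """Return a deep copy of `schema` with every `minItems` lowered to at most `cap`."""
    def walk(node):
        if isinstance(node, dict):
            if isinstance(node.get("minItems"), int):
                node["minItems"] = min(node["minItems"], cap)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    relaxed = copy.deepcopy(schema)  # leave the caller's schema untouched
    walk(relaxed)
    return relaxed


def has_enough_items(result, key, required):
    """Client-side check that replaces the relaxed `minItems` constraint."""
    return len(result.get(key, [])) >= required


# Trimmed copy of the "parameters" object from the payload above.
schema = {
    "properties": {
        "classification_list": {
            "items": {"$ref": "#/$defs/ClassificationItem"},
            "minItems": 100,
            "type": "array",
        }
    },
    "required": ["classification_list"],
    "type": "object",
}

relaxed = cap_min_items(schema, cap=1)
print(relaxed["properties"]["classification_list"]["minItems"])  # 1
print(schema["properties"]["classification_list"]["minItems"])   # 100
```

The relaxed schema would be sent as the tool `parameters`, and `has_enough_items(parsed_response, "classification_list", 100)` would decide whether to retry. The trade-off is that the server no longer guarantees the minimum length.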


output

.....
Compiling FSM index for all state transitions: 100%|█████████████████████████████████████████████████████████▋| 15753/15831 [15:05<00:04, 16.41it/s]
INFO 08-09 03:02:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Compiling FSM index for all state transitions: 100%|██████████████████████████████████████████████████████████| 15831/15831 [15:10<00:00, 17.39it/s]
INFO 08-09 03:02:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:02:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:02:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:03:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:03:28 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:03:48 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:03:58 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 03:04:03 logger.py:36] Received request chat-1048cec6489b46c4902ff692c76094a6: prompt: '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\n你是一名数据安全运营专家,我是一个法律行业的公司,是一家律师事务所,我们公司负责响应客户的法律咨询、帮客户在法庭上辩护,我们公司里有很多机密类型的文件或者文档,请你为我列举一下这些类型,只需要给出类型名和该类型对应的风险等级,不需要输出json以外的内容<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=12635, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 151645, 198, 151644, 872, 198, 56568, 110124, 20074, 99464, 101087, 101057, 3837, 35946, 101909, 100376, 104586, 73218, 3837, 105783, 110178, 31838, 3837, 97639, 73218, 100668, 102808, 107069, 100376, 100703, 5373, 99663, 100017, 18493, 108943, 17447, 114051, 3837, 97639, 73218, 69249, 101194, 32648, 27641, 109963, 26898, 100631, 111116, 37945, 56568, 17714, 35946, 118569, 100158, 100001, 31905, 3837, 107525, 107485, 31905, 13072, 33108, 75882, 31905, 103124, 106066, 104408, 3837, 104689, 66017, 2236, 105175, 104597, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO 08-09 03:04:18 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
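To put a number on the compile cost, the final progress line in the log can be parsed directly. This is only a small sketch: `parse_progress` is a hypothetical helper, and it assumes the elapsed time is printed as `MM:SS`, as it is in the log above (longer runs may print hours as well).

```python
import re

# Tail of the final progress line from the log above.
line = "15831/15831 [15:10<00:00, 17.39it/s]"


def parse_progress(text):
    """Extract (done, total, elapsed_seconds) from a tqdm-style progress line."""
    match = re.search(r"(\d+)/(\d+) \[(\d+):(\d+)<", text)
    done, total, minutes, seconds = map(int, match.groups())
    return done, total, minutes * 60 + seconds


done, total, elapsed = parse_progress(line)
mean_rate = done / elapsed  # average states per second over the whole run
print(done, total, elapsed, round(mean_rate, 1))  # 15831 15831 910 17.4
```

So a single request with this schema spent about 910 seconds compiling 15831 state transitions before any token was generated.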

How can I skip compiling the FSM index?

viai957 commented 1 month ago

I am facing the same issue. Compiling the FSM index increases the response time too much. How do I turn it off in the serve command? I believe FSM index compilation was introduced in v0.6.2. Is there a command-line argument I could add to the serve image to disable it?
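One thing that may be worth trying, with the caveat that nobody in this thread has confirmed it: as far as I know, the "Compiling FSM index" message comes from the `outlines` guided-decoding backend, and vLLM's OpenAI-compatible server can be pointed at a different backend, either per request through a `guided_decoding_backend` field or server-wide through `--guided-decoding-backend`. Whether your installed version accepts the field, and whether `lm-format-enforcer` is an available value, are assumptions to check against the docs for that version. The sketch below only builds the request body; `with_guided_backend` is a hypothetical helper and the message content is a placeholder.

```python
import json


def with_guided_backend(payload, backend="lm-format-enforcer"):
    """Return a copy of `payload` that asks for a specific guided-decoding backend.

    Assumption: the server accepts a per-request `guided_decoding_backend` field.
    """
    request = dict(payload)  # shallow copy so the original payload is not modified
    request["guided_decoding_backend"] = backend
    return request


payload = {
    "model": "/model/Qwen1.5-14B-Chat-GPTQ-Int4",
    "messages": [{"role": "user", "content": "..."}],  # placeholder prompt
}

body = json.dumps(with_guided_backend(payload), ensure_ascii=False)
print(body)
```

This does not remove structured decoding. It only swaps the component that enforces the schema, so output quality and speed with the other backend would need to be measured separately.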

zf123zf commented 3 weeks ago

Have you solved this problem? I have encountered this too.