vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: INFO 07-19 10:17:50 async_llm_engine.py:167] Aborted request 30d35945-d526-40bc-90c6-40ad73e639b9. INFO 07-19 10:17:50 async_llm_engine.py:49] Engine is gracefully shutting down. #6572

Open Adevils opened 3 months ago

Adevils commented 3 months ago

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1068-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
  MIG 3g.40gb  Device 0:
  MIG 3g.40gb  Device 1:

Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7V13 64-Core Processor
Stepping: 1
CPU MHz: 2445.437
BogoMIPS: 4890.87
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 12 MiB
L3 cache: 96 MiB
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

### 🐛 Describe the bug

import asyncio
from uuid import uuid4
import os
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

device_id = "MIG-0f67f02c-98bf-5250-9b0d-530252d6817f"

os.environ["CUDA_VISIBLE_DEVICES"] = device_id


def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="google/gemma-2b",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.2,
            max_model_len=1024,
            dtype="bfloat16",
        )
    )

    async def run_query(query: str):
        params = SamplingParams(
            top_k=10,
            temperature=0.01,
            repetition_penalty=1.10,
            max_tokens=150,
            top_p=0.9,
        )
        request_id = uuid4()
        outputs = engine.generate(query, params, request_id)

        async for request_output in outputs:
            final_output = request_output
            return final_output

    async def process():
        queries = [
            "What is 5+3?",
        ]
        tasks = [asyncio.create_task(run_query(q)) for q in queries]
        results = []
        for task in asyncio.as_completed(tasks):
            result = await task
            results.append(result)
        return results

    results = asyncio.run(process())
    print(results)


if __name__ == "__main__":
    main()

Below is the Terminal Output when the code is run:
INFO 07-19 10:36:52 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='google/gemma-2b', speculative_config=None, tokenizer='google/gemma-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=google/gemma-2b, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-19 10:36:53 gemma.py:56] Gemma's activation function was incorrectly set to exact GeLU in the config JSON file when it was initially released. Changing the activation function to approximate GeLU (gelu_pytorch_tanh). If you want to use the legacy gelu, edit the config JSON to set hidden_activation=gelu instead of hidden_act. See https://github.com/huggingface/transformers/pull/29402 for more details.
INFO 07-19 10:36:53 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-19 10:36:54 model_runner.py:255] Loading model weights took 4.7384 GB
INFO 07-19 10:36:55 gpu_executor.py:84] # GPU blocks: 3834, # CPU blocks: 14563
INFO 07-19 10:36:57 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-19 10:36:57 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 07-19 10:37:02 model_runner.py:1117] Graph capturing finished in 5 secs.
INFO 07-19 10:37:02 async_llm_engine.py:646] Received request a0f2ca29-5058-4ad0-9a17-59ed431ab860: prompt: 'What is 5+3?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.01, top_p=0.9, top_k=10, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=150, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 07-19 10:37:02 async_llm_engine.py:168] Aborted request a0f2ca29-5058-4ad0-9a17-59ed431ab860.
INFO 07-19 10:37:02 async_llm_engine.py:50] Engine is gracefully shutting down.
[RequestOutput(request_id=a0f2ca29-5058-4ad0-9a17-59ed431ab860, prompt='What is 5+3?', prompt_token_ids=[2, 1841, 603, 235248, 235308, 235340, 235304, 235336], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n\n', token_ids=(109,), cumulative_logprob=0.0, logprobs=None, finish_reason=None, stop_reason=None)], finished=False, metrics=RequestMetrics(arrival_time=1721385422.8926418, last_token_time=1721385422.9454794, first_scheduled_time=1721385422.8951688, first_token_time=1721385422.9452248, time_in_queue=0.002526998519897461, finished_time=None), lora_request=None)]

Why am I facing this error? I want to handle concurrent requests using vLLM and FastAPI. Is there any basic source code available?
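For reference, a minimal sketch of the kind of FastAPI wrapper the question describes. The endpoint path, request schema, and port are illustrative choices, not an official vLLM example:

```python
# Sketch: serving concurrent requests with FastAPI on top of AsyncLLMEngine.
# The /generate path, GenerateRequest schema, and port 8000 are placeholders.
from uuid import uuid4

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="google/gemma-2b", max_model_len=1024, dtype="bfloat16")
)


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 150


@app.post("/generate")
async def generate(req: GenerateRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=0.01, top_p=0.9)
    request_id = str(uuid4())
    final_output = None
    # Consume the stream to completion; breaking out of the generator early
    # causes the engine to abort the request.
    async for request_output in engine.generate(req.prompt, params, request_id):
        final_output = request_output
    return {"text": [o.text for o in final_output.outputs]}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```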

youkaichao commented 3 months ago

Can you try starting an API server (https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) and querying it with the OpenAI client? Manually using the async LLM engine can be error-prone.
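A minimal sketch of that workflow, assuming the OpenAI-compatible server is running locally on port 8000 (model name and port are placeholders):

```python
# Sketch: query a locally running vLLM OpenAI-compatible server.
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model google/gemma-2b --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="google/gemma-2b",
    prompt="What is 5+3?",
    max_tokens=150,
    temperature=0.01,
    top_p=0.9,
)
print(completion.choices[0].text)
```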

Adevils commented 3 months ago

MIG 3g.40gb  Device 0:
MIG 3g.40gb  Device 1:

I split the GPU into 2 MIG instances and want to run 2 different models, one per instance. How can I specify which device ID to use when running with Docker?
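One way to do that, as far as I know, is to pass the MIG UUID to the NVIDIA container runtime; a hedged sketch (image tag, model names, ports, and MIG UUIDs are placeholders, and the NVIDIA Container Toolkit is assumed to be installed):

```sh
# Run one container per MIG instance by exposing only that instance to it.
docker run --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=MIG-0f67f02c-98bf-5250-9b0d-530252d6817f \
  -p 8000:8000 vllm/vllm-openai:latest --model google/gemma-2b

docker run --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=MIG-<second-instance-uuid> \
  -p 8001:8000 vllm/vllm-openai:latest --model <second-model>
```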

TangJiakai commented 3 months ago

Same Error

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!