vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: use thread after call multiple times. KeyError: request_id #4706

Open xubzhlin opened 5 months ago

xubzhlin commented 5 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.12 (main, Jul  9 2023, 15:32:42) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] (64-bit runtime)
Python platform: Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-PCIE-32GB
Nvidia driver version: 520.61.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz
Stepping:              7
CPU MHz:               2600.000
BogoMIPS:              5200.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.15.0
[pip3] rapidocr-onnxruntime==1.3.17
[pip3] torch==2.3.0+cu118
[pip3] torchaudio==2.3.0+cu118
[pip3] torchvision==0.18.0+cu118
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity
GPU0     X  0-7     N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Run Stack

Exception in thread Thread-87 (run_async_generator):
Traceback (most recent call last):
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception while serving /v1/knowledgebase/1771002863804219394/streamchat
Traceback (most recent call last):
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/waitress/task.py", line 458, in execute
    for chunk in app_iter:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/flask/helpers.py", line 149, in generator
    yield from gen
  File "/home/admin/llm-chat/server/controller/knowledgebase_controller.py", line 153, in generate
    for response, history, source in knowledgebase_service.stream_chat_knowledgebase(knowledgebase_id, body["query"], body["history"]):
  File "/home/admin/llm-chat/server/application/service/knowledgebase_application_service.py", line 181, in stream_chat_knowledgebase
    for chunk in self.model.stream(messages):
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 249, in stream
    raise e
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 241, in stream
    assert generation is not None
AssertionError
    self.run()
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 175, in run_async_generator
    asyncio.run(async_generator(queue))
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 169, in async_generator
    async for chunk in self._astream(messages, stop, run_manager, **kwargs):
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 217, in _astream
    async for request_output in results_generator:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
    raise e
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 660, in generate
    async for request_output in stream:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
    raise result
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 479, in engine_step
    self._request_tracker.process_request_output(
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 117, in process_request_output
    self._request_streams[request_id].put(request_output)
KeyError: '7999d4a0e29b40b08689d5aa7f7b2aeb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 175, in run_async_generator
    asyncio.run(async_generator(queue))
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 169, in async_generator
    async for chunk in self._astream(messages, stop, run_manager, **kwargs):
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 217, in _astream
    async for request_output in results_generator:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
    raise e
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 650, in generate
    stream = await self.add_request(
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
    self.start_background_loop()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
import asyncio
import uuid
from queue import Queue
from threading import Thread

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Engine and queue are shared across all requests.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model_path))  # model_path is set elsewhere
queue = Queue()

async def async_generator(prompt):
    request_id = uuid.uuid4().hex
    sampling_params = SamplingParams(**sampling_kwargs)  # sampling_kwargs is defined elsewhere
    results_generator = engine.generate(prompt=prompt, sampling_params=sampling_params, request_id=request_id)
    async for request_output in results_generator:
        chunk = request_output.outputs[0].text
        queue.put(chunk)
    queue.put(None)  # sentinel so consume() knows the stream is finished

def run_async_generator(prompt):
    # Each call creates and closes a fresh event loop inside this worker thread.
    asyncio.run(async_generator(prompt))

def consume():
    while True:
        chunk = queue.get(timeout=60)
        queue.task_done()
        if chunk is None:
            return
        yield chunk

def generator(prompt):
    thread = Thread(target=run_async_generator, args=(prompt,))
    thread.start()
    for chunk in consume():
        print(chunk)
When I call generator multiple times from Flask, this bug appears, and the later errors keep reporting the same request_id.
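A likely contributing factor is that asyncio.run() creates and then closes a new event loop on every call, while AsyncLLMEngine's background loop stays bound to the loop it first started on; once that loop is gone, later requests can hit AsyncEngineDeadError or a stale request stream (the KeyError above). Below is a minimal workaround sketch, assuming the async_generator from the snippet above; the names engine_loop and submit_prompt are illustrative helpers, not vLLM API.

import asyncio
import threading

# Keep one long-lived event loop for the engine instead of calling asyncio.run()
# per request, so the engine's background loop is never tied to a loop that
# gets closed when a request finishes.
engine_loop = asyncio.new_event_loop()
threading.Thread(target=engine_loop.run_forever, daemon=True).start()

def submit_prompt(prompt):
    # Schedule async_generator (from the snippet above) onto the shared loop
    # from the Flask worker thread; the returned future re-raises any engine
    # exception when future.result() is called.
    return asyncio.run_coroutine_threadsafe(async_generator(prompt), engine_loop)

With this pattern every generate() call runs on the same loop as the engine's background task, which should avoid the dead-loop/KeyError sequence on the second request; treat it as a sketch rather than a confirmed fix.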

Elissa0723 commented 4 months ago

I had the same problem...

Traceback (most recent call last):
  File "./swift/demo_server_vllm_xyf.py", line 106, in get_all_component_res
    async for request_output in results_generator:
  File "./vllm/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "./vllm/vllm/engine/async_llm_engine.py", line 780, in _process_request
    raise e
  File "./vllm/vllm/engine/async_llm_engine.py", line 776, in _process_request
    async for request_output in stream:
  File "./vllm/vllm/engine/async_llm_engine.py", line 89, in __anext__
    raise result
  File "./vllm/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "./vllm/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/opt/conda/envs/infer/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "./vllm/vllm/engine/async_llm_engine.py", line 510, in engine_step
    self._request_tracker.process_request_output(
  File "./vllm/vllm/engine/async_llm_engine.py", line 130, in process_request_output
    self._request_streams[request_id].put(request_output)
KeyError: 'cc2580f508eb473285a9e1bb47a6714f'