vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: KeyError: request_id after calling generate from a thread multiple times #4706

Open xubzhlin opened 6 months ago

xubzhlin commented 6 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.12 (main, Jul  9 2023, 15:32:42) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] (64-bit runtime)
Python platform: Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-PCIE-32GB
Nvidia driver version: 520.61.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz
Stepping:              7
CPU MHz:               2600.000
BogoMIPS:              5200.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.15.0
[pip3] rapidocr-onnxruntime==1.3.17
[pip3] torch==2.3.0+cu118
[pip3] torchaudio==2.3.0+cu118
[pip3] torchvision==0.18.0+cu118
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity
GPU0     X  0-7     N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Run Stack

Exception while serving /v1/knowledgebase/1771002863804219394/streamchat
Traceback (most recent call last):
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/waitress/task.py", line 458, in execute
    for chunk in app_iter:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/flask/helpers.py", line 149, in generator
    yield from gen
  File "/home/admin/llm-chat/server/controller/knowledgebase_controller.py", line 153, in generate
    for response, history, source in knowledgebase_service.stream_chat_knowledgebase(knowledgebase_id, body["query"], body["history"]):
  File "/home/admin/llm-chat/server/application/service/knowledgebase_application_service.py", line 181, in stream_chat_knowledgebase
    for chunk in self.model.stream(messages):
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 249, in stream
    raise e
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 241, in stream
    assert generation is not None
AssertionError

Exception in thread Thread-87 (run_async_generator):
Traceback (most recent call last):
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 175, in run_async_generator
    asyncio.run(async_generator(queue))
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 169, in async_generator
    async for chunk in self._astream(messages, stop, run_manager, **kwargs):
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 217, in _astream
    async for request_output in results_generator:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
    raise e
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 660, in generate
    async for request_output in stream:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
    raise result
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 479, in engine_step
    self._request_tracker.process_request_output(
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 117, in process_request_output
    self._request_streams[request_id].put(request_output)
KeyError: '7999d4a0e29b40b08689d5aa7f7b2aeb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/python-3.10.12/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 175, in run_async_generator
    asyncio.run(async_generator(queue))
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/python-3.10.12/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 169, in async_generator
    async for chunk in self._astream(messages, stop, run_manager, **kwargs):
  File "/home/admin/llm-chat/server/infrastructure/toolkit/llm/model/llama3_vllm_chat_model.py", line 217, in _astream
    async for request_output in results_generator:
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
    raise e
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 650, in generate
    stream = await self.add_request(
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
    self.start_background_loop()
  File "/usr/local/python-3.10.12/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

import asyncio
import uuid
from queue import Queue
from threading import Thread

from vllm import AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(engine_args)  # engine_args omitted in the original snippet
queue = Queue()

async def async_generator(prompt):
    request_id = str(uuid.uuid4().hex)
    sampling_params = SamplingParams(**sampling_kwargs)  # sampling_kwargs defined elsewhere
    results_generator = engine.generate(prompt=prompt, sampling_params=sampling_params, request_id=request_id)
    async for request_output in results_generator:
        chunk = request_output.outputs[0].text
        queue.put(chunk)
    queue.put(None)  # sentinel so consume() can terminate

def run_async_generator(prompt):
    asyncio.run(async_generator(prompt))

def consume():
    while True:
        chunk = queue.get(timeout=60)
        queue.task_done()
        if chunk is None:
            return
        yield chunk

def generator(prompt):
    thread = Thread(target=run_async_generator, args=(prompt,))
    thread.start()
    for chunk in consume():
        print(chunk)
This bug appears when I call generator multiple times from Flask. After that, the request_id reported in the error is always the same.
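
It looks like each call starts a fresh event loop via asyncio.run() in a new thread, while the single AsyncLLMEngine's background loop stays tied to whichever loop first touched it; once that loop goes away, later requests can hit the KeyError and AsyncEngineDeadError above. A possible client-side workaround (a minimal sketch, not verified against this exact setup) is to keep one long-lived event loop for the engine and submit every request to it with asyncio.run_coroutine_threadsafe. The sketch assumes vLLM 0.4.x; the helper names (ENGINE_LOOP, stream_to_queue, submit_prompt) and the model path are placeholders, not vLLM API.

import asyncio
import threading
import uuid
from queue import Queue

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# One event loop for the whole process; every generate() call is scheduled on it,
# so the engine's background loop is always bound to a loop that stays alive.
ENGINE_LOOP = asyncio.new_event_loop()
threading.Thread(target=ENGINE_LOOP.run_forever, daemon=True).start()

# Placeholder model path; replace with the model actually being served.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="your-model-path"))

async def stream_to_queue(prompt, out):
    # Run one generation request and push text chunks onto a per-request queue.
    request_id = uuid.uuid4().hex
    sampling_params = SamplingParams()
    async for request_output in engine.generate(prompt=prompt, sampling_params=sampling_params, request_id=request_id):
        out.put(request_output.outputs[0].text)
    out.put(None)  # sentinel: generation finished

def submit_prompt(prompt):
    # Safe to call from any Flask worker thread: schedules the coroutine on the shared loop.
    out = Queue()
    asyncio.run_coroutine_threadsafe(stream_to_queue(prompt, out), ENGINE_LOOP)
    return out

def generator(prompt):
    out = submit_prompt(prompt)
    while True:
        chunk = out.get(timeout=60)
        if chunk is None:
            return
        yield chunk

Alternatively, the OpenAI-compatible API server that ships with vLLM manages the engine's event loop itself and can be called over HTTP from the Flask app.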

Elissa0723 commented 5 months ago

I had the same problem...

Traceback (most recent call last):
  File "./swift/demo_server_vllm_xyf.py", line 106, in get_all_component_res
    async for request_output in results_generator:
  File "./vllm/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "./vllm/vllm/engine/async_llm_engine.py", line 780, in _process_request
    raise e
  File "./vllm/vllm/engine/async_llm_engine.py", line 776, in _process_request
    async for request_output in stream:
  File "./vllm/vllm/engine/async_llm_engine.py", line 89, in __anext__
    raise result
  File "./vllm/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "./vllm/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/opt/conda/envs/infer/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "./vllm/vllm/engine/async_llm_engine.py", line 510, in engine_step
    self._request_tracker.process_request_output(
  File "./vllm/vllm/engine/async_llm_engine.py", line 130, in process_request_output
    self._request_streams[request_id].put(request_output)
KeyError: 'cc2580f508eb473285a9e1bb47a6714f'

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!