vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: In vLLM v0.4.3 and later, calling list_loras() in a tensor parallelism situation causes the system to hang. #5531

Open ashgold opened 3 weeks ago

ashgold commented 3 weeks ago

Your current environment

The output of `python collect_env.py`

root@vllm-0-4-3-predictor-00001-deployment-54797fd955-p7b8g:/vllm-workspace# python3 collect_env.py
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   45 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8468
CPU family:                      6
Model:                           143
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
Stepping:                        8
BogoMIPS:                        4199.99
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear flush_l1d arch_capabilities
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       6 MiB (128 instances)
L1i cache:                       4 MiB (128 instances)
L2 cache:                        256 MiB (128 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-63
NUMA node1 CPU(s):               64-127
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    0-127   0-1             N/A
GPU1    NV12     X      0-127   0-1             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

In vLLM v0.4.3 and later, calling list_loras() in a tensor parallelism situation causes the system to hang.

Working from vLLM v0.4.3, I modified the code so I could see which multi-LoRA adapters are currently loaded on the CPU/GPU.

As shown below, I simply made a call to self.list_loras() inside the do_log_stats() method of vllm/engine/llm_engine.py.

    def do_log_stats(
            self,
            scheduler_outputs: Optional[SchedulerOutputs] = None,
            model_output: Optional[List[SamplerOutput]] = None) -> None:
        """Forced log when no requests active."""
        if self.log_stats:
            logger.info(f"self.list_loras(): {self.list_loras()}")
            self.stat_logger.log(
                self._get_stats(scheduler_outputs, model_output))

I ran the framework through the OpenAI entrypoint, and do_log_stats() works fine as long as no LLM inference is in progress. However, the moment I call the /v1/completions API, it gets stuck inside list_loras(), and the /v1/completions API never returns a response. After 30 minutes in this state, the following error message is returned.


(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 222, in _run_worker_process
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     while self._execute_model_non_driver():
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]     work.wait()
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]
ERROR 06-13 01:30:49 async_llm_engine.py:524] Engine iteration timed out. This should never happen!
ERROR 06-13 01:30:49 async_llm_engine.py:45] Engine background task failed
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
ERROR 06-13 01:30:49 async_llm_engine.py:45]     request_outputs = await self.engine.step_async()
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
ERROR 06-13 01:30:49 async_llm_engine.py:45]     output = await self.model_executor.execute_model_async(
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
ERROR 06-13 01:30:49 async_llm_engine.py:45]     return await self._driver_execute_model_async(execute_model_req)
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
ERROR 06-13 01:30:49 async_llm_engine.py:45]     return await self.driver_exec_model(execute_model_req)
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.CancelledError
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] During handling of the above exception, another exception occurred:
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 06-13 01:30:49 async_llm_engine.py:45]     return fut.result()
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.CancelledError
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] The above exception was the direct cause of the following exception:
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
ERROR 06-13 01:30:49 async_llm_engine.py:45]     task.result()
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
ERROR 06-13 01:30:49 async_llm_engine.py:45]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-13 01:30:49 async_llm_engine.py:45]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 06-13 01:30:49 async_llm_engine.py:45]     raise exceptions.TimeoutError() from exc
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f351ca30ee0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.russianblue_async_llm_engine.RussianBlueAsyncLLMEngine object at 0x7f3510194100>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f351ca30ee0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.russianblue_async_llm_engine.RussianBlueAsyncLLMEngine object at 0x7f3510194100>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO:     172.21.232.226:37556 - "GET /metrics HTTP/1.1" 200 OK
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 166, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 244, in consumer
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 235, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 219, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

If I add --disable-log-stats to the launch arguments, do_log_stats() is not called and the /v1/completions API responds normally.

In v0.4.2, the list_loras() method was called correctly, but since v0.4.3 the following scheduling improvements have been made, and that change seems to be the problem.

I'm also curious why that PR causes an issue with the list_loras() call.

njhill commented 3 weeks ago

@ashgold this is a deadlock and is expected given the way the distributed task scheduling is currently implemented.

I would not classify it as a bug since you've modified code internal to vLLM. You can't call these lora methods from within the model execution loop, which is what is happening here. I'd suggest just calling list_loras() as needed from a separate external thread.
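
For example, something along these lines might work (an untested sketch; the function and parameter names are just for illustration, and `llm_engine` stands for the underlying synchronous engine rather than the async wrapper):

import logging
import threading
import time

logger = logging.getLogger(__name__)


def poll_loras(llm_engine, interval_s: float = 10.0) -> None:
    # Periodically log the LoRA IDs currently registered with the engine.
    # list_loras() is essentially a blocking call in distributed mode, but a
    # plain daemon thread keeps it off the asyncio event loop.
    while True:
        try:
            logger.info("Registered LoRA IDs: %s", llm_engine.list_loras())
        except Exception:
            logger.exception("list_loras() poll failed")
        time.sleep(interval_s)


# threading.Thread(target=poll_loras, args=(llm_engine,), daemon=True).start()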

ashgold commented 2 weeks ago

@njhill I tried isolating the list_loras() call so that it is only invoked via FastAPI's BackgroundTasks, but the result was the same. If calling distributed methods like add_lora and remove_lora from a separate thread causes a deadlock, I think this is a potential bug.
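
For reference, what I tried was roughly the following (simplified; `engine` stands in for the server's AsyncLLMEngine and the endpoint name is made up):

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
engine = None  # placeholder for the server's AsyncLLMEngine


async def log_loras() -> None:
    # NOTE: because this task function is a coroutine, Starlette schedules it
    # on the same event loop as the request handlers rather than in a thread
    # pool, so the blocking list_loras() call still runs on the engine's loop.
    if engine is not None:
        print("Registered LoRA IDs:", engine.engine.list_loras())


@app.get("/debug/loras")
async def debug_loras(background_tasks: BackgroundTasks) -> dict:
    background_tasks.add_task(log_loras)
    return {"status": "scheduled"}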

njhill commented 2 weeks ago

@ashgold the trouble is that in distributed mode these LoRA methods are currently essentially blocking functions, so they shouldn't be called directly from an async context. It sounds like that's what you're doing here, and if so it's not actually a separate thread.

It's not immediately obvious to me that list_loras in particular needs to call out to all workers; I'll look into that.

In the meantime, you could try just calling these methods via run_in_executor.
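
For example (an untested sketch; the `engine.engine` attribute access and the helper name are illustrative):

import asyncio


async def list_loras_async(engine):
    # Push the blocking list_loras() call onto the default thread-pool
    # executor so the event loop that drives the engine is not blocked while
    # the distributed call completes.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, engine.engine.list_loras)


# Usage from an async handler:
#     lora_ids = await list_loras_async(engine)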

JaheimLee commented 2 weeks ago

> @ashgold the trouble is that in distributed mode these LoRA methods are currently essentially blocking functions, so they shouldn't be called directly from an async context. It sounds like that's what you're doing here, and if so it's not actually a separate thread.
>
> It's not immediately obvious to me that list_loras in particular needs to call out to all workers; I'll look into that.
>
> In the meantime, you could try just calling these methods via run_in_executor.

We are seeing a similar timeout error here.