vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Server gets stuck during startup, last logline is 'Using XFormers backend' #4974

Closed thundergolfer closed 4 months ago

thundergolfer commented 5 months ago

Your current environment

root@9206753dc19c:/# python3 collect_env.py
Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.29.3
Libc version: glibc-2.36

Python version: 3.11.9 (main, May 14 2024, 08:15:15) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-4.4.0-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Vendor ID:           GenuineIntel
Model name:          unknown
CPU family:          6
Model:               85
Thread(s) per core:  0
Core(s) per socket:  0
Socket(s):           0
Stepping:            unknown
BogoMIPS:            2200.15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni md_clear arch_capabilities
Hypervisor vendor:   KVM
Virtualization type: full

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

vLLM gets stuck during startup, and according to nvidia-smi this happens before it writes anything to the GPU. I have uploaded the trace file, which records up to around 2024-05-22 09:11:22; at that point the trace shows it apparently stuck in sympy code. When I tailed the file 10 minutes later it appeared stuck in torch/_dynamo/allowed_functions.py:322:

2024-05-22 09:21:11.606202 Return from <genexpr> in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322 to _is_allowed_module_prefix in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322
2024-05-22 09:21:11.607248 Call to <genexpr> in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322 from _is_allowed_module_prefix in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322
2024-05-22 09:21:11.608331 Return from <genexpr> in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322 to _is_allowed_module_prefix in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322
2024-05-22 09:21:11.610561 Call to <genexpr> in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322 from _is_allowed_module_prefix in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322
2024-05-22 09:21:11.611751 Return from <genexpr> in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322 to _is_allowed_module_prefix in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322
2024-05-22 09:21:11.612805 Call to <genexpr> in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322 from _is_allowed_module_prefix in /usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py:322
...
...
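For context on why this trace grows so fast: VLLM_TRACE_FUNCTION installs a Python trace hook (see vllm/logger.py's _trace_calls in the traceback further down) that appends one line to the log for every call and return. A minimal sketch of such a hook, assuming a hypothetical log path and simplified formatting, not vLLM's actual implementation:

import datetime
import sys

LOG_PATH = "/tmp/trace_example.log"  # hypothetical path, not vLLM's actual log location

def _trace_calls(frame, event, arg):
    # Each call/return event re-opens the log file in append mode and writes a line;
    # doing this for every Python call is what makes tracing slow startup so dramatically.
    if event in ("call", "return"):
        code = frame.f_code
        with open(LOG_PATH, "a") as f:
            f.write(f"{datetime.datetime.now()} {event} {code.co_name} "
                    f"in {code.co_filename}:{frame.f_lineno}\n")
    return _trace_calls

sys.settrace(_trace_calls)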

Logs from Docker:

[modal@gcp-a100-80gb-spot-europe-west4-a-0-3f055e3b-e040-4f4f-83c4-636 ~]$ sudo docker run --runtime=runsc-2 --shm-size=1000GB -e HF_TOKEN=hf_redacted --gpus '"device=GPU-1bcd0f02-eb2e-fb5e-6800-a823935cf73d,GPU-f2708712-36c6-0ebe-d2c5-4bac3606be81"' -p 8000:8000 sha256:85d6c108267be17ec5ae237a161d280fbff23171013a0775c204b3fe162a2875
INFO 05-22 09:08:32 api_server.py:151] vLLM API server version 0.4.1
INFO 05-22 09:08:32 api_server.py:152] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-05-22 09:08:37,148 INFO worker.py:1749 -- Started a local Ray instance.
INFO 05-22 09:09:13 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-22 09:09:45 logger.py:125] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 05-22 09:09:45 logger.py:129] Trace frame log is saved to /tmp/vllm/vllm-instance-76f923d42448455998c3863a1a69717a/VLLM_TRACE_FUNCTION_for_process_1_thread_139310113790848_at_2024-05-22_09:09:45.578484.log
INFO 05-22 09:09:45 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=5761) WARNING 05-22 09:09:45 logger.py:125] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(RayWorkerWrapper pid=5761) INFO 05-22 09:09:45 logger.py:129] Trace frame log is saved to /tmp/vllm/vllm-instance-76f923d42448455998c3863a1a69717a/VLLM_TRACE_FUNCTION_for_process_5761_thread_139360384682880_at_2024-05-22_09:09:45.584908.log
(RayWorkerWrapper pid=5761) INFO 05-22 09:09:45 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-22 09:09:59 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 05-22 09:09:59 selector.py:33] Using XFormers backend.
^CTraceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 361, in from_engine_args
    engine = cls(
             ^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 437, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 179, in _init_workers_ray
    self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
  File "/usr/local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 143, in init_worker
    self.worker = worker_class(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker.py", line 72, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 146, in __init__
    self.attn_backend = get_attn_backend(
                        ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/attention/selector.py", line 34, in get_attn_backend
    from vllm.attention.backends.xformers import (  # noqa: F401
  File "/usr/local/lib/python3.11/site-packages/vllm/attention/backends/xformers.py", line 6, in <module>
    from xformers import ops as xops
  File "/usr/local/lib/python3.11/site-packages/xformers/__init__.py", line 12, in <module>
    from .checkpoint import (  # noqa: E402, F401
  File "/usr/local/lib/python3.11/site-packages/xformers/checkpoint.py", line 464, in <module>
    class SelectiveCheckpointWrapper(ActivationWrapper):
  File "/usr/local/lib/python3.11/site-packages/xformers/checkpoint.py", line 481, in SelectiveCheckpointWrapper
    @torch.compiler.disable
     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/compiler/__init__.py", line 93, in disable
    import torch._dynamo
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
    from . import allowed_functions, convert_frame, eval_frame, resume_execution
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 62, in <module>
    from .output_graph import OutputGraph
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 89, in <module>
    from .variables.builder import GraphArg, TrackedFake, VariableBuilder, wrap_fx_proxy
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/builder.py", line 143, in <module>
    from .optimizer import OptimizerVariable
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/optimizer.py", line 5, in <module>
    from ..decorators import mark_static_address
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/decorators.py", line 140, in <module>
    @_disallow_in_graph_helper(throw_if_not_allowed=False)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/decorators.py", line 109, in inner
    allowed_functions._allowed_function_ids.remove(id(fn))
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 88, in remove
    function_ids = self()
                   ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 69, in __call__
    value = self.lazy_initializer()
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 417, in _allowed_function_ids
    return gen_allowed_objs_and_ids().object_ids
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 375, in gen_allowed_objs_and_ids
    _find_torch_objects(torch)
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 363, in _find_torch_objects
    _find_torch_objects(obj)
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 363, in _find_torch_objects
    _find_torch_objects(obj)
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 364, in _find_torch_objects
    elif _is_allowed_module_prefix(obj):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 322, in _is_allowed_module_prefix
    if any(mod_name.startswith(m) for m in disallowed_modules):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/_dynamo/allowed_functions.py", line 322, in <genexpr>
    if any(mod_name.startswith(m) for m in disallowed_modules):
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/vllm/logger.py", line 96, in _trace_calls
    with open(log_path, 'a') as f:
         ^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 214, in setstate
KeyboardInterrupt

Reproduction:

FROM python:3.11-slim-bookworm

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed "vllm==0.4.1" \
    "hf-transfer==0.1.6" \
    "huggingface_hub==0.22.2" \
    "fastapi" \
    "httpx"

ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV VLLM_TRACE_FUNCTION=1
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "meta-llama/Meta-Llama-3-8B-Instruct", "--tensor-parallel-size", "2"]

I'm just hoping for guidance on what could be going wrong here. I'm not familiar with the code and don't have a clue what could cause the startup to get stuck.

thundergolfer commented 5 months ago

Oh I think I got lucky with knob twiddling. Switching to VLLM_WORKER_MULTIPROC_METHOD=fork seems to have fixed it?

FROM python:3.11-slim-bookworm

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed "vllm==0.4.1" \
    "hf-transfer==0.1.6" \
    "huggingface_hub==0.22.2" \
    "fastapi" \
    "httpx"

COPY <<EOF repro.py
import os
EOF

ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV VLLM_TRACE_FUNCTION=0
ENV VLLM_WORKER_MULTIPROC_METHOD=fork

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "meta-llama/Meta-Llama-3-8B-Instruct", "--tensor-parallel-size", "2"]
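For background on why the worker start method could matter (whether it is actually what unstuck things here is not confirmed): with spawn, each child process starts a fresh interpreter and re-executes module-level imports, whereas fork inherits the parent's already-initialized state. A minimal standalone sketch using plain multiprocessing with a simulated heavy import; this is not vLLM's worker code:

import multiprocessing as mp
import os
import time

# Simulated heavy module-level import (stand-in for torch/xformers initialization).
# Under "spawn" the child re-imports this module, so the sleep runs again there;
# under "fork" the child inherits the parent's memory and skips it.
time.sleep(1.0)

def worker(method: str) -> None:
    print(f"[{method}] worker pid={os.getpid()}")

if __name__ == "__main__":
    for method in ("spawn", "fork"):
        ctx = mp.get_context(method)
        t0 = time.time()
        p = ctx.Process(target=worker, args=(method,))
        p.start()
        p.join()
        print(f"[{method}] child ready after {time.time() - t0:.2f}s")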

It may also be that disabling VLLM_TRACE_FUNCTION sped up the startup ~100x and thus 'unstuck' things.

youkaichao commented 5 months ago

VLLM_TRACE_FUNCTION is only used to debug hangs. Why did you turn it on in the first place?

thundergolfer commented 5 months ago

I turned it on only to debug the hang. But it then became a confounder, because it slowed down startup so much that it was harder to distinguish between stuck and merely slow.
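One lighter-weight way to tell "stuck" from "merely slow" without full function tracing is to dump thread stacks on demand, for example with Python's standard-library faulthandler (a generic sketch, not something vLLM sets up for you); an external sampler such as py-spy can take similar snapshots without code changes.

import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process print every thread's stack to
# stderr without stopping it. If repeated dumps show the same frames, it is likely
# a real hang rather than slow progress.
faulthandler.register(signal.SIGUSR1)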

DarkLight1337 commented 4 months ago

We have added documentation for this situation in #5430. Please take a look.