vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vllm constantly crashing with NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":838, please report a bug to PyTorch. #9824

Closed: pseudotensor closed this issue 1 day ago

pseudotensor commented 3 days ago

Your current environment

H100 40GB (using MIG)

Model Input Dumps

No response

🐛 Describe the bug

INFO:     38.32.112.203:56266 - "GET /v1/models HTTP/1.1" 200 OK
INFO 10-29 18:38:36 logger.py:36] Received request chat-9df53213134647568f9172ae453ed0b9: prompt: 'Who are you?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=>
INFO 10-29 18:38:36 async_llm_engine.py:201] Added request chat-9df53213134647568f9172ae453ed0b9.
INFO 10-29 18:38:37 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 91.8%, CPU KV cache usage: 0.0%.
ERROR 10-29 18:38:39 async_llm_engine.py:58] Engine background task failed
ERROR 10-29 18:38:39 async_llm_engine.py:58] Traceback (most recent call last):
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 112, in _wrapper
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return func(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1546, in execute_model
ERROR 10-29 18:38:39 async_llm_engine.py:58]     hidden_or_intermediate_states = model_executable(
ERROR 10-29 18:38:39 async_llm_engine.py:58]                                     ^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return forward_call(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 191, in forward
ERROR 10-29 18:38:39 async_llm_engine.py:58]     hidden_states = self.language_model.model(input_ids,
ERROR 10-29 18:38:39 async_llm_engine.py:58]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return forward_call(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 10-29 18:38:39 async_llm_engine.py:58]     hidden_states, residual = layer(
ERROR 10-29 18:38:39 async_llm_engine.py:58]                               ^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return forward_call(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 261, in forward
ERROR 10-29 18:38:39 async_llm_engine.py:58]     hidden_states = self.mlp(hidden_states)
ERROR 10-29 18:38:39 async_llm_engine.py:58]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return forward_call(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 88, in forward
ERROR 10-29 18:38:39 async_llm_engine.py:58]     x = self.act_fn(gate_up)
ERROR 10-29 18:38:39 async_llm_engine.py:58]         ^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return self._call_impl(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return forward_call(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 14, in forward
ERROR 10-29 18:38:39 async_llm_engine.py:58]     return self._forward_method(*args, **kwargs)
ERROR 10-29 18:38:39 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/activation.py", line 36, in forward_cuda
ERROR 10-29 18:38:39 async_llm_engine.py:58]     out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
ERROR 10-29 18:38:39 async_llm_engine.py:58]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-29 18:38:39 async_llm_engine.py:58] RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":838, please report a bug to PyTorch. 

Fine, bug in torch.

However, vLLM should shut down cleanly rather than hang, so that e.g. docker's --restart=always policy can actually restart the container and recover.
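For reference, a minimal sketch of how the container could be made to recover automatically when the engine dies but the process hangs. It assumes the OpenAI-compatible server answers on /health at the published port, that curl is available inside the image, and it leans on the third-party willfarrell/autoheal watcher; all of those are assumptions on my side, not something verified in this issue.

# Sketch only; see the assumptions above.
# 1) Add a healthcheck and label to the existing pixtral `docker run` command:
#      --health-cmd 'curl -fsS http://localhost:5001/health || exit 1' \
#      --health-interval=30s --health-retries=3 \
#      --label autoheal=true \
# 2) Run a watcher that restarts any labeled container whose healthcheck fails:
docker run -d --name autoheal --restart=always \
    -e AUTOHEAL_CONTAINER_LABEL=autoheal \
    -v /var/run/docker.sock:/var/run/docker.sock \
    willfarrell/autoheal

With --restart=always alone the container only comes back if the process exits; a healthcheck plus a watcher covers the case where the engine dies but the server keeps hanging.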

pseudotensor commented 2 days ago

Continues to happen for Pixtral every day we use it.

pseudotensor commented 2 days ago

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=8' \
        --max-model-len=49152 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

pseudotensor commented 1 day ago

Tried reducing the max model len to 32k; still the same issue. It is constantly crashing with that "PyTorch bug".
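To make explicit which knobs are being turned, this is roughly the memory-reducing flag combination being tried, shown as a standalone vllm serve invocation rather than the docker wrapper above. It is only a sketch: the flag names are standard vLLM server options, but the values are illustrative and not a verified working configuration for a 40GB MIG slice.

# Illustrative only: lower KV-cache and multimodal memory pressure.
vllm serve mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --enforce-eager \
    --max-model-len 32768 \
    --max-num-seqs 4 \
    --limit_mm_per_prompt 'image=4' \
    --gpu-memory-utilization 0.90

Shrinking --max-model-len shrinks the KV cache, --max-num-seqs bounds how many requests compete for it at once, fewer images per prompt reduces the vision encoder's peak allocations, and backing off from 0.98 GPU memory utilization leaves the allocator some headroom.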

pseudotensor commented 1 day ago

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 334, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1058, in load_model
    self.model = get_model(model_config=self.model_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
    model = _initialize_model(model_config, self.load_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
    return build_model(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
    return model_class(config=hf_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 153, in __init__
    self.language_model = init_vllm_registered_model(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in init_vllm_registered_model
    return build_model(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
    return model_class(config=hf_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 515, in __init__
    self.model = LlamaModel(config,
                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 305, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 420, in make_layers
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 307, in <lambda>
    lambda prefix: LlamaDecoderLayer(config=config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 217, in __init__
    self.self_attn = LlamaAttention(
                     ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 162, in __init__
    self.rotary_emb = get_rope(
                      ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 920, in get_rope
    rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 95, in __init__
    cache = self._compute_cos_sin_cache()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 115, in _compute_cos_sin_cache
    freqs = torch.einsum("i,j -> ij", t, inv_freq)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/functional.py", line 374, in einsum
    return handle_torch_function(einsum, operands, equation, *operands)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/overrides.py", line 1630, in handle_torch_function
    result = mode.__torch_function__(public_api, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/functional.py", line 386, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":838, please report a bug to PyTorch. 

pseudotensor commented 1 day ago

Maybe it is a hidden GPU OOM issue: https://github.com/pytorch/pytorch/issues/112377
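If it is a hidden OOM, a cheap way to check memory pressure from inside the running container is to query free memory directly. Just a sketch, assuming torch is importable there; torch.cuda.mem_get_info goes through the CUDA runtime rather than NVML, so it should still work on a MIG device.

# Hypothetical diagnostic: print free vs. total memory on the visible device.
docker exec pixtral python3 -c "import torch; f, t = torch.cuda.mem_get_info(); print(f'free={f/2**30:.2f} GiB of {t/2**30:.2f} GiB')"

If free memory is already near zero while requests are queued, the allocator assert is most likely just an OOM surfacing through a confusing code path.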

pseudotensor commented 1 day ago

I tried reducing the maximum number of sequences to 4 -- nope, it still crashes.

Will try avoiding MIG next.

pseudotensor commented 1 day ago

Seems I needed more than 40GB for 8 images, even with only 4 sequences. It's working now, so I'll close this, assuming the underlying bug is just that a GPU OOM occurs but is not reported properly.
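For anyone hitting the same thing, a rough way to confirm the OOM hypothesis (an assumption on my part, not something verified here) is to watch device memory while replaying a worst-case request, e.g. 8 images with a long context, and check that usage approaches the device limit.

# Log used/total memory every 5 seconds; most useful on a full (non-MIG) GPU,
# since nvidia-smi memory reporting for MIG instances can differ.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5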