This continues to happen with Pixtral every day we use it. Launch command:
docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker rm pixtral
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ \
-v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
-v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
-v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name pixtral \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=mistralai/Pixtral-12B-2409 \
--seed 1234 \
--tensor-parallel-size=1 \
--gpu-memory-utilization 0.98 \
--enforce-eager \
--tokenizer_mode mistral \
--limit_mm_per_prompt 'image=8' \
--max-model-len=49152 \
--max-log-len=100 \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt
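Once the container is up, a quick way to see whether the engine survives startup (the traceback below appears during model load) is to tail the container logs and probe the server; this is just a sketch, assuming the port above and that this vLLM build exposes /health:

# watch the engine load; the traceback below shows up here when it crashes
docker logs -f pixtral
# probe the OpenAI-compatible server once loading finishes
curl -sf http://localhost:5001/health && echo "engine is up"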
I tried reducing the max model length to 32k; the same issue persists. It is constantly crashing with this "PyTorch bug":
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 334, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1058, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
return build_model(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 153, in __init__
self.language_model = init_vllm_registered_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in init_vllm_registered_model
return build_model(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 515, in __init__
self.model = LlamaModel(config,
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 305, in __init__
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 420, in make_layers
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 307, in <lambda>
lambda prefix: LlamaDecoderLayer(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 217, in __init__
self.self_attn = LlamaAttention(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 162, in __init__
self.rotary_emb = get_rope(
^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 920, in get_rope
rotary_emb = RotaryEmbedding(head_size, rotary_dim, max_position, base,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 95, in __init__
cache = self._compute_cos_sin_cache()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/rotary_embedding.py", line 115, in _compute_cos_sin_cache
freqs = torch.einsum("i,j -> ij", t, inv_freq)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/functional.py", line 374, in einsum
return handle_torch_function(einsum, operands, equation, *operands)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/overrides.py", line 1630, in handle_torch_function
result = mode.__torch_function__(public_api, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 79, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/functional.py", line 386, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":838, please report a bug to PyTorch.
This may be a hidden GPU OOM issue: https://github.com/pytorch/pytorch/issues/112377
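One way to test the hidden-OOM theory is to watch memory on the MIG slice while the engine loads the weights; a minimal sketch (nvidia-smi lists MIG instances and their memory in a separate "MIG devices" table):

# list the MIG instances and their UUIDs (the one passed to --gpus above)
nvidia-smi -L
# watch per-instance memory while vLLM loads; if usage hits the slice's limit
# right before the NVML assert fires, it is an OOM in disguise
watch -n 1 nvidia-smi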
I tried reducing the maximum number of sequences (--max-num-seqs) to 4 -- it still crashes.
I will try avoiding MIG.
It seems I needed more than 40GB for 8 images, even with only 4 sequences. It is working now. I will close this, assuming the bug is simply that a GPU OOM happens but is not reported properly.
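For anyone who has to stay on a ~40GB MIG slice, shrinking the multimodal budget may make it fit. The values below are untested guesses, meant to replace the corresponding flags in the docker run above:

# untested guesses for a ~40GB slice: fewer images per prompt, shorter context,
# fewer concurrent sequences, and some headroom for the allocator
--limit_mm_per_prompt 'image=4'
--max-model-len=16384
--max-num-seqs=4
--gpu-memory-utilization 0.90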
Your current environment
H100 40GB (using MIG)
Model Input Dumps
No response
🐛 Describe the bug
Fine, this is a bug in torch.
However, vLLM should shut down cleanly rather than hang, so that e.g. Docker's restart=always policy can restart it and recover.
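As a workaround until the engine exits cleanly on this failure, an external watchdog can do the restart that --restart=always cannot (that policy only fires when the container exits, not when the process hangs). A minimal sketch, assuming the OpenAI server's /health endpoint on port 5001:

# crude watchdog: restart the container when the engine stops answering
while true; do
  if ! curl -sf --max-time 10 http://localhost:5001/health > /dev/null; then
    echo "$(date) pixtral unhealthy, restarting" >> watchdog.log
    docker restart pixtral
    sleep 120   # give the model time to reload before probing again
  fi
  sleep 30
done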