vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend #6766

Open oandreeva-nv opened 4 months ago

oandreeva-nv commented 4 months ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models: A100s

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

šŸ› Describe the bug

I am trying to understand vLLM's workflow for distributed serving via multiprocessing. The original setup deploys a model with tensor parallel size = 2 through Triton Inference Server, with distributed_executor_backend: mp . Inference works fine, but when the server shuts down, 2 pt_main_thread processes are not killed and their status is State: S (sleeping) .

The closest reproducer outside of Triton is the following:

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

SAMPLING_PARAMETERS = {"temperature": 0, "top_p": 1}

VLLM_ENGINE_CONFIG = {
    "model": "facebook/opt-125m",
    "disable_log_requests": True,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": True,
    "tensor_parallel_size": 2,
}

PROMPTS = [
    "The most dangerous animal is",
    "The capital of France is",
    "The future of AI is",
]

async def generate_python_vllm_output(prompt, llm_engine):
    # Stream one prompt through the engine and return the final decoded outputs.
    request_id = random_uuid()
    sampling_params = SamplingParams(**SAMPLING_PARAMETERS)
    python_vllm_output = None
    last_output = None

    async for vllm_output in llm_engine.generate(prompt, sampling_params, request_id):
        last_output = vllm_output

    if last_output:
        python_vllm_output = [
            (prompt + output.text).encode("utf-8") for output in last_output.outputs
        ]

    return python_vllm_output

llm_engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**VLLM_ENGINE_CONFIG))
python_vllm_output = []
# Cycle through the prompts repeatedly to keep the engine busy.
for i in range(len(PROMPTS) * 1000):
    python_vllm_output.extend(
        asyncio.run(generate_python_vllm_output(PROMPTS[i % len(PROMPTS)], llm_engine))
    )

The workflow is the following:

# ps
    PID TTY          TIME CMD
      1 pts/0    00:00:00 bash
  21346 pts/0    00:00:00 top
  21927 pts/0    00:00:00 top
  22463 pts/0    00:00:00 ps
# python3 vllm_reproducer.py &
...
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  7.38it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  7.37it/s]

INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
(VllmWorkerProcess pid=22534) INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
INFO 07-25 00:18:58 distributed_gpu_executor.py:56] # GPU blocks: 68037, # CPU blocks: 14563

# pkill -9 python3
# ps
    PID TTY          TIME CMD
      1 pts/0    00:00:00 bash
  21346 pts/0    00:00:00 top
  21927 pts/0    00:00:00 top
  22465 pts/0    00:00:22 pt_main_thread
  22534 pts/0    00:00:14 pt_main_thread
  22576 pts/0    00:00:00 python3 <defunct>
  22745 pts/0    00:00:00 ps

Again, the above 2 processes are in the sleeping state according to cat /proc/<PID>/status.
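For reference, a minimal sketch of the same check from Python (the PIDs are the leftover pt_main_thread processes from the ps output above):

# Print the State field from /proc/<pid>/status for each leftover worker.
for pid in (22465, 22534):
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            if line.startswith("State:"):
                print(pid, line.strip())  # e.g. 22465 State: S (sleeping)
                break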

Any insight into vLLM's distributed serving with multiprocessing is greatly appreciated.

KuntaiDu commented 4 months ago

I have also observed a similar thing. My current workaround is to run pkill -f pt_main_thread after terminating the vLLM server.

oandreeva-nv commented 4 months ago

pkill -f pt_main_thread after terminating vLLM server.

Unfortunately, this is not a viable solution for me.
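What I would need is an in-process cleanup instead of an external kill. A rough sketch of the direction, assuming psutil is available (a generic child reaper registered in the parent process, not a vLLM API):

import atexit
import os
import signal

import psutil  # third-party: pip install psutil

def kill_child_processes():
    # Best effort: SIGKILL every child of this process, which would include
    # the pt_main_thread workers forked by the MP executor backend.
    parent = psutil.Process(os.getpid())
    for child in parent.children(recursive=True):
        try:
            child.send_signal(signal.SIGKILL)
        except psutil.NoSuchProcess:
            pass

atexit.register(kill_child_processes)

The caveat is that atexit handlers never run when the parent itself is killed with SIGKILL (as in the pkill -9 python3 repro above), so something like this only covers clean shutdown paths.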

yums-gao commented 2 months ago

Same issue here. pkill -f does not work in my case either.

j-klesen commented 1 week ago

pkill -f pt_main_thread after terminating vLLM server.

This did not help in my case. I had to do:

top -b -n 1 | grep pt_main_thread | awk '{print $1}' | xargs kill -9
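
Note that this is essentially equivalent to pkill -9 -f pt_main_thread, so the -9 (SIGKILL) is presumably what made the difference over the plain pkill -f suggested above.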