vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: In k8s pod, it takes approximately 1 hour to start the model using vllm #6319

Closed WangxuP closed 3 months ago

WangxuP commented 3 months ago

Your current environment

Python 3.10.14

GPU: Tesla V100 * 4

nvidia-smi

Thu Jul 11 10:53:42 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:A1:00.0 Off |                    0 |
| N/A   39C    P0    55W / 300W |  17077MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:A2:00.0 Off |                    0 |
| N/A   37C    P0    56W / 300W |  17029MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:A3:00.0 Off |                    0 |
| N/A   37C    P0    54W / 300W |  17069MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:A4:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |  17069MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     24349      C   ...ence_venv_0113/bin/python    17074MiB |
|    1   N/A  N/A     26087      C   ...kerWrapper.execute_method    17026MiB |
|    2   N/A  N/A     26309      C   ...kerWrapper.execute_method    17066MiB |
|    3   N/A  N/A     26449      C   ...kerWrapper.execute_method    17066MiB |
+-----------------------------------------------------------------------------+

pkgs

Package                           Version
--------------------------------- ------------
accelerate                        0.30.1
addict                            2.4.0
aiobotocore                       2.7.0
aiofiles                          23.2.1
aiohttp                           3.9.5
aioitertools                      0.11.0
aioprometheus                     23.12.0
aiosignal                         1.3.1
aliyun-python-sdk-core            2.15.1
aliyun-python-sdk-kms             2.16.3
altair                            5.3.0
annotated-types                   0.7.0
anyio                             4.4.0
argon2-cffi                       23.1.0
argon2-cffi-bindings              21.2.0
async-timeout                     4.0.3
attrs                             23.2.0
azure-core                        1.30.1
azure-storage-blob                12.20.0
bcrypt                            4.1.3
botocore                          1.31.64
certifi                           2024.6.2
cffi                              1.16.0
charset-normalizer                3.3.2
click                             8.1.7
cloudpickle                       3.0.0
cmake                             3.29.3
colorama                          0.4.6
coloredlogs                       15.0.1
contourpy                         1.2.1
crcmod                            1.7
cryptography                      42.0.7
cycler                            0.12.1
dataclasses-json                  0.6.6
datasets                          2.18.0
diffusers                         0.28.2
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
ecdsa                             0.19.0
einops                            0.8.0
environs                          9.5.0
exceptiongroup                    1.2.1
fastapi                           0.110.3
ffmpy                             0.3.2
filelock                          3.14.0
flatbuffers                       24.3.25
fonttools                         4.53.0
frozenlist                        1.4.1
fsspec                            2023.10.0
gast                              0.5.4
gradio                            4.26.0
gradio_client                     0.15.1
greenlet                          3.0.3
grpcio                            1.60.0
h11                               0.14.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.23.2
humanfriendly                     10.0
idna                              3.7
importlib_metadata                7.1.0
importlib_resources               6.4.0
interegular                       0.3.3
isodate                           0.6.1
jieba                             0.42.1
Jinja2                            3.1.4
jmespath                          0.10.0
joblib                            1.4.2
jsonpatch                         1.33
jsonpointer                       2.4
jsonschema                        4.22.0
jsonschema-specifications         2023.12.1
kiwisolver                        1.4.5
langchain                         0.1.0
langchain-community               0.0.20
langchain-core                    0.1.23
langsmith                         0.0.87
lark                              1.1.9
llvmlite                          0.42.0
lm-format-enforcer                0.10.1
lxml                              5.2.2
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
marshmallow                       3.21.3
matplotlib                        3.9.0
mdurl                             0.1.2
minio                             7.2.7
modelscope                        1.14.0
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
mypy-extensions                   1.0.0
nest-asyncio                      1.6.0
networkx                          3.3
ninja                             1.11.1
numba                             0.59.1
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 8.9.2.26
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.555.43
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.5.40
nvidia-nvtx-cu12                  12.1.105
onnxruntime                       1.15.0
openai                            1.30.5
opencv-contrib-python             4.9.0.80
orjson                            3.10.3
oss2                              2.18.5
outlines                          0.0.34
packaging                         23.2
pandas                            2.2.2
passlib                           1.7.4
pdfminer.six                      20231228
pdfplumber                        0.11.0
peft                              0.11.1
pillow                            10.3.0
pip                               24.0
platformdirs                      4.2.2
prometheus_client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf                          5.27.0
psutil                            5.9.8
py-cpuinfo                        9.0.0
pyarrow                           16.1.0
pyarrow-hotfix                    0.6
pyasn1                            0.6.0
pycparser                         2.22
pycryptodome                      3.20.0
pydantic                          2.7.2
pydantic_core                     2.18.3
pydub                             0.25.1
Pygments                          2.18.0
pymilvus                          2.4.0
pynvml                            11.5.0
pyparsing                         3.1.2
PyPDF2                            3.0.1
pypdfium2                         4.30.0
python-dateutil                   2.9.0.post0
python-docx                       1.1.2
python-dotenv                     1.0.1
python-jose                       3.3.0
python-multipart                  0.0.9
pytz                              2024.1
PyYAML                            6.0.1
quantile-python                   1.1
ray                               2.23.0
referencing                       0.35.1
regex                             2024.5.15
requests                          2.32.3
rich                              13.7.1
rpds-py                           0.18.1
rsa                               4.9
ruff                              0.4.7
s3fs                              2023.10.0
safetensors                       0.4.3
scikit-learn                      1.5.0
scipy                             1.13.1
semantic-version                  2.10.0
sentence-transformers             3.0.0
sentencepiece                     0.2.0
setuptools                        69.5.1
shellingham                       1.5.4
simplejson                        3.19.2
six                               1.16.0
sniffio                           1.3.1
sortedcontainers                  2.4.0
SQLAlchemy                        2.0.30
sse-starlette                     2.1.0
starlette                         0.37.2
sympy                             1.12.1
tabulate                          0.9.0
tblib                             3.0.0
tenacity                          8.3.0
threadpoolctl                     3.5.0
tiktoken                          0.6.0
timm                              1.0.3
tokenizers                        0.19.1
tomli                             2.0.1
tomlkit                           0.12.0
toolz                             0.12.1
torch                             2.3.0
torchvision                       0.18.0
tqdm                              4.66.4
transformers                      4.41.0
triton                            2.3.0
typer                             0.11.1
typing_extensions                 4.12.1
typing-inspect                    0.9.0
tzdata                            2024.1
ujson                             5.10.0
urllib3                           2.0.7
uvicorn                           0.30.1
uvloop                            0.19.0
vllm                              0.4.3
vllm-flash-attn                   2.5.8.post2
vllm_nccl_cu12                    2.18.1.0.3.0
watchfiles                        0.22.0
websockets                        11.0.3
wheel                             0.43.0
wrapt                             1.16.0
xformers                          0.0.26.post1
xinference                        0.11.3
xoscar                            0.3.0
xxhash                            3.4.1
yapf                              0.40.2
yarl                              1.9.4
zipp                              3.19.1

k8s

kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.24.0
Kustomize Version: v4.5.4
Server Version: v1.24.0

docker

Docker version 25.0.5, build 5dc9bcc

🐛 Describe the bug

The k8s YAML resource spec is below:
          resources:
            limits:
              cpu: 16
              memory: 100Gi
              nvidia.com/gpu: 4
            requests:
              cpu: 16
              memory: 40Gi
              nvidia.com/gpu: 4
          volumeMounts:
            - mountPath: /opt/xinference
              name: nfs-knowledge-model
            - mountPath: /app/xinference/xinference/web/ui
              name: nfs-xinference-ui
            - mountPath: /app/xinference/xinference/ai_knowledge/configs/config.yaml
              name: ailink-debot-ai-xinference
              readOnly: false
              subPath: config.yaml
            - name: shm-volume
              mountPath: /dev/shm
attach to the k8s pod
kubectl exec -it my_pod -n namespace -- bash

start vLLM in the pod

VLLM_LOGGING_LEVEL=DEBUG VLLM_TRACE_FUNCTION=1 xinference_venv_0113/bin/python -m vllm.entrypoints.openai.api_server \
--model  /opt/xinference/models/deepseek-coder-33b-instruct \
--served-model-name deepseek-coder-33b-instruct \
--tensor-parallel-size 4 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--dtype float16  \
--enforce-eager 

When I checked the log, I found that it kept looping around the following calls, and it took about an hour to start successfully. The log:

2024-07-11 10:33:31.659759 Call to try_get in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:582 from detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:838
2024-07-11 10:33:31.659782 Return from try_get in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:584 to detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:838
2024-07-11 10:33:31.659821 Call to _get_current_dispatch_mode_stack in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_python_dispatch.py:119 from detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:845
2024-07-11 10:33:31.659850 Call to <listcomp> in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_python_dispatch.py:121 from _get_current_dispatch_mode_stack in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_python_dispatch.py:121
2024-07-11 10:33:31.659873 Return from <listcomp> in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_python_dispatch.py:121 to _get_current_dispatch_mode_stack in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_python_dispatch.py:121
2024-07-11 10:33:31.659891 Return from _get_current_dispatch_mode_stack in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_python_dispatch.py:121 to detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:845
2024-07-11 10:33:31.659916 Call to tree_leaves in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:841 from detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:849
2024-07-11 10:33:31.659938 Call to _tree_leaves_helper in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:823 from tree_leaves in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:847
2024-07-11 10:33:31.659960 Call to _is_leaf in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:608 from _tree_leaves_helper in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:828
2024-07-11 10:33:31.660001 Call to _get_node_type in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:601 from _is_leaf in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:609
2024-07-11 10:33:31.660028 Call to _is_namedtuple_instance in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:590 from _get_node_type in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:602
2024-07-11 10:33:31.660055 Return from _is_namedtuple_instance in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:594 to _get_node_type in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:602
2024-07-11 10:33:31.660077 Return from _get_node_type in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:604 to _is_leaf in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:609
2024-07-11 10:33:31.660099 Return from _is_leaf in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:609 to _tree_leaves_helper in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:828
2024-07-11 10:33:31.660120 Return from _tree_leaves_helper in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:830 to tree_leaves in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:847
2024-07-11 10:33:31.660139 Return from tree_leaves in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/utils/_pytree.py:848 to detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:849
2024-07-11 10:33:31.660167 Return from detect_fake_mode in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/_guards.py:864 to load_tensor in /opt/xinference/xinference_venv_0113/lib/python3.10/site-packages/torch/serialization.py:1374

The vLLM main log is:

[root@ailink-debot-ai-xinference-767f46d469-jmdc8 xinference]# VLLM_LOGGING_LEVEL=DEBUG VLLM_TRACE_FUNCTION=1 xinference_venv_0113/bin/python -m vllm.entrypoints.openai.api_server \
> --model  /opt/xinference/models/deepseek-coder-33b-instruct \
> --served-model-name deepseek-coder-33b-instruct \
> --tensor-parallel-size 4 \
> --max-model-len 4096 \
> --gpu-memory-utilization 0.9 \
> --dtype float16  \
> --enforce-eager 
WARNING 07-11 10:12:13 config.py:1155] Casting torch.bfloat16 to torch.float16.
2024-07-11 10:12:16,420 INFO worker.py:1749 -- Started a local Ray instance.
INFO 07-11 10:12:17 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/opt/xinference/models/deepseek-coder-33b-instruct', speculative_config=None, tokenizer='/opt/xinference/models/deepseek-coder-33b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=deepseek-coder-33b-instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-11 10:12:26 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-11 10:12:26 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-67369fecf3f84da09482cb95cb064e67/VLLM_TRACE_FUNCTION_for_process_4271_thread_140719189325632_at_2024-07-11_10:12:26.449215.log
(RayWorkerWrapper pid=5550) WARNING 07-11 10:12:26 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(RayWorkerWrapper pid=5550) INFO 07-11 10:12:26 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-67369fecf3f84da09482cb95cb064e67/VLLM_TRACE_FUNCTION_for_process_5550_thread_140233365694272_at_2024-07-11_10:12:26.449737.log
INFO 07-11 10:12:26 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-11 10:12:26 selector.py:51] Using XFormers backend.
(RayWorkerWrapper pid=5550) INFO 07-11 10:12:26 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=5550) INFO 07-11 10:12:26 selector.py:51] Using XFormers backend.
DEBUG 07-11 10:12:36 parallel_state.py:87] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://100.83.74.126:60495 backend=nccl
(RayWorkerWrapper pid=5550) DEBUG 07-11 10:12:36 parallel_state.py:87] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://100.83.74.126:60495 backend=nccl
(RayWorkerWrapper pid=5701) WARNING 07-11 10:12:26 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=5701) INFO 07-11 10:12:26 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-67369fecf3f84da09482cb95cb064e67/VLLM_TRACE_FUNCTION_for_process_5701_thread_140569728448320_at_2024-07-11_10:12:26.450763.log [repeated 2x across cluster]
(RayWorkerWrapper pid=5701) INFO 07-11 10:12:26 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 2x across cluster]
(RayWorkerWrapper pid=5701) INFO 07-11 10:12:26 selector.py:51] Using XFormers backend. [repeated 2x across cluster]
INFO 07-11 10:12:36 utils.py:618] Found nccl from library libnccl.so.2
INFO 07-11 10:12:36 pynccl.py:65] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=5550) INFO 07-11 10:12:36 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=5550) INFO 07-11 10:12:36 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 07-11 10:12:36 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(RayWorkerWrapper pid=5550) INFO 07-11 10:12:36 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-11 10:12:36 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-11 10:12:36 selector.py:51] Using XFormers backend.
youkaichao commented 3 months ago

Why do you use VLLM_TRACE_FUNCTION=1 in a production environment? It is only useful for debugging.

WangxuP commented 3 months ago

Why do you use VLLM_TRACE_FUNCTION=1 in a production environment? It is only useful for debugging.

I am not using the VLLM_TRACE_FUNCTION=1 environment variable in production; it is only set in the development environment, for the convenience of locating the problem.

youkaichao commented 3 months ago

Is the 1 hour before or after adding VLLM_TRACE_FUNCTION=1?

WangxuP commented 3 months ago

Is the 1 hour before or after adding VLLM_TRACE_FUNCTION=1?

No. Even when I didn't add this environment variable, it already took about an hour to start, so I added it afterwards to make it easy to get the trace log.

youkaichao commented 3 months ago

From what I can see:

vllm 0.4.3

You can try the latest vllm version.

        requests:
          cpu: 16
          memory: 40Gi

You can try to increase the CPU memory (host RAM) for the container, for example as sketched below.
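
A minimal sketch of one way to do that on a live cluster, assuming the pod is managed by a Deployment and that this is its first container; the namespace, Deployment name, and the 80Gi value are illustrative placeholders:

# Raise the container's memory request in place; the Deployment will roll out new pods with the updated spec.
kubectl -n <namespace> patch deployment <deployment-name> --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"80Gi"}]'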

WangxuP commented 3 months ago

Awesome! I have updated to vllm==0.5.1, without increasing the CPU memory for the container, and it now starts in about 10 minutes. Waiting 10 minutes is within an acceptable range. But I am not very clear about the underlying reason for this problem. What optimizations have you made? Where can I refer to the source code or a PR? Thanks!

youkaichao commented 3 months ago

It's difficult to tell; we are always optimizing anything we can optimize. If the newest version works, then you should just use the newest one.

If you want to dive deep, you can profile the startup time and find out the root cause; that is welcome, if you'd like to do it.
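
One minimal way to start profiling is to check whether reading the weights from storage dominates the startup time. This is only a sketch; the path comes from the launch command above, and the *.safetensors glob is an assumption about the checkpoint format (adjust to *.bin if that is what the directory contains):

# Time a raw sequential read of the weight shards straight from the mount.
# If this alone takes tens of minutes, the bottleneck is storage I/O
# (e.g. a slow NFS link), not vLLM initialization.
time cat /opt/xinference/models/deepseek-coder-33b-instruct/*.safetensors > /dev/null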

WangxuP commented 3 months ago

Okay! Recently I have been researching this code, and if I find any issues, I will raise a PR.

WangxuP commented 3 months ago

Ohhh! In k8s, special attention should be paid to the volume mount paths. If the model directory is mounted from an NFS server, the network throughput between the NFS server and the current node will also affect the model's startup speed, since it involves transferring large model weight files.
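
If the NFS link does turn out to be the bottleneck, one possible mitigation is to stage the weights onto node-local storage once and point --model at the copy. This is only a sketch; /local-models is an illustrative node-local path (e.g. a hostPath or emptyDir volume):

# Copy the weight directory from the NFS mount to node-local storage,
# then start vLLM against the local copy instead of the NFS path.
mkdir -p /local-models
cp -r /opt/xinference/models/deepseek-coder-33b-instruct /local-models/
# afterwards, launch with: --model /local-models/deepseek-coder-33b-instruct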