vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vLLM takes forever to load a locally stored 7B model #6876

Open vibhas-singh opened 3 months ago

vibhas-singh commented 3 months ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.26

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.217-205.860.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7R32
Stepping:            0
CPU MHz:             3130.445
BogoMIPS:            5599.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.16.1
[pip3] torch==2.3.1
[pip3] torch-model-archiver==0.7.1b20230208
[pip3] torch-workflow-archiver==0.2.13b20240516
[pip3] torchaudio==2.2.0
[pip3] torchdata==0.7.1
[pip3] torchserve==0.11.0b20240516
[pip3] torchtext==0.17.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] triton==2.3.1
[conda] aws-ofi-nccl              1.9.1           aws_efa1.26.1_0    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] blas                      2.116                       mkl    conda-forge
[conda] blas-devel                3.9.0            16_linux64_mkl    conda-forge
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.1.0           h84fe81f_915    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] mkl-devel                 2022.1.0           ha770c72_916    conda-forge
[conda] mkl-include               2022.1.0           h84fe81f_915    conda-forge
[conda] numpy                     1.26.4          py310hb13e2d6_0    conda-forge
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] pytorch-cuda              12.1                 ha16c6d3_5    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] pytorch-mutex             1.0                        cuda    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torch-model-archiver      0.7.1                   py310_0    pytorch
[conda] torch-workflow-archiver   0.2.13                  py310_0    pytorch
[conda] torchaudio                2.2.0               py310_cu121    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torchdata                 0.7.1                     py310    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torchserve                0.11.0                  py310_0    pytorch
[conda] torchtext                 0.17.0                    py310    https://aws-ml-conda.s3.us-west-2.amazonaws.com
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.43.3                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-7 0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I have fine-tuned a Llama 7B model using transformers and QLoRA, then merged the LoRA weights into the base model and saved it to disk. The saved weights look something like this: [screenshot of the saved checkpoint files attached]
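For context, the merge was done roughly along these lines (a minimal sketch assuming the standard PEFT `merge_and_unload` workflow; the paths and exact arguments here are placeholders, not my actual script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Hypothetical paths, for illustration only
base_model_id = "base/llama-7b"
adapter_path = "path/to/qlora-adapter"
output_path = "path/of/model"

# Load the base model in bf16, attach the LoRA adapter, and merge it in
base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()

# Save the merged weights as (sharded) safetensors plus the tokenizer
merged.save_pretrained(output_path, safe_serialization=True)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_path)
```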

Now I am trying to load these weights via the Python API of vLLM with the following code:

```python
from vllm import LLM

model_path = "path/of/model"
tokenizer_path = "path/to/tokenizer"
llm = LLM(model=model_path, tokenizer=tokenizer_path, max_model_len=4096)
```

Loading all the shards takes practically forever: the progress bar estimates a total load time of 51 minutes, and a single shard takes almost 9 minutes to load.

A few additional points:

DarkLight1337 commented 3 months ago

Please go through this troubleshooting guide and see if it can resolve your issue.

vibhas-singh commented 3 months ago

@DarkLight1337 I tried turning on debugging using the flags, but I am not observing anything unusual.
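For reference, these are roughly the debug settings I enabled before launching (set from Python here for illustration; only VLLM_TRACE_FUNCTION appears explicitly in the logs below, so treat the rest as my assumption about which flags from the troubleshooting guide are meant):

```python
import os

# Debug flags along the lines of the vLLM troubleshooting guide (assumed set);
# only VLLM_TRACE_FUNCTION is visible in the logs that follow.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"   # more verbose vLLM logging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"     # synchronous CUDA kernel launches
os.environ["NCCL_DEBUG"] = "TRACE"           # verbose NCCL logging
os.environ["VLLM_TRACE_FUNCTION"] = "1"      # trace every Python function call

from vllm import LLM  # import after setting the environment variables

llm = LLM(model="path/of/model", tokenizer="path/to/tokenizer", max_model_len=4096)
```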

Sharing the logs from the console here (I have redacted the IP):

INFO 07-30 16:24:13 llm_engine.py:176] Initializing an LLM engine (v0.5.3) with config: model='merged_model_1', speculative_config=None, tokenizer='code-llama-7b-text-to-sql', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=merged_model_1, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-30 16:24:14 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-30 16:24:14 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-e4e21a3b888745af8bd6ea332ced22e3/VLLM_TRACE_FUNCTION_for_process_15863_thread_140205088794432_at_2024-07-30_16:24:14.428276.log
DEBUG 07-30 16:24:17 parallel_state.py:803] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://XXX.XX.XX.XX:57753 backend=nccl
INFO 07-30 16:24:17 model_runner.py:680] Starting to load model merged_model_1...
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [08:12<49:16, 492.67s/it]

The sanity check is also successful.

DarkLight1337 commented 3 months ago

@youkaichao can you help with this?

youkaichao commented 3 months ago

The log shows it is still loading the model, so it might be a disk I/O problem?

vibhas-singh commented 3 months ago

Apologies for the naive question, but is there a way to identify disk I/O problems? Would it be possible to have a sanity check for disk I/O as well, like the one we have for incorrect hardware/drivers in the troubleshooting guide?

Ziyi6 commented 2 months ago

> Apologies for the naive question, but is there a way to identify disk I/O problems? Would it be possible to have a sanity check for disk I/O as well, like the one we have for incorrect hardware/drivers in the troubleshooting guide?

Bro, did you find an answer? I also ran into this slow loading problem. I use an HDD, but I noticed the model was already loaded onto the GPU, and the program was just stuck waiting.

Ziyi6 commented 2 months ago

> Apologies for the naive question, but is there a way to identify disk I/O problems? Would it be possible to have a sanity check for disk I/O as well, like the one we have for incorrect hardware/drivers in the troubleshooting guide?

I tried moving the models to an SSD and the problem was solved... The loading time dropped to about 2 minutes, compared to 60 minutes with the HDD...

vibhas-singh commented 2 months ago

@Ziyi6 Not really.

I am using Amazon SageMaker Notebook instances, and from what I can infer from the docs, they already use general-purpose SSD storage. Refer: https://repost.aws/questions/QUsip6vrGlRQOhlyEjZNeEMg/ebs-volume-type-for-sagemaker-notebook-instances

I have yet to explore a way to check the actual I/O speed I am getting, which is why I asked the maintainers whether it would be possible to add a debugging script for slow I/O as well. A rough sketch of the kind of check I have in mind is below.
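For illustration only (my own rough idea, not something provided by vLLM; the checkpoint directory is a placeholder), this times a sequential read of the safetensors shards to estimate the effective read throughput of the volume:

```python
import glob
import os
import time

# Hypothetical checkpoint directory; replace with the actual model path
model_dir = "path/of/model"

shard_paths = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
total_bytes = 0
start = time.perf_counter()

for path in shard_paths:
    # Read each shard sequentially in 64 MiB chunks, mimicking a cold load.
    # Note: results are optimistic if the files are already in the OS page cache.
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024 * 1024):
            total_bytes += len(chunk)

elapsed = max(time.perf_counter() - start, 1e-9)
gib = total_bytes / 2**30
print(f"Read {gib:.1f} GiB from {len(shard_paths)} shards in {elapsed:.1f} s "
      f"({gib / elapsed:.2f} GiB/s)")
```

If this reports throughput in the tens of MiB/s rather than hundreds, the 8-minute-per-shard load times above would be explained by storage speed rather than by vLLM itself.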