[Bug]: GPU Memory Utilization Lower Than Expected with --enable-prefix-caching

hxer7963 commented 2 months ago

Your current environment

The output of `python collect_env.py`

```text PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: 14.0.0-1ubuntu1.1 CMake version: version 3.27.7 Libc version: glibc-2.35 Python version: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.4.119-19-0009.11-x86_64-with-glibc2.17 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A800-SXM4-80GB Nvidia driver version: 470.182.03 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 232 On-line CPU(s) list: 0-231 Vendor ID: AuthenticAMD Model name: AMD EPYC 7K83 64-Core Processor CPU family: 25 Model: 1 Thread(s) per core: 2 Core(s) per socket: 58 Socket(s): 2 Stepping: 1 BogoMIPS: 4890.81 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm Hypervisor vendor: KVM Virtualization type: full L1d cache: 3.6 MiB (116 instances) L1i cache: 3.6 MiB (116 instances) L2 cache: 58 MiB (116 instances) L3 cache: 512 MiB (16 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-115 NUMA node1 CPU(s): 116-231 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.24.4 [pip3] nvidia-cublas-cu11==11.10.3.66 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu11==11.7.101 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu11==11.7.99 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu11==11.7.99 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu11==8.5.0.96 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu11==10.9.0.58 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu11==10.2.10.91 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu11==11.4.0.1 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu11==11.7.4.91 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-dali-cuda120==1.36.0 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu11==2.14.3 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvimgcodec-cu12==0.2.0.7 [pip3] nvidia-nvjitlink-cu12==12.4.99 [pip3] nvidia-nvtx-cu11==11.7.91 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pynvml==11.5.0 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] transformers-stream-generator==0.0.4 [pip3] triton==3.0.0 [pip3] tritonclient==2.43.0 [conda] numpy 1.24.4 pypi_0 pypi [conda] nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu11 10.2.10.91 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi [conda] nvidia-nvtx-cu11 11.7.91 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pynvml 11.5.0 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.44.2 pypi_0 pypi [conda] transformers-stream-generator 0.0.4 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi [conda] tritonclient 2.43.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU0 X NODE NODE SYS PIX SYS SYS PIX SYS 0-115 0 NODE X PIX SYS NODE SYS SYS NODE SYS NODE PIX X SYS NODE SYS SYS NODE SYS SYS SYS SYS X SYS PIX NODE SYS NODE PIX NODE NODE SYS X SYS SYS PIX SYS SYS SYS SYS PIX SYS X NODE SYS NODE SYS SYS SYS NODE SYS NODE X SYS PIX PIX NODE NODE SYS PIX SYS SYS X SYS SYS SYS SYS NODE SYS NODE PIX SYS X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

🐛 Describe the bug

Description: When launching vllm with the --enable-prefix-caching flag, the GPU memory utilization is only around 70%, which is lower than the expected 90% based on the gpu_memory_utilization=0.9 setting. The model being used is neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8.

Steps to Reproduce:

Launch vllm with the following command:

nohup python -m vllm.entrypoints.openai.api_server --enable-prefix-caching --tensor-parallel-size 1 --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 --trust-remote-code --enable-chunked-prefill 1>vlog 2>&1 &

Confirm startup log

INFO 09-07 00:53:29 api_server.py:459] vLLM API server version 0.6.0
INFO 09-07 00:53:29 api_server.py:460] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, collect_detailed_traces=None, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=True, enable_prompt_adapter=False, enforce_eager=False, engine_use_ray=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=256, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', model_loader_extra_config=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=1, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=8000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, trust_remote_code=True, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 09-07 00:53:29 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b4595389-84ab-4555-b131-f412f6b7ae8a for RPC Path.
INFO 09-07 00:53:29 api_server.py:176] Started engine process with PID 72390
WARNING 09-07 00:53:32 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 09-07 00:53:35 arg_utils.py:872] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 09-07 00:53:35 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=True, use_async_output_proc=True)
INFO 09-07 00:53:36 model_runner.py:915] Starting to load model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8...
^MLoading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
^MLoading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.98s/it]
^MLoading safetensors checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.42s/it]
^MLoading safetensors checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.35s/it]
INFO 09-07 00:53:44 model_runner.py:926] Loading model weights took 8.4939 GB
INFO 09-07 00:53:49 gpu_executor.py:122] # GPU blocks: 22813, # CPU blocks: 2048
INFO 09-07 00:53:52 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-07 00:53:52 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-07 00:54:06 model_runner.py:1335] Graph capturing finished in 15 secs.
INFO 09-07 00:54:06 block_manager_v1.py:263] Automatic prefix caching is enabled.
INFO 09-07 00:54:07 api_server.py:224] vLLM to use /tmp/tmpr07cznt6 as PROMETHEUS_MULTIPROC_DIR
WARNING 09-07 00:54:07 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
INFO 09-07 00:54:07 launcher.py:20] Available routes are:
INFO 09-07 00:54:07 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 09-07 00:54:07 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 09-07 00:54:07 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 09-07 00:54:07 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 09-07 00:54:07 launcher.py:28] Route: /health, Methods: GET
INFO 09-07 00:54:07 launcher.py:28] Route: /tokenize, Methods: POST
INFO 09-07 00:54:07 launcher.py:28] Route: /detokenize, Methods: POST
INFO 09-07 00:54:07 launcher.py:28] Route: /v1/models, Methods: GET
INFO 09-07 00:54:07 launcher.py:28] Route: /version, Methods: GET
INFO 09-07 00:54:07 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 09-07 00:54:07 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 09-07 00:54:07 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 09-07 00:54:07 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO:     Started server process [72320]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Observe the GPU memory utilization. $ nvitop

Expected Behavior: The GPU memory utilization should be around 90% as specified by gpu_memory_utilization=0.9.

Actual Behavior: The GPU memory utilization is approximately 70%, which is lower than the expected utilization.

Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

SolitaryThinker commented 2 months ago

You could try increasing the max batch size with --max-num-seqs. By default it is 256 which may be too small for fp8 8B

hxer7963 commented 2 months ago

You could try increasing the max batch size with --max-num-seqs. By default it is 256 which may be too small for fp8 8B

First, I would like to thank SolitaryThinker for the prompt response and the suggestion to increase --max_num_deqs from the default 256 to 512.

I have applied the recommendation, but the GPU memory utilization remains around 70%, which is still lower than the expected 90% based on the gpu_memory_utilization=0.9 setting.

Launch vllm with the following command:

nohup python -m vllm.entrypoints.openai.api_server --enable-prefix-caching --tensor-parallel-size 1 --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 --trust-remote-code --max_num_seqs 512 1>vlog 2>&1 &

Confirm startup log

INFO 09-07 13:36:33 api_server.py:459] vLLM API server version 0.6.0
INFO 09-07 13:36:33 api_server.py:460] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, collect_detailed_traces=None, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=True, enable_prompt_adapter=False, enforce_eager=False, engine_use_ray=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=512, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', model_loader_extra_config=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=1, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=8000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, trust_remote_code=True, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 09-07 13:36:33 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/2ca8518b-ac53-41e7-bcf4-ce75f414ed4b for RPC Path.
INFO 09-07 13:36:33 api_server.py:176] Started engine process with PID 73460
WARNING 09-07 13:36:36 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 09-07 13:36:40 arg_utils.py:872] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 09-07 13:36:40 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=True, use_async_output_proc=True)
INFO 09-07 13:36:41 model_runner.py:915] Starting to load model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.65s/it]
INFO 09-07 13:36:45 model_runner.py:926] Loading model weights took 8.4939 GB
INFO 09-07 13:36:51 gpu_executor.py:122] # GPU blocks: 22813, # CPU blocks: 2048
INFO 09-07 13:36:52 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-07 13:36:52 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-07 13:37:19 model_runner.py:1335] Graph capturing finished in 26 secs.
INFO 09-07 13:37:19 block_manager_v1.py:263] Automatic prefix caching is enabled.
INFO 09-07 13:37:19 api_server.py:224] vLLM to use /tmp/tmptstj9k3r as PROMETHEUS_MULTIPROC_DIR
WARNING 09-07 13:37:19 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
INFO 09-07 13:37:19 launcher.py:20] Available routes are:
INFO 09-07 13:37:19 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 09-07 13:37:19 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 09-07 13:37:19 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 09-07 13:37:19 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 09-07 13:37:19 launcher.py:28] Route: /health, Methods: GET
INFO 09-07 13:37:19 launcher.py:28] Route: /tokenize, Methods: POST
INFO 09-07 13:37:19 launcher.py:28] Route: /detokenize, Methods: POST
INFO 09-07 13:37:19 launcher.py:28] Route: /v1/models, Methods: GET
INFO 09-07 13:37:19 launcher.py:28] Route: /version, Methods: GET
INFO 09-07 13:37:19 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 09-07 13:37:19 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 09-07 13:37:19 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 09-07 13:37:19 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO:     Started server process [73390]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Observe the GPU memory utilization.

robertgshaw2-neuralmagic commented 2 months ago

@hxer7963

What is happening is that it seems like chunked_prefill is disabled in your configuration. Since this model has very long max_model_len=128k, we need to reserve space for a prefill of this size. This is why the memory usage is only 70%

I see in your command that your explicitly enabled chunked prefill. I need to look into why it is getting disabled. I think the --enable-chunked-prefill flag may not longer be a store_true 😢

Can you try setting --enable-chunked-prefill True explicitly?

hxer7963 commented 2 months ago

@robertgshaw2-neuralmagic I try setting --enable-chunked-prefill True and --enable-prefix-caching, the gpu memory utilization is ~ 90%. But the benchmark preformance is poor than only setting --enable-prefix-caching where the gpu memory utilization is only ~ 70%.

My final workaround is setting --enable-prefix-caching and --max-model-len 8192 to max gpu memory utlization and avoid the performace loss with mix enable chunked prefill and prefix caching.

I would like to confirm why chunked prefill and prefill caching seem incompatible?

ChuanhongLi commented 2 months ago

@hxer7963

What is happening is that it seems like chunked_prefill is disabled in your configuration. Since this model has very long max_model_len=128k, we need to reserve space for a prefill of this size. This is why the memory usage is only 70%

I see in your command that your explicitly enabled chunked prefill. I need to look into why it is getting disabled. I think the --enable-chunked-prefill flag may not longer be a store_true 😢

Can you try setting --enable-chunked-prefill True explicitly?

@robertgshaw2-neuralmagic Sorry to bother. I'm a little confused with the decription "Since this model has very long max_model_len=128k, we need to reserve space for a prefill of this size. This is why the memory usage is only 70%". The space reserved is used for what (activation value or something else)? Can you explain this in more detail? Thanks.

vllm-project / vllm