vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: GPU Memory Utilization Lower Than Expected with --enable-prefix-caching #8242

Open hxer7963 opened 2 months ago

hxer7963 commented 2 months ago

Your current environment

The output of `python collect_env.py` ```text PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: 14.0.0-1ubuntu1.1 CMake version: version 3.27.7 Libc version: glibc-2.35 Python version: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.4.119-19-0009.11-x86_64-with-glibc2.17 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A800-SXM4-80GB Nvidia driver version: 470.182.03 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 232 On-line CPU(s) list: 0-231 Vendor ID: AuthenticAMD Model name: AMD EPYC 7K83 64-Core Processor CPU family: 25 Model: 1 Thread(s) per core: 2 Core(s) per socket: 58 Socket(s): 2 Stepping: 1 BogoMIPS: 4890.81 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm Hypervisor vendor: KVM Virtualization type: full L1d cache: 3.6 MiB (116 instances) L1i cache: 3.6 MiB (116 instances) L2 cache: 58 MiB (116 instances) L3 cache: 512 MiB (16 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-115 NUMA node1 CPU(s): 116-231 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.24.4 [pip3] nvidia-cublas-cu11==11.10.3.66 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu11==11.7.101 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu11==11.7.99 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu11==11.7.99 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu11==8.5.0.96 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu11==10.9.0.58 [pip3] 
nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu11==10.2.10.91 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu11==11.4.0.1 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu11==11.7.4.91 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-dali-cuda120==1.36.0 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu11==2.14.3 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvimgcodec-cu12==0.2.0.7 [pip3] nvidia-nvjitlink-cu12==12.4.99 [pip3] nvidia-nvtx-cu11==11.7.91 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pynvml==11.5.0 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] transformers-stream-generator==0.0.4 [pip3] triton==3.0.0 [pip3] tritonclient==2.43.0 [conda] numpy 1.24.4 pypi_0 pypi [conda] nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu11 10.2.10.91 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi [conda] nvidia-nvtx-cu11 11.7.91 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pynvml 11.5.0 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.44.2 pypi_0 pypi [conda] transformers-stream-generator 0.0.4 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi [conda] tritonclient 2.43.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU0 X NODE NODE SYS PIX SYS SYS PIX SYS 0-115 0 NODE X PIX SYS NODE SYS SYS NODE SYS NODE PIX X SYS NODE SYS SYS NODE SYS SYS SYS SYS X SYS PIX NODE SYS NODE PIX NODE NODE SYS X SYS SYS PIX SYS SYS SYS SYS PIX SYS X NODE SYS NODE SYS SYS SYS NODE SYS NODE X SYS PIX PIX NODE NODE SYS PIX SYS SYS X SYS SYS SYS SYS NODE SYS NODE PIX SYS X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

🐛 Describe the bug

Description: When launching vLLM with the --enable-prefix-caching flag, the GPU memory utilization is only around 70%, which is lower than the ~90% expected from the gpu_memory_utilization=0.9 setting. The model being used is neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8.

Steps to Reproduce:

  1. Launch vllm with the following command:
    nohup python -m vllm.entrypoints.openai.api_server --enable-prefix-caching --tensor-parallel-size 1 --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 --trust-remote-code --enable-chunked-prefill 1>vlog 2>&1 &
  2. Confirm startup log
    INFO 09-07 00:53:29 api_server.py:459] vLLM API server version 0.6.0
    INFO 09-07 00:53:29 api_server.py:460] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, collect_detailed_traces=None, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=True, enable_prompt_adapter=False, enforce_eager=False, engine_use_ray=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=256, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', model_loader_extra_config=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=1, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=8000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, trust_remote_code=True, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
    INFO 09-07 00:53:29 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/b4595389-84ab-4555-b131-f412f6b7ae8a for RPC Path.
    INFO 09-07 00:53:29 api_server.py:176] Started engine process with PID 72390
    WARNING 09-07 00:53:32 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    WARNING 09-07 00:53:35 arg_utils.py:872] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
    INFO 09-07 00:53:35 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=True, use_async_output_proc=True)
    INFO 09-07 00:53:36 model_runner.py:915] Starting to load model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8...
    Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.98s/it]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.42s/it]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.35s/it]
    INFO 09-07 00:53:44 model_runner.py:926] Loading model weights took 8.4939 GB
    INFO 09-07 00:53:49 gpu_executor.py:122] # GPU blocks: 22813, # CPU blocks: 2048
    INFO 09-07 00:53:52 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
    INFO 09-07 00:53:52 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
    INFO 09-07 00:54:06 model_runner.py:1335] Graph capturing finished in 15 secs.
    INFO 09-07 00:54:06 block_manager_v1.py:263] Automatic prefix caching is enabled.
    INFO 09-07 00:54:07 api_server.py:224] vLLM to use /tmp/tmpr07cznt6 as PROMETHEUS_MULTIPROC_DIR
    WARNING 09-07 00:54:07 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
    INFO 09-07 00:54:07 launcher.py:20] Available routes are:
    INFO 09-07 00:54:07 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
    INFO 09-07 00:54:07 launcher.py:28] Route: /docs, Methods: GET, HEAD
    INFO 09-07 00:54:07 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
    INFO 09-07 00:54:07 launcher.py:28] Route: /redoc, Methods: GET, HEAD
    INFO 09-07 00:54:07 launcher.py:28] Route: /health, Methods: GET
    INFO 09-07 00:54:07 launcher.py:28] Route: /tokenize, Methods: POST
    INFO 09-07 00:54:07 launcher.py:28] Route: /detokenize, Methods: POST
    INFO 09-07 00:54:07 launcher.py:28] Route: /v1/models, Methods: GET
    INFO 09-07 00:54:07 launcher.py:28] Route: /version, Methods: GET
    INFO 09-07 00:54:07 launcher.py:28] Route: /v1/chat/completions, Methods: POST
    INFO 09-07 00:54:07 launcher.py:28] Route: /v1/completions, Methods: POST
    INFO 09-07 00:54:07 launcher.py:28] Route: /v1/embeddings, Methods: POST
    INFO 09-07 00:54:07 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
    INFO:     Started server process [72320]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
  3. Observe the GPU memory utilization with `nvitop` (screenshot attached).

Expected Behavior: The GPU memory utilization should be around 90% as specified by gpu_memory_utilization=0.9.

Actual Behavior: The GPU memory utilization is approximately 70%, which is lower than the expected utilization.
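
For context, a quick back-of-the-envelope check of the reported numbers (a sketch only, assuming the standard Llama-3.1-8B geometry: 32 layers, 8 KV heads, head_dim 128, a bf16 KV cache, and the default block_size of 16) shows that the logged 22813 GPU blocks plus the 8.49 GiB of weights account for roughly two thirds of an 80 GB A800, which is consistent with the ~70% observed:

```python
# Back-of-the-envelope sketch (assumed Llama-3.1-8B geometry; nothing here is measured).
layers, kv_heads, head_dim = 32, 8, 128
block_size, kv_bytes = 16, 2                       # default block size, bf16 KV cache

# Bytes per KV-cache block: K and V, for every layer, KV head, head dim and token slot.
block_bytes = 2 * layers * kv_heads * head_dim * block_size * kv_bytes
print(block_bytes / 2**20)                         # 2.0 MiB per block

kv_cache_gib = 22813 * block_bytes / 2**30         # GPU blocks reported in the startup log
weights_gib = 8.49                                 # "Loading model weights took 8.4939 GB"
print(round(kv_cache_gib, 1), round(kv_cache_gib + weights_gib, 1))
# ~44.6 GiB of KV cache plus ~8.5 GiB of weights is ~53 GiB, i.e. roughly 66-70%
# of an 80 GB A800, so a sizeable part of the 0.9 budget is not going to the KV cache.
```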


SolitaryThinker commented 2 months ago

You could try increasing the max batch size with --max-num-seqs. By default it is 256, which may be too small for an fp8 8B model.

hxer7963 commented 2 months ago

You could try increasing the max batch size with --max-num-seqs. By default it is 256, which may be too small for an fp8 8B model.

First, I would like to thank SolitaryThinker for the prompt response and the suggestion to increase --max-num-seqs from the default 256 to 512.

I have applied the recommendation, but the GPU memory utilization remains around 70%, which is still lower than the expected 90% based on the gpu_memory_utilization=0.9 setting.

  1. Launch vllm with the following command:

    nohup python -m vllm.entrypoints.openai.api_server --enable-prefix-caching --tensor-parallel-size 1 --model neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 --trust-remote-code --max_num_seqs 512 1>vlog 2>&1 &
  2. Confirm startup log

    INFO 09-07 13:36:33 api_server.py:459] vLLM API server version 0.6.0
    INFO 09-07 13:36:33 api_server.py:460] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key=None, block_size=16, chat_template=None, code_revision=None, collect_detailed_traces=None, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=True, enable_prompt_adapter=False, enforce_eager=False, engine_use_ray=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=None, max_num_batched_tokens=None, max_num_seqs=512, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', model_loader_extra_config=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=1, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=8000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, trust_remote_code=True, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
    INFO 09-07 13:36:33 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/2ca8518b-ac53-41e7-bcf4-ce75f414ed4b for RPC Path.
    INFO 09-07 13:36:33 api_server.py:176] Started engine process with PID 73460
    WARNING 09-07 13:36:36 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    WARNING 09-07 13:36:40 arg_utils.py:872] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
    INFO 09-07 13:36:40 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', speculative_config=None, tokenizer='/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=True, use_async_output_proc=True)
    INFO 09-07 13:36:41 model_runner.py:915] Starting to load model /mnt/llm_dataset/willhe/ckpt/Meta-Llama-3.1-8B-Instruct-quantized.w8a8...
    Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.44s/it]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.69s/it]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.65s/it]
    INFO 09-07 13:36:45 model_runner.py:926] Loading model weights took 8.4939 GB
    INFO 09-07 13:36:51 gpu_executor.py:122] # GPU blocks: 22813, # CPU blocks: 2048
    INFO 09-07 13:36:52 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
    INFO 09-07 13:36:52 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
    INFO 09-07 13:37:19 model_runner.py:1335] Graph capturing finished in 26 secs.
    INFO 09-07 13:37:19 block_manager_v1.py:263] Automatic prefix caching is enabled.
    INFO 09-07 13:37:19 api_server.py:224] vLLM to use /tmp/tmptstj9k3r as PROMETHEUS_MULTIPROC_DIR
    WARNING 09-07 13:37:19 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
    INFO 09-07 13:37:19 launcher.py:20] Available routes are:
    INFO 09-07 13:37:19 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
    INFO 09-07 13:37:19 launcher.py:28] Route: /docs, Methods: GET, HEAD
    INFO 09-07 13:37:19 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
    INFO 09-07 13:37:19 launcher.py:28] Route: /redoc, Methods: GET, HEAD
    INFO 09-07 13:37:19 launcher.py:28] Route: /health, Methods: GET
    INFO 09-07 13:37:19 launcher.py:28] Route: /tokenize, Methods: POST
    INFO 09-07 13:37:19 launcher.py:28] Route: /detokenize, Methods: POST
    INFO 09-07 13:37:19 launcher.py:28] Route: /v1/models, Methods: GET
    INFO 09-07 13:37:19 launcher.py:28] Route: /version, Methods: GET
    INFO 09-07 13:37:19 launcher.py:28] Route: /v1/chat/completions, Methods: POST
    INFO 09-07 13:37:19 launcher.py:28] Route: /v1/completions, Methods: POST
    INFO 09-07 13:37:19 launcher.py:28] Route: /v1/embeddings, Methods: POST
    INFO 09-07 13:37:19 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
    INFO:     Started server process [73390]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
  3. Observe the GPU memory utilization (screenshot attached).

robertgshaw2-neuralmagic commented 2 months ago

@hxer7963

It seems that chunked_prefill is disabled in your configuration. Since this model has a very long max_model_len=128k, we need to reserve space for a prefill of this size. This is why the memory usage is only ~70%.

I see in your command that you explicitly enabled chunked prefill. I need to look into why it is getting disabled. I think the --enable-chunked-prefill flag may no longer be a store_true 😢

Can you try setting --enable-chunked-prefill True explicitly?
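
As a generic aside (a hypothetical argparse sketch, not vLLM's actual flag definition): once a flag is no longer a plain store_true and instead accepts an optional value, passing the value explicitly is the unambiguous form, which is why `--enable-chunked-prefill True` is a reasonable thing to test:

```python
# Hypothetical illustration of the flag-parsing difference (NOT vLLM's actual parser):
# a plain store_true flag versus a flag that takes an optional boolean value.
import argparse

p = argparse.ArgumentParser()
p.add_argument("--foo", action="store_true")       # bare "--foo" -> True; "--foo True" is rejected
p.add_argument("--bar", nargs="?", const=True, default=None,
               type=lambda s: s.lower() in ("1", "true"))

print(p.parse_args(["--foo"]).foo)                 # True
print(p.parse_args(["--bar"]).bar)                 # True (from const)
print(p.parse_args(["--bar", "True"]).bar)         # True (from the parsed value)
print(p.parse_args([]).bar)                        # None (flag omitted entirely)
```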

hxer7963 commented 2 months ago

@robertgshaw2-neuralmagic I tried setting --enable-chunked-prefill True together with --enable-prefix-caching, and the GPU memory utilization is ~90%. However, the benchmark performance is worse than with only --enable-prefix-caching, where the GPU memory utilization is only ~70%.

My final workaround is to set --enable-prefix-caching and --max-model-len 8192, which maximizes GPU memory utilization and avoids the performance loss from enabling chunked prefill and prefix caching together.
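
A minimal sketch of that workaround via the offline Python API (keyword names are assumed to mirror the vLLM 0.6.x CLI flags; the server command is the same with the corresponding flags):

```python
# Sketch of the workaround configuration, assuming LLM() forwards these engine
# arguments the same way the CLI flags do: prefix caching on, context capped at
# 8192 so a full 128k-token prefill no longer has to be accommodated.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
    enable_prefix_caching=True,
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    trust_remote_code=True,
)
```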

Could you confirm why chunked prefill and prefix caching seem to be incompatible?

ChuanhongLi commented 2 months ago

@hxer7963

It seems that chunked_prefill is disabled in your configuration. Since this model has a very long max_model_len=128k, we need to reserve space for a prefill of this size. This is why the memory usage is only ~70%.

I see in your command that you explicitly enabled chunked prefill. I need to look into why it is getting disabled. I think the --enable-chunked-prefill flag may no longer be a store_true 😢

Can you try setting --enable-chunked-prefill True explicitly?

@robertgshaw2-neuralmagic Sorry to bother you. I'm a little confused by the description "Since this model has a very long max_model_len=128k, we need to reserve space for a prefill of this size. This is why the memory usage is only ~70%". What is the reserved space used for (activation values or something else)? Can you explain this in more detail? Thanks.
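
For what it's worth, my rough understanding (a simplified sketch, not vLLM's exact code) is that the reserved space is the peak activation memory of the profiling forward pass the engine runs at startup; with chunked prefill disabled and max_model_len=131072 that pass covers a full 128k-token prefill, and only what is left of the gpu_memory_utilization budget afterwards is carved into KV-cache blocks:

```python
# Simplified sketch of the KV-cache budgeting (illustrative numbers, not measured).
GiB = 1024**3
budget = 0.9 * 80 * GiB              # gpu_memory_utilization=0.9 on an A800-80GB

weights = 8.49 * GiB                 # "Loading model weights took 8.4939 GB"
peak_activation = 19 * GiB           # assumed peak of the 128k-token profiling prefill

# K+V * layers * kv_heads * head_dim * block_size * bf16 bytes (Llama-3.1-8B geometry)
block_bytes = 2 * 32 * 8 * 128 * 16 * 2

num_gpu_blocks = int((budget - weights - peak_activation) // block_bytes)
print(num_gpu_blocks)                # ~22.8k, the same order as the 22813 blocks in the log
```

Under this reading, capping max_model_len (or enabling chunked prefill, which bounds the prefill size processed at once) shrinks the peak_activation term, so more of the budget ends up in the KV cache, which is what the ~90% utilization run showed.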