vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Requests larger than 75k input tokens cause `Input prompt (512 tokens) is too long and exceeds the capacity of block_manager` error #7878

Open servient-ashwin opened 3 months ago

servient-ashwin commented 3 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.27.7
Libc version: glibc-2.26

Python version: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.10.220-209.869.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             3364.353
BogoMIPS:            5299.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] numpy                 1.26.2        pypi_0  pypi
[conda] nvidia-nccl-cu11      2.14.3        pypi_0  pypi
[conda] nvidia-nccl-cu12      2.20.5        pypi_0  pypi
[conda] torch                 2.4.0         pypi_0  pypi
[conda] torchvision           0.19.0        pypi_0  pypi
[conda] triton                3.0.0         pypi_0  pypi
[conda] vllm-nccl-cu12        2.18.1.0.4.0  pypi_0  pypi

ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X             0-7            0            N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I have a server up and running with this command:

```shell
vllm serve Mistral-Nemo-Instruct-2407/ --port 8006 --gpu-memory-utilization 0.9 --max-model-len 128000 --tensor-parallel-size 1 --pipeline-parallel-size 2 --quantization fp8 --uvicorn-log-level debug
```

This runs on two separate NVIDIA GPUs. However, I have recently started noticing an error that I do not recall seeing before. I am using documents that are up to 125k tokens in size:

```text
Input prompt (512 tokens) is too long and exceeds the capacity of block_manager
```

I have looked through the issues list and tried what I think are the likely solutions: the v2 block manager, and setting max_num_batched_tokens to all sorts of values (even, outrageously, the full context window of the model), but I keep seeing that error, with the reported token count replaced by whatever max_num_batched_tokens is set to.

I have also tried enabling/disabling chunked prefill, and that didn't help either. I am not sure what else is left to try and am looking for help with this problem.
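For reference, here is a sketch of the kind of launch variations I tried; the specific values below are illustrative rather than an exact record, but the flags are the same ones that appear in the startup log further down:

```shell
# Same server launch as above, with the block-manager / batching knobs added.
# The values here are illustrative; I tried a range of them.
vllm serve Mistral-Nemo-Instruct-2407/ \
    --port 8006 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 128000 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --quantization fp8 \
    --use-v2-block-manager \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill=False
```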

The output of vLLM once the serve command is executed:

```text
INFO 08-26 19:18:49 api_server.py:440] vLLM API server version 0.5.5
INFO 08-26 19:18:49 api_server.py:441] args: Namespace(model_tag='Mistral-Nemo-Instruct-2407/', host='langmodel2', port=8006, uvicorn_log_level='debug', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='Mistral-Nemo-Instruct-2407/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=128000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=2, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='fp8', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=)
INFO 08-26 19:18:49 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/db61d9b2-4054-4ebb-92a8-3a5b1d8a81ed for RPC Path.
INFO 08-26 19:18:49 api_server.py:161] Started engine process with PID 11035
INFO 08-26 19:18:54 config.py:813] Defaulting to use ray for distributed inference
WARNING 08-26 19:18:54 arg_utils.py:839] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-26 19:18:54 config.py:911] Chunked prefill is enabled with max_num_batched_tokens=512.
2024-08-26 19:18:54,825 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 10.0.4.226:6379...
2024-08-26 19:18:54,830 INFO worker.py:1779 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 08-26 19:18:54 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='Mistral-Nemo-Instruct-2407/', speculative_config=None, tokenizer='Mistral-Nemo-Instruct-2407/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Mistral-Nemo-Instruct-2407/, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-26 19:18:55 ray_gpu_executor.py:133] use_ray_spmd_worker: False
INFO 08-26 19:19:04 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-26 19:19:04 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=2169, ip=10.0.4.135) INFO 08-26 19:19:04 utils.py:975] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=2169, ip=10.0.4.135) INFO 08-26 19:19:04 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 19:19:04 model_runner.py:879] Starting to load model Mistral-Nemo-Instruct-2407/...
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00
```

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
servient-ashwin commented 3 months ago

The cutoff seems fairly arbitrary, but any request above roughly 75,000 input tokens seems to cause this issue, given that the model's context window is ~128k tokens.

nicklausbrown commented 3 months ago

I'm also having this issue. It seems to be a silent failure, possibly tied to longer input context lengths. I'm also using the new pipeline parallelism rather than tensor parallelism, which likely changes how the KV cache is distributed in a multi-GPU setup.
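For clarity, this is roughly the difference in launch configuration I mean, using the model from this issue as the example (the flags are the standard vLLM ones already shown above):

```shell
# Pipeline parallelism: consecutive layers are placed on different GPUs,
# so each GPU keeps the KV cache only for its own layers.
vllm serve Mistral-Nemo-Instruct-2407/ --pipeline-parallel-size 2 --tensor-parallel-size 1

# Tensor parallelism: each layer is sharded across both GPUs,
# so the KV cache of every layer is split between them.
vllm serve Mistral-Nemo-Instruct-2407/ --tensor-parallel-size 2 --pipeline-parallel-size 1
```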

youkaichao commented 2 months ago

@servient-ashwin if you have 2 L4 GPUs, I doubt you have enough memory to hold 125k tokens 👀

servient-ashwin commented 2 months ago

@youkaichao That could be the case; however, what would the memory requirements look like for a model like this? I am even using fp8 quantization to load the model across two GPUs, with pipeline_parallel_size=2 and tensor_parallel_size=1.
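For a rough sense of scale, here is a back-of-the-envelope estimate of just the KV cache for one maximal prompt. The architecture numbers below (40 layers, 8 KV heads, head dimension 128) are assumed Mistral-Nemo values, and the KV cache is assumed to stay in bf16 since kv_cache_dtype=auto, so treat this as a sketch rather than an exact figure:

```shell
# Approximate KV cache for a single ~125k-token prompt, assuming
# Mistral-Nemo-style dimensions: 40 layers, 8 KV heads, head_dim 128,
# 2 bytes/element (bf16 KV cache), and 2 tensors (K and V) per layer.
BYTES_PER_TOKEN=$((2 * 40 * 8 * 128 * 2))         # = 163840 bytes (~160 KiB) per token
TOTAL_BYTES=$((BYTES_PER_TOKEN * 125000))
echo "$((TOTAL_BYTES / 1024 / 1024 / 1024)) GiB"  # prints roughly 19 GiB of KV cache
```

That KV cache comes on top of the fp8 weights themselves (roughly 12 GB for a ~12B-parameter model, split across the two pipeline stages), so two 24 GB L4s do not leave much headroom.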

youkaichao commented 2 months ago

vLLM will print a log line like `# GPU blocks: 790`. Multiply that number by 16 (the block size), and you get roughly the maximum number of tokens that can be served with the current configuration.
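In shell terms, the heuristic looks like this (790 is just the example number from the log line above; 16 is the default block size shown in the startup args):

```shell
# Estimate KV-cache token capacity from the "# GPU blocks" log line.
GPU_BLOCKS=790   # value printed by vLLM at startup (example from above)
BLOCK_SIZE=16    # default --block-size
echo $((GPU_BLOCKS * BLOCK_SIZE))   # -> 12640 tokens of KV-cache capacity
```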

servient-ashwin commented 2 months ago

Yes indeed; however, from the logs I shared I get this:

```text
INFO 08-26 19:19:06 distributed_gpu_executor.py:56] # GPU blocks: 9650, # CPU blocks: 3276
```

which means 9650 * 16 = ~154k tokens, I should say, but in practice the usable capacity seems to be about half of that right now.

I think I now understand what you mean. I also found #7039, which is very close to what I am facing right now.