[Bug]: KV Cache Quantization with GGUF turns out quite poorly.

Your current environment

The output of `python collect_env.py`

I attempted to run that, but it threw errors. I'm running this in docker on Windows 11

Model Input Dumps

No response

🐛 Describe the bug

 docker run --runtime nvidia --gpus all `
>>     -v "D:\AIModels:/models" `
>>     -p 8000:8000 `
>>     --ipc=host `
>>     -e VLLM_ATTENTION_BACKEND=FLASHINFER `
>>     vllm/vllm-openai:latest `
>>     --model "/models/MaziyarPanahi/Qwen2.5-7B-Instruct-abliterated-v2-GGUF/Qwen2.5-7B-Instruct-abliterated-v2.Q5_K_M.gguf" `
>>     --tokenizer "Qwen/Qwen2.5-7B-Instruct" `
>>     --kv-cache-dtype fp8_e4m3

I've tried both fp8_e5m2 and fp8_e4m3.

The model works perfectly without kv-cache quantization.

When I enable it, the model gets exceptionally bad. With e5m2 it just repeats a word over and over. With e4m3 it gets about half a sentence in, then also repeats over forever.

I can understand perhaps a some loss in precision, but not basically 100%. It's essentially non-functional with the FP8 kv-cache. Without the quantized cache the model is on the level of GPT-3.5.

I had read that there shouldn't be much difference between the two, so I though perhaps it wasn't working right because it's a GGUF or something to do with that.

Here are the init logs:

PS D:\AITools\vllm> docker run --runtime nvidia --gpus all `
>>     -v "D:\AIModels:/models" `
>>     -p 8000:8000 `
>>     --ipc=host `
>>     -e VLLM_ATTENTION_BACKEND=FLASHINFER `
>>     vllm/vllm-openai:latest `
>>     --model "/models/MaziyarPanahi/Qwen2.5-7B-Instruct-abliterated-v2-GGUF/Qwen2.5-7B-Instruct-abliterated-v2.Q5_K_M.gguf" `
>>     --tokenizer "Qwen/Qwen2.5-7B-Instruct" `
>>     --kv-cache-dtype fp8_e4m3
INFO 11-17 17:57:32 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-17 17:57:32 api_server.py:586] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/models/MaziyarPanahi/Qwen2.5-7B-Instruct-abliterated-v2-GGUF/Qwen2.5-7B-Instruct-abliterated-v2.Q5_K_M.gguf', task='auto', tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='fp8_e4m3', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 11-17 17:57:32 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/c559fdd5-610d-45c9-b324-75fc77e0c2ff for IPC Path.
INFO 11-17 17:57:32 api_server.py:194] Started engine process with PID 29
INFO 11-17 17:57:45 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-17 17:57:48 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-17 17:57:48 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-17 17:57:48 config.py:428] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-17 17:57:48 config.py:758] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 11-17 17:57:48 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-17 17:57:51 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-17 17:57:51 config.py:428] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-17 17:57:51 config.py:758] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 11-17 17:57:51 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-17 17:57:51 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/models/MaziyarPanahi/Qwen2.5-7B-Instruct-abliterated-v2-GGUF/Qwen2.5-7B-Instruct-abliterated-v2.Q5_K_M.gguf', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=fp8_e4m3, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models/MaziyarPanahi/Qwen2.5-7B-Instruct-abliterated-v2-GGUF/Qwen2.5-7B-Instruct-abliterated-v2.Q5_K_M.gguf, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
WARNING 11-17 17:58:06 utils.py:720] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 11-17 17:58:06 selector.py:170] Using Flashinfer backend.
INFO 11-17 17:58:07 model_runner.py:1072] Starting to load model /models/MaziyarPanahi/Qwen2.5-7B-Instruct-abliterated-v2-GGUF/Qwen2.5-7B-Instruct-abliterated-v2.Q5_K_M.gguf...
/usr/local/lib/python3.12/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
INFO 11-17 17:58:56 model_runner.py:1077] Loading model weights took 5.1287 GB
INFO 11-17 18:00:22 worker.py:232] Memory profiling results: total_gpu_memory=24.00GiB initial_memory_usage=6.44GiB peak_torch_memory=10.68GiB memory_usage_post_profile=6.71GiB non_torch_memory=1.32GiB kv_cache_size=9.61GiB gpu_memory_utilization=0.90
INFO 11-17 18:00:22 gpu_executor.py:113] # GPU blocks: 22486, # CPU blocks: 9362
INFO 11-17 18:00:22 gpu_executor.py:117] Maximum concurrency for 32768 tokens per request: 10.98x
INFO 11-17 18:00:23 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-17 18:00:23 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-17 18:00:51 model_runner.py:1518] Graph capturing finished in 30 secs, took 7.66 GiB
INFO 11-17 18:00:51 api_server.py:249] vLLM to use /tmp/tmpn8w0lmeq as PROMETHEUS_MULTIPROC_DIR
INFO 11-17 18:00:51 launcher.py:19] Available routes are:
INFO 11-17 18:00:51 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 11-17 18:00:51 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 11-17 18:00:51 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 11-17 18:00:51 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 11-17 18:00:51 launcher.py:27] Route: /health, Methods: GET
INFO 11-17 18:00:51 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-17 18:00:51 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-17 18:00:51 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-17 18:00:51 launcher.py:27] Route: /version, Methods: GET
INFO 11-17 18:00:51 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-17 18:00:51 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-17 18:00:51 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 11-17 18:00:59 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Seems that this is an issue specific to k-quants (might be i-matrix as well). I can generate reasonable outputs with standard quantization (Q8_0) checkpoints:

$ python examples/offline_inference.py --model ../Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct-q8_0.gguf --kv-cache-dtype fp8 --max-tokens 128
/root/miniconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 11-18 13:28:31 __init__.py:31] No plugins found.
INFO 11-18 13:28:31 __init__.py:31] No plugins found.
INFO 11-18 13:28:48 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 11-18 13:28:52 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
WARNING 11-18 13:28:52 config.py:428] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 11-18 13:28:52 config.py:758] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 11-18 13:28:52 arg_utils.py:1065] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-18 13:28:52 llm_engine.py:249] Initializing an LLM engine (vdev) with config: model='../Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct-q8_0.gguf', speculative_config=None, tokenizer='../Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct-q8_0.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=../Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct-q8_0.gguf, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None, pooler_config=None)
INFO 11-18 13:29:29 __init__.py:31] No plugins found.
INFO 11-18 13:29:29 selector.py:271] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 11-18 13:29:29 selector.py:273] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 11-18 13:29:29 selector.py:144] Using XFormers backend.
INFO 11-18 13:29:30 model_runner.py:1072] Starting to load model ../Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct-q8_0.gguf...
/root/miniconda3/lib/python3.12/site-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
INFO 11-18 13:29:45 model_runner.py:1077] Loading model weights took 1.7820 GB
INFO 11-18 13:29:50 worker.py:232] Memory profiling results: total_gpu_memory=23.65GiB initial_memory_usage=2.25GiB peak_torch_memory=4.25GiB memory_usage_post_profile=2.25GiB non_torch_memory=0.47GiB kv_cache_size=16.57GiB gpu_memory_utilization=0.90
INFO 11-18 13:29:51 gpu_executor.py:113] # GPU blocks: 77556, # CPU blocks: 18724
INFO 11-18 13:29:51 gpu_executor.py:117] Maximum concurrency for 32768 tokens per request: 37.87x
INFO 11-18 13:29:53 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-18 13:29:53 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-18 13:30:04 model_runner.py:1518] Graph capturing finished in 10 secs, took 1.55 GiB
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  4.56it/s, est. speed input: 25.10 toks/s, output: 584.14 toks/s]
Prompt: 'Hello, my name is', Generated text: " Josh and I'm here to help you understand what is actually going on with your kids. I'm a life coach and former public speaker, and I have a Masters in Psychology. I help parents, educators, and children make changes in their lives for the better, one step at a time. This site will help me to better serve you. I'll be happy to answer any questions or make a call or email you for you're interested. My contact information can be found on the home page. I am dedicated to helping you and your family. Thank you for checking out my site. Josh.\nCould you you use those words to describe"
Prompt: 'The president of the United States is', Generated text: ' the head of state and the head of government for the United States. The president serves as the commander-in-chief of the United States Armed Forces, has the power to appoint federal judges, and conducts foreign policy. The president also has the power to veto bills, and by using the power of the pardon, can issue pardons. The president is also the chief diplomat of the United States, can issue executive orders, has the power to grant reprieves, and has the power to commute sentences. In addition, the president can also offer a bill that proposes a constitutional amendment to Congress, and then the House of Representatives, and the Senate'
Prompt: 'The capital of France is', Generated text: ' Paris, located in the north of the country. The name Paris comes from the Latin form of the Greek name Paris, which means "piece of land" or "piece of land with Troy." The Romans brought the name when they conquered the region.\n\nThe original name of the city was Lutum, and the Romans renamed it after the legendary king of Troy, Paris, who stole Helen, the queen of Sparta, from her husband Menelaus. This name translates to "Troy\'s Place."\n\nIn the 1st century AD, the Romans renamed it to Lutum, which means "little Troy" or "T'
Prompt: 'The future of AI is', Generated text: ' an exciting one, and it’s currently been a rather challenging one for AI developers. With so being a new technology, developers are still working to finding ways to make their AI models more accurate and better. One of the biggest issues is that AI models are often overfitting to their training data, which an expensive and inefficient model. This is a major problem as it requires a large amount of data and computational resources to train, which are limited for many organizations.\nThis is where the standard approach to solve the overfitting problem, which is to train the model for more data, which an expensive and inefficient model. This is a major'

vllm-project / vllm