vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Why is the avg. generation throughput so low? #4760

Open rvsh2 opened 1 month ago

rvsh2 commented 1 month ago

Report of performance regression

Hi, I start the server with this command:

python3 server_vllm.py \
  --model "/data/models_temp/functionary-small-v2.4/" \
  --served-model-name "functionary" \
  --dtype=bfloat16 \
  --max-model-len 2048 \
  --host 0.0.0.0 \
  --port 8000 \
  --enforce-eager \
  --gpu-memory-utilization 0.94

on an RTX 3090 (24 GB).

Why am I getting such low speed?

Avg prompt throughput: 102.2 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
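For reference, this is roughly how the speed can be checked from the client side (a minimal sketch using the openai Python client against the OpenAI-compatible endpoint started above; the model name and port come from the flags, and counting stream chunks only approximates one token per chunk):

import time
from openai import OpenAI  # pip install openai

# Point at the server launched above; the key is not checked by the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="functionary",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    chunks += 1  # roughly one generated token per streamed chunk

elapsed = time.time() - start
print(f"~{chunks / elapsed:.1f} tokens/s over {elapsed:.1f}s")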

This is my config:

| INFO 05-11 08:17:48 server_vllm.py:473] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name='functionary', grammar_sampling=False, model='/data/models_temp/functionary-small-v2.4/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.94, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
functionary  | You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
functionary  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary  | INFO 05-11 08:17:49 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/data/models_temp/functionary-small-v2.4/', speculative_config=None, tokenizer='/data/models_temp/functionary-small-v2.4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
functionary  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary  | INFO 05-11 08:17:50 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
functionary  | INFO 05-11 08:17:50 selector.py:28] Using FlashAttention backend.
functionary  | INFO 05-11 08:17:53 model_runner.py:173] Loading model weights took 13.4976 GB
functionary  | INFO 05-11 08:17:53 gpu_executor.py:119] # GPU blocks: 4185, # CPU blocks: 2048
functionary  | INFO:     Started server process [19]
functionary  | INFO:     Waiting for application startup.
functionary  | INFO:     Application startup complete.
functionary  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
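For context, the "# GPU blocks" line above implies the total KV-cache capacity (a rough back-of-the-envelope: in vLLM's paged KV cache each block holds block_size tokens, and block_size=16 per the args):

# Rough KV-cache capacity implied by the log above.
block_size = 16    # from args: block_size=16
gpu_blocks = 4185  # from log: "# GPU blocks: 4185"
cpu_blocks = 2048  # from log: "# CPU blocks: 2048" (swap space)

print(gpu_blocks * block_size)  # 66960 token slots of GPU KV cache
print(cpu_blocks * block_size)  # 32768 token slots of CPU swap space

At the reported 0.8% usage that is only ~536 tokens in flight, so the KV cache itself is nowhere near full here.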
AayushSameerShah commented 1 week ago

Any update on this, buddy?