rvsh2 opened this issue 1 month ago
Hi, I'm launching the server with:
```
server_vllm.py \
  --model "/data/models_temp/functionary-small-v2.4/" \
  --served-model-name "functionary" \
  --dtype=bfloat16 \
  --max-model-len 2048 \
  --host 0.0.0.0 \
  --port 8000 \
  --enforce-eager \
  --gpu-memory-utilization 0.94
```
This is on an RTX 3090 (24 GB).
Why is my speed so low?

```
Avg prompt throughput: 102.2 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
```
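For what it's worth, the 0.8% KV-cache figure is easy to sanity-check from the startup log (`block_size=16` tokens per block and `# GPU blocks: 4185`, both taken from the config dump below). The cache is nearly empty, so KV-cache capacity is clearly not the bottleneck here:

```python
# Back-of-envelope KV-cache math, using numbers from the vLLM startup log:
# block_size=16 (tokens per KV block) and "# GPU blocks: 4185".
BLOCK_SIZE = 16
GPU_BLOCKS = 4185

capacity_tokens = BLOCK_SIZE * GPU_BLOCKS
print(capacity_tokens)  # 66960 tokens of KV cache fit on the GPU

# 0.8% usage therefore corresponds to roughly this many cached tokens:
print(round(capacity_tokens * 0.008))  # 536
```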
This is my config:
```
INFO 05-11 08:17:48 server_vllm.py:473] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name='functionary', grammar_sampling=False, model='/data/models_temp/functionary-small-v2.4/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.94, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
functionary | You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
functionary | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary | INFO 05-11 08:17:49 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/data/models_temp/functionary-small-v2.4/', speculative_config=None, tokenizer='/data/models_temp/functionary-small-v2.4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
functionary | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary | INFO 05-11 08:17:50 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
functionary | INFO 05-11 08:17:50 selector.py:28] Using FlashAttention backend.
functionary | INFO 05-11 08:17:53 model_runner.py:173] Loading model weights took 13.4976 GB
functionary | INFO 05-11 08:17:53 gpu_executor.py:119] # GPU blocks: 4185, # CPU blocks: 2048
functionary | INFO:     Started server process [19]
functionary | INFO:     Waiting for application startup.
functionary | INFO:     Application startup complete.
functionary | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
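One thing worth trying (a guess on my end, not a confirmed fix): `--enforce-eager` disables CUDA graph capture, and eager-mode execution can noticeably slow down single-request decode in vLLM. The same launch without that flag would be:

```shell
# Same invocation as above, minus --enforce-eager, so vLLM can capture
# CUDA graphs for the decode path (uses somewhat more GPU memory).
server_vllm.py \
  --model "/data/models_temp/functionary-small-v2.4/" \
  --served-model-name "functionary" \
  --dtype=bfloat16 \
  --max-model-len 2048 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.94
```

If memory gets tight without eager mode, lowering `--gpu-memory-utilization` slightly may be needed.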
Any update on this, buddy?