yitianlian opened this issue 1 month ago
You can pass the include_stop_str_in_output parameter to the request. See a full list of parameters here.
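
For reference, a minimal sketch of passing that parameter through the OpenAI-compatible client; the base URL, API key, and served model name below are placeholders and should match however you launched the server:

from openai import OpenAI

# Placeholder endpoint and credentials; adjust to your own server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# vLLM-specific extras such as include_stop_str_in_output go in extra_body,
# which the OpenAI-compatible server forwards to its sampling parameters.
response = client.chat.completions.create(
    model="your-served-model-name",  # placeholder: the path passed to --model unless --served-model-name is set
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"include_stop_str_in_output": True},
)
print(response.choices[0].message.content)
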
Sorry for any confusion I may have caused. I want to check whether the chat template used at inference time is the same as the template used during training.
To check this, I would use a debugger and set a breakpoint where the input is passed to the engine inside the OpenAIServing* class.
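
If you prefer not to attach a debugger, a rough offline check is to render the same messages with the checkpoint's tokenizer and compare the result against the prompt your training pipeline produces. This is only a sketch; the model path and messages are placeholders:

from transformers import AutoTokenizer

# Placeholder path; use the same checkpoint directory that vLLM serves.
tokenizer = AutoTokenizer.from_pretrained("/path/to/hf_model")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]

# This should correspond to the rendering the OpenAI-compatible server
# performs before tokenization when no custom template is supplied.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # compare against the prompt string used during training
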
When I used vLLM 0.5.1+post, the chat template was printed in the log when launching the model, but in the new version this has disappeared. Is there a parameter that controls this logging?
Can you show the logs and the command you used to launch the server?
INFO 08-02 11:02:42 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-02 11:02:42 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-02 11:02:42 config.py:715] Defaulting to use mp for distributed inference
INFO 08-02 11:02:42 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model', speculative_config=None, tokenizer='/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 11:02:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456803) DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456802) DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456804) DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
INFO 08-02 11:02:45 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fef5db302d0>, local_subscribe_port=42691, local_sync_port=45207, remote_subscribe_port=None, remote_sync_port=None)
INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
Loading safetensors checkpoint shards: 0% Completed | 0/30 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 3% Completed | 1/30 [00:01<00:39, 1.37s/it]
Loading safetensors checkpoint shards: 7% Completed | 2/30 [00:03<00:43, 1.54s/it]
Loading safetensors checkpoint shards: 10% Completed | 3/30 [00:05<00:56, 2.08s/it]
Loading safetensors checkpoint shards: 13% Completed | 4/30 [00:07<00:48, 1.85s/it]
Loading safetensors checkpoint shards: 17% Completed | 5/30 [00:08<00:43, 1.73s/it]
Loading safetensors checkpoint shards: 20% Completed | 6/30 [00:10<00:43, 1.80s/it]
Loading safetensors checkpoint shards: 23% Completed | 7/30 [00:12<00:42, 1.85s/it]
Loading safetensors checkpoint shards: 27% Completed | 8/30 [00:14<00:39, 1.80s/it]
Loading safetensors checkpoint shards: 30% Completed | 9/30 [00:16<00:38, 1.83s/it]
Loading safetensors checkpoint shards: 33% Completed | 10/30 [00:17<00:33, 1.68s/it]
Loading safetensors checkpoint shards: 37% Completed | 11/30 [00:19<00:31, 1.67s/it]
Loading safetensors checkpoint shards: 40% Completed | 12/30 [00:20<00:28, 1.57s/it]
Loading safetensors checkpoint shards: 43% Completed | 13/30 [00:22<00:27, 1.62s/it]
Loading safetensors checkpoint shards: 47% Completed | 14/30 [00:24<00:28, 1.76s/it]
Loading safetensors checkpoint shards: 50% Completed | 15/30 [00:25<00:25, 1.70s/it]
Loading safetensors checkpoint shards: 53% Completed | 16/30 [00:27<00:23, 1.70s/it]
Loading safetensors checkpoint shards: 57% Completed | 17/30 [00:29<00:22, 1.71s/it]
Loading safetensors checkpoint shards: 60% Completed | 18/30 [00:31<00:22, 1.89s/it]
Loading safetensors checkpoint shards: 63% Completed | 19/30 [00:34<00:23, 2.11s/it]
Loading safetensors checkpoint shards: 67% Completed | 20/30 [00:35<00:18, 1.86s/it]
Loading safetensors checkpoint shards: 70% Completed | 21/30 [00:37<00:16, 1.81s/it]
Loading safetensors checkpoint shards: 73% Completed | 22/30 [00:39<00:14, 1.87s/it]
Loading safetensors checkpoint shards: 77% Completed | 23/30 [00:40<00:10, 1.55s/it]
Loading safetensors checkpoint shards: 80% Completed | 24/30 [00:41<00:09, 1.52s/it]
Loading safetensors checkpoint shards: 83% Completed | 25/30 [00:43<00:07, 1.56s/it]
Loading safetensors checkpoint shards: 87% Completed | 26/30 [00:44<00:06, 1.59s/it]
(VllmWorkerProcess pid=456804) INFO 08-02 11:03:31 model_runner.py:692] Loading model weights took 32.8657 GB
Loading safetensors checkpoint shards: 90% Completed | 27/30 [00:47<00:05, 1.76s/it]
Loading safetensors checkpoint shards: 93% Completed | 28/30 [00:48<00:03, 1.64s/it]
(VllmWorkerProcess pid=456802) INFO 08-02 11:03:34 model_runner.py:692] Loading model weights took 32.8657 GB
Loading safetensors checkpoint shards: 97% Completed | 29/30 [00:49<00:01, 1.55s/it]
(VllmWorkerProcess pid=456803) INFO 08-02 11:03:35 model_runner.py:692] Loading model weights took 32.8657 GB
Loading safetensors checkpoint shards: 100% Completed | 30/30 [00:50<00:00, 1.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 30/30 [00:50<00:00, 1.69s/it]
INFO 08-02 11:03:36 model_runner.py:692] Loading model weights took 32.8657 GB
INFO 08-02 11:03:43 distributed_gpu_executor.py:56] # GPU blocks: 26651, # CPU blocks: 3276
INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456803) INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=456803) INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456804) INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=456804) INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456802) INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=456802) INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456804) INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses
(VllmWorkerProcess pid=456802) INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses
INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses
(VllmWorkerProcess pid=456803) INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses
(VllmWorkerProcess pid=456804) INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.
(VllmWorkerProcess pid=456802) INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.
INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.
(VllmWorkerProcess pid=456803) INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.
WARNING 08-02 11:04:06 serving_embedding.py:170] embedding_mode is False. Embedding API will not work.
INFO 08-02 11:04:06 api_server.py:292] Available routes are:
INFO 08-02 11:04:06 api_server.py:297] Route: /openapi.json, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /docs, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /redoc, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /health, Methods: GET
INFO 08-02 11:04:06 api_server.py:297] Route: /tokenize, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-02 11:04:06 api_server.py:297] Route: /version, Methods: GET
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO: Started server process [456412]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
export VLLM_LOGGING_LEVEL=DEBUG
MODEL_PATH=./hf_model
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --model $MODEL_PATH \
--port 8000 \
--api-key token-abc123 \
--tensor-parallel-size 4
The chat template is only printed if you supply your own in the command (pretty sure that has always been the case); otherwise, the template from HuggingFace is used automatically.
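
As a quick sanity check, you can print the template that will be picked up by default. This is only a sketch, assuming your checkpoint directory ships a chat template in its tokenizer_config.json; the path is a placeholder:

from transformers import AutoTokenizer

# Placeholder path; point this at the directory passed to --model.
tokenizer = AutoTokenizer.from_pretrained("/path/to/hf_model")

# None here would mean the checkpoint provides no template and you need to
# supply one yourself when launching the server.
print(tokenizer.chat_template)
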
Thank you for the information. Could you tell me which parameter I should pass? Thank you~
You can pass your own chat template using the --chat-template flag. See here for more details.
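
For example, before pointing --chat-template at a Jinja file, you can render it locally to make sure it matches your training format. A sketch under the assumption that the template file name, model path, and messages below are placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/hf_model")  # placeholder path

# Read the custom template you plan to pass via --chat-template.
with open("my_chat_template.jinja") as f:
    custom_template = f.read()

messages = [{"role": "user", "content": "Hello"}]

# apply_chat_template accepts an explicit chat_template override, so this
# should approximate what the server renders when the flag is set.
prompt = tokenizer.apply_chat_template(
    messages,
    chat_template=custom_template,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
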