vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: I updated vLLM to the latest version and discovered that when I launch the server, I can't see the output of special tokens like `<bos>`. How can I get them? #7033

Open yitianlian opened 1 month ago

yitianlian commented 1 month ago

Your current environment


Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.91-014-kangaroo.2.10.13.5c249cdaf.x86_64-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.199.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          96
On-line CPU(s) list:             0-95
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              96
Socket(s):                       1
Stepping:                        6
BogoMIPS:                       5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.5 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        120 MiB (96 instances)
L3 cache:                        48 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] numpydoc==1.7.0
[conda] _anaconda_depends         2024.06             py312_mkl_2  
[conda] blas                      1.0                         mkl  
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0           py312h5eee18b_1  
[conda] mkl_fft                   1.3.8           py312h5eee18b_0  
[conda] mkl_random                1.2.4           py312hdb19cb5_0  
[conda] numpy                     1.26.4          py312hc5e2394_0  
[conda] numpy-base                1.26.4          py312h0da6c21_0  
[conda] numpydoc                  1.7.0           py312h06a4308_0  
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PHB     PHB     PHB     PHB     0-95            N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PHB     PHB     PHB     PHB     0-95            N/A
mlx5_0  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB     PHB
mlx5_1  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB
mlx5_2  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB
mlx5_3  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

### How would you like to use vllm

I want to run inference with my own models, but I don't know how to show their special tokens in the output.
DarkLight1337 commented 1 month ago

You can pass the `include_stop_str_in_output` parameter to the request. See a full list of parameters here.
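For reference, here is a minimal sketch of passing that parameter through the OpenAI-compatible endpoint with the `openai` Python client via `extra_body`, assuming the server launched later in this thread (port 8000, API key `token-abc123`). `skip_special_tokens=False` is an additional vLLM sampling parameter, not mentioned above, that as far as I know controls whether special tokens such as `<bos>` are stripped from the returned text:

```python
# Sketch only: send a chat request with vLLM-specific sampling parameters.
# The extra fields are forwarded to the server through `extra_body`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

completion = client.chat.completions.create(
    model="./hf_model",  # use the model name reported by GET /v1/models
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "include_stop_str_in_output": True,  # keep the stop string in the output
        "skip_special_tokens": False,        # assumption: keep special tokens in the output
    },
)
print(completion.choices[0].message.content)
```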

yitianlian commented 1 month ago

Sorry for any confusion I may have caused. I want to check whether the chat template used at inference time is the same as the template used during training.

DarkLight1337 commented 1 month ago

To check this, I would use a debugger and set a breakpoint where the input is passed to the engine inside the OpenAIServing* class.
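As a rough complement to the debugger route (not the approach described above), the tokenizer that vLLM loads from the model directory can be inspected directly to see which chat template would be applied. A sketch, assuming `transformers` is installed and `MODEL_PATH` is the same directory passed to `--model`:

```python
# Sketch: print the chat template carried by the tokenizer in MODEL_PATH and
# render a sample conversation with it, so it can be diffed against the
# template used during training.
from transformers import AutoTokenizer

MODEL_PATH = "./hf_model"  # same directory passed to --model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Raw Jinja template stored in tokenizer_config.json (None if absent).
print(tokenizer.chat_template)

# Roughly what the server turns this conversation into before tokenizing,
# when no --chat-template override is supplied.
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```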

yitianlian commented 1 month ago

When I used vLLM version 0.5.1+post, the chat template was printed in the log when launching the model, but in the new version this has disappeared. Is there any parameter that controls this logging?

DarkLight1337 commented 1 month ago

Can you show the logs and also the command which you've used to launch the server?

yitianlian commented 1 month ago
INFO 08-02 11:02:42 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-02 11:02:42 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-abc123', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 08-02 11:02:42 config.py:715] Defaulting to use mp for distributed inference
INFO 08-02 11:02:42 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model', speculative_config=None, tokenizer='/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 11:02:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:43 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456803) DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456802) DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456804) DEBUG 08-02 11:02:43 parallel_state.py:803] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:55171 backend=nccl
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:44 utils.py:784] Found nccl from library libnccl.so.2
INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:44 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:44 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/xiechengxing/.cache/vllm/gpu_p2p_access_cache_for_4,5,6,7.json
INFO 08-02 11:02:45 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fef5db302d0>, local_subscribe_port=42691, local_sync_port=45207, remote_subscribe_port=None, remote_sync_port=None)
INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
(VllmWorkerProcess pid=456802) INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
(VllmWorkerProcess pid=456804) INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
(VllmWorkerProcess pid=456803) INFO 08-02 11:02:45 model_runner.py:680] Starting to load model /cpfs01/shared/Llm_code/gaochang/xtuner_work_dirs/llama3/70b/swe_bench_json/2_0/64k_1e-5/hf_model...
Loading safetensors checkpoint shards:   0% Completed | 0/30 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   3% Completed | 1/30 [00:01<00:39,  1.37s/it]
Loading safetensors checkpoint shards:   7% Completed | 2/30 [00:03<00:43,  1.54s/it]
Loading safetensors checkpoint shards:  10% Completed | 3/30 [00:05<00:56,  2.08s/it]
Loading safetensors checkpoint shards:  13% Completed | 4/30 [00:07<00:48,  1.85s/it]
Loading safetensors checkpoint shards:  17% Completed | 5/30 [00:08<00:43,  1.73s/it]
Loading safetensors checkpoint shards:  20% Completed | 6/30 [00:10<00:43,  1.80s/it]
Loading safetensors checkpoint shards:  23% Completed | 7/30 [00:12<00:42,  1.85s/it]
Loading safetensors checkpoint shards:  27% Completed | 8/30 [00:14<00:39,  1.80s/it]
Loading safetensors checkpoint shards:  30% Completed | 9/30 [00:16<00:38,  1.83s/it]
Loading safetensors checkpoint shards:  33% Completed | 10/30 [00:17<00:33,  1.68s/it]
Loading safetensors checkpoint shards:  37% Completed | 11/30 [00:19<00:31,  1.67s/it]
Loading safetensors checkpoint shards:  40% Completed | 12/30 [00:20<00:28,  1.57s/it]
Loading safetensors checkpoint shards:  43% Completed | 13/30 [00:22<00:27,  1.62s/it]
Loading safetensors checkpoint shards:  47% Completed | 14/30 [00:24<00:28,  1.76s/it]
Loading safetensors checkpoint shards:  50% Completed | 15/30 [00:25<00:25,  1.70s/it]
Loading safetensors checkpoint shards:  53% Completed | 16/30 [00:27<00:23,  1.70s/it]
Loading safetensors checkpoint shards:  57% Completed | 17/30 [00:29<00:22,  1.71s/it]
Loading safetensors checkpoint shards:  60% Completed | 18/30 [00:31<00:22,  1.89s/it]
Loading safetensors checkpoint shards:  63% Completed | 19/30 [00:34<00:23,  2.11s/it]
Loading safetensors checkpoint shards:  67% Completed | 20/30 [00:35<00:18,  1.86s/it]
Loading safetensors checkpoint shards:  70% Completed | 21/30 [00:37<00:16,  1.81s/it]
Loading safetensors checkpoint shards:  73% Completed | 22/30 [00:39<00:14,  1.87s/it]
Loading safetensors checkpoint shards:  77% Completed | 23/30 [00:40<00:10,  1.55s/it]
Loading safetensors checkpoint shards:  80% Completed | 24/30 [00:41<00:09,  1.52s/it]
Loading safetensors checkpoint shards:  83% Completed | 25/30 [00:43<00:07,  1.56s/it]
Loading safetensors checkpoint shards:  87% Completed | 26/30 [00:44<00:06,  1.59s/it]
(VllmWorkerProcess pid=456804) INFO 08-02 11:03:31 model_runner.py:692] Loading model weights took 32.8657 GB
Loading safetensors checkpoint shards:  90% Completed | 27/30 [00:47<00:05,  1.76s/it]                                                                                                                                                                                   
Loading safetensors checkpoint shards:  93% Completed | 28/30 [00:48<00:03,  1.64s/it]                                                                                                                                                                                   
(VllmWorkerProcess pid=456802) INFO 08-02 11:03:34 model_runner.py:692] Loading model weights took 32.8657 GB                                                                                                                                                            
Loading safetensors checkpoint shards:  97% Completed | 29/30 [00:49<00:01,  1.55s/it]                                                                                                                                                                                   
(VllmWorkerProcess pid=456803) INFO 08-02 11:03:35 model_runner.py:692] Loading model weights took 32.8657 GB                                                                                                                                                            
Loading safetensors checkpoint shards: 100% Completed | 30/30 [00:50<00:00,  1.37s/it]                                                                                                                                                                                   
Loading safetensors checkpoint shards: 100% Completed | 30/30 [00:50<00:00,  1.69s/it]                                                                                                                                                                                   

INFO 08-02 11:03:36 model_runner.py:692] Loading model weights took 32.8657 GB                                                                                                                                                                                           
INFO 08-02 11:03:43 distributed_gpu_executor.py:56] # GPU blocks: 26651, # CPU blocks: 3276                                                                                                                                                                              
INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.                         
INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456803) INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=456803) INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456804) INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=456804) INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456802) INFO 08-02 11:03:47 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=456802) INFO 08-02 11:03:47 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=456804) INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses                                                                                                                                                       
(VllmWorkerProcess pid=456802) INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses                                                                                                                                                       
INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses                                                                                                                                                                                      
(VllmWorkerProcess pid=456803) INFO 08-02 11:04:06 custom_all_reduce.py:219] Registering 5635 cuda graph addresses                                                                                                                                                       
(VllmWorkerProcess pid=456804) INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.                                                                                                                                                            
(VllmWorkerProcess pid=456802) INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.                                                                                                                                                            
INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.                                                                                                                                                                                           
(VllmWorkerProcess pid=456803) INFO 08-02 11:04:06 model_runner.py:1181] Graph capturing finished in 19 secs.                                                                                                                                                            
WARNING 08-02 11:04:06 serving_embedding.py:170] embedding_mode is False. Embedding API will not work.                                                                                                                                                                   
INFO 08-02 11:04:06 api_server.py:292] Available routes are:                                                                                                                                                                                                             
INFO 08-02 11:04:06 api_server.py:297] Route: /openapi.json, Methods: HEAD, GET                                                                                                                                                                                          
INFO 08-02 11:04:06 api_server.py:297] Route: /docs, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /redoc, Methods: HEAD, GET
INFO 08-02 11:04:06 api_server.py:297] Route: /health, Methods: GET
INFO 08-02 11:04:06 api_server.py:297] Route: /tokenize, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-02 11:04:06 api_server.py:297] Route: /version, Methods: GET
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-02 11:04:06 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [456412]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
yitianlian commented 1 month ago
export VLLM_LOGGING_LEVEL=DEBUG 

MODEL_PATH=./hf_model
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --model $MODEL_PATH \
--port 8000 \
--api-key token-abc123 \
--tensor-parallel-size 4 
DarkLight1337 commented 1 month ago

The chat template is only printed if you supply your own in the command (pretty sure that has always been the case); otherwise, the template from HuggingFace is used automatically.

yitianlian commented 1 month ago

Thank you for the information. Could you tell me which parameter I should pass? Thank you~

DarkLight1337 commented 1 month ago

You can pass your own chat template using the `--chat-template` flag. See here for more details.
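For example, extending the launch command posted earlier in this thread (the `.jinja` path below is a placeholder for your own training-time template file):

```bash
# ./my_chat_template.jinja is a placeholder; point it at your own template.
MODEL_PATH=./hf_model
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --model $MODEL_PATH \
    --port 8000 \
    --api-key token-abc123 \
    --tensor-parallel-size 4 \
    --chat-template ./my_chat_template.jinja
```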