vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Use custom tokenizer to decode text when using llm.generate function #3493

Closed oscar-martin closed 6 months ago

oscar-martin commented 6 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-11
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
CPU family:                         6
Model:                              106
Thread(s) per core:                 1
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           6
BogoMIPS:                           4788.74
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          384 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           48 MiB (12 instances)
L3 cache:                           16 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-11
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-11    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

I have a fine-tuned model (from mistralai/Mistral-7B-v0.1) with additional tokens added and trained.

Model config.json:

{
  "_name_or_path": "mistralai/Mistral-7B-v0.1",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.38.2",
  "use_cache": false,
  "vocab_size": 32027
}

The original model has a vocab_size of 32000; I have added 27 additional tokens.
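
For reference, a quick way to inspect the added tokens with the Hugging Face tokenizer (a minimal sketch; "<model_and_tokenizer_id>" is the same placeholder used below):

from transformers import AutoTokenizer

# "<model_and_tokenizer_id>" is a placeholder for the fine-tuned model/tokenizer id
tokenizer = AutoTokenizer.from_pretrained("<model_and_tokenizer_id>")

print(len(tokenizer))  # expected: 32027 (32000 base tokens + 27 added)
print(tokenizer.convert_ids_to_tokens(list(range(32000, 32027))))  # the added tokens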

When I run inference with this model, the added tokens are not decoded in the generated text.

The code:

from vllm import LLM
from transformers import AutoTokenizer

model = "<model_and_tokenizer_id>"  # both the model and tokenizer use the same name (id)

tokenizer = AutoTokenizer.from_pretrained(model)
llm = LLM(model=model, tokenizer=model)  # Name or path of your model
output = llm.generate("<|begincontext|><|user|>I'm hungry. Find places to eat please.<|system|>Sure thing. Which city would you like to eat in?<|user|>Let's go with Foster City please.<|system|>Sure. What kind of food are you hungry for?<|user|>Spicy Indian sound really good.<|system|>One moment. I found a great restaurant called Pastries N Chaat in Foster City.<|user|>Give me other suggestions as well<|system|>How about, Tabla Indian Restaurant in Foster City?<|user|>Can you find out if they are average priced?<|system|>sure. The price range would be inexpensive.<|user|>Perfect. That works<|system|>Should I reserve for you?<|beginlastuserutterance|>Yes, go ahead and do that.<|endlastuserutterance|><|endcontext|>")

print(output)

Output (with a bit of formatting added for readability):

[RequestOutput(
request_id=0, 
prompt="<|begincontext|><|user|>I'm hungry. Find places to eat please.<|system|>Sure thing. Which city would you like to eat in?<|user|>Let's go with Foster City please.<|system|>Sure. What kind of food are you hungry for?<|user|>Spicy Indian sound really good.<|system|>One moment. I found a great restaurant called Pastries N Chaat in Foster City.<|user|>Give me other suggestions as well<|system|>How about, Tabla Indian Restaurant in Foster City?<|user|>Can you find out if they are average priced?<|system|>sure. The price range would be inexpensive.<|user|>Perfect. That works<|system|>Should I reserve for you?<|beginlastuserutterance|>Yes, go ahead and do that.<|endlastuserutterance|><|endcontext|>", 
prompt_token_ids=[32000, 32004, 32007, 315, 28742, 28719, 17160, 28723, 8769, 5563, 298, 5310, 4665, 28723, 32006, 12875, 1970, 28723, 9595, 2990, 682, 368, 737, 298, 5310, 297, 28804, 32007, 3169, 28742, 28713, 576, 395, 28517, 3805, 4665, 28723, 32006, 12875, 28723, 1824, 2112, 302, 2887, 460, 368, 17160, 354, 28804, 32007, 1670, 2451, 6735, 2622, 1528, 1179, 28723, 32006, 2387, 2470, 28723, 315, 1419, 264, 1598, 9926, 1987, 17860, 2040, 418, 689, 7748, 297, 28517, 3805, 28723, 32007, 16104, 528, 799, 17278, 390, 1162, 32006, 1602, 684, 28725, 14319, 2220, 6735, 23657, 440, 297, 28517, 3805, 28804, 32007, 2418, 368, 1300, 575, 513, 590, 460, 5151, 724, 5200, 28804, 32006, 1864, 28723, 415, 4144, 2819, 682, 347, 297, 5128, 4097, 28723, 32007, 24443, 28723, 1725, 3791, 32006, 10934, 315, 17575, 354, 368, 28804, 32008, 5592, 28725, 576, 6280, 304, 511, 369, 28723, 32009, 32005], 
prompt_logprobs=None, 

outputs=[
CompletionOutput(
index=0, 
text=' ReserveRestaurant Restaurants^city->F', 
token_ids=[32003, 32010, 32012, 32023, 22249, 9133, 3507, 440, 32024, 32014, 23657, 1549, 28815, 18373, 471, 28765], 
cumulative_logprob=-0.06896844378643863, 
logprobs=None, 
finish_reason=length)
], 
finished=True, metrics=RequestMetrics(arrival_time=516870.185780231, last_token_time=516870.185780231, first_scheduled_time=1710838070.7235024, first_token_time=1710838070.760219, time_in_queue=1710321200.537722, finished_time=1710838070.9955454), lora_request=None)]

From this, the value of output[0].outputs[0].text is ReserveRestaurant Restaurants^city->F. I have manually decoded the token_ids, and the expected text should be: <|begintarget|><|begindsts|><|begindst|><|beginintent|> ReserveRestaurant<|endintent|><|beginbelief|> Restaurants^city->F.

Generated token_ids: [32003, 32010, 32012, 32023, 22249, 9133, 3507, 440, 32024, 32014, 23657, 1549, 28815, 18373, 471, 28765]. Tokens with ids greater than or equal to 32000 (the added tokens) are not properly decoded.
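
For reference, this is roughly how I decoded the ids manually with the Hugging Face tokenizer (a minimal sketch, using the same "<model_and_tokenizer_id>" placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<model_and_tokenizer_id>")

token_ids = [32003, 32010, 32012, 32023, 22249, 9133, 3507, 440,
             32024, 32014, 23657, 1549, 28815, 18373, 471, 28765]

# transformers' decode() keeps the added tokens unless skip_special_tokens=True is passed
print(tokenizer.decode(token_ids))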

I have also tried with python3 -m vllm.entrypoints.api_server --model "<model_and_tokenizer_id>" --tokenizer "<model_and_tokenizer_id>" and got the same behavior.

What can I do so that the generated text is decoded directly, without my having to decode it manually?

oscar-martin commented 6 months ago

I have found it!

Just passing a SamplingParams with skip_special_tokens=False (it defaults to True) to the llm.generate call made it work.

Code snippet:

from vllm import LLM, SamplingParams

# ... (model defined as above)
llm = LLM(model=model, tokenizer=model)  # Name or path of your model

p = SamplingParams(skip_special_tokens=False)
output = llm.generate("<|begincontext|><|user|>I'm hungry. Find places to eat please.<|system|>Sure thing. Which city would you like to eat in?<|user|>Let's go with Foster City please.<|system|>Sure. What kind of food are you hungry for?<|user|>Spicy Indian sound really good.<|system|>One moment. I found a great restaurant called Pastries N Chaat in Foster City.<|user|>Give me other suggestions as well<|system|>How about, Tabla Indian Restaurant in Foster City?<|user|>Can you find out if they are average priced?<|system|>sure. The price range would be inexpensive.<|user|>Perfect. That works<|system|>Should I reserve for you?<|beginlastuserutterance|>Yes, go ahead and do that.<|endlastuserutterance|><|endcontext|>", p)

print(output)
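
For completeness: the same flag should also apply to the api_server route mentioned above, since (as far as I can tell from the vLLM 0.3.3 demo server) the /generate endpoint forwards extra JSON fields into SamplingParams. An untested sketch:

import requests

# untested sketch: extra JSON fields should be passed through to SamplingParams
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "<|begincontext|><|user|>I'm hungry. ...",  # same prompt as above, truncated here
        "skip_special_tokens": False,
    },
)
print(response.json())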