vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: MistralTokenizer object has no attribute 'get_vocab' #8358

Open maxDavid40 opened 1 week ago

maxDavid40 commented 1 week ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35

Python version: 3.11.7 (main, Dec 8 2023, 18:56:58) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.21-150500.55.49-default-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 535.104.12
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7F72 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 0
BogoMIPS: 6387.87
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-11,48-59
NUMA node1 CPU(s): 12-23,60-71
NUMA node2 CPU(s): 24-35,72-83
NUMA node3 CPU(s): 36-47,84-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    NODE    36-47,84-95     3               N/A
NIC0    NODE     X      PIX     PIX     PIX
NIC1    NODE    PIX      X      PIX     PIX
NIC2    NODE    PIX     PIX      X      PIX
NIC3    NODE    PIX     PIX     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
```

🐛 Describe the bug

I try to use `guided_json` or `response_format` in a request to the vLLM server with Mistral-Nemo-Instruct-2407 and `--tokenizer-mode mistral`, but I get an AttributeError: `'MistralTokenizer' object has no attribute 'get_vocab'`.

Launch the server:

vllm serve /path/to/model/Mistral-Nemo-Instruct-2407 --tokenizer-mode mistral --max-model-len 4096 --host 127.0.0.1 --port 6379

Request in Python with `response_format`:

import requests

url = "http://127.0.0.1:6379"
endpoints = "/v1/completions"

data = {
  "prompt": "Describe Ada Lovelace. Your answer should be in form of a json with the keywords name, age, is_alive, height_in_meters, names_of_children.",
  "model": "Mistral-Nemo-Instruct-2407",
  "temperature": 0.0,
  "repetition_penalty": 1,
  "top_p": 0.8,
  "max_tokens": 150,
  "response_format": {
    "type": "json_object"
  }
}

response = requests.post(url+endpoints, json=data)
response.json()
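
If the call fails, it helps to print the HTTP status and raw body instead of only calling response.json(). A small sketch, reusing `url`, `endpoints` and `data` from the snippet above (the detailed traceback still only shows up in the vLLM server logs):

```python
# Sketch: surface the HTTP status and raw body when the request fails.
response = requests.post(url + endpoints, json=data)
print(response.status_code)
if response.ok:
    print(response.json())
else:
    print(response.text)  # whatever error body the server returned
```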

The same happens with `guided_json`:

import requests

url = "http://127.0.0.1:6379"
endpoints = "/v1/completions"

data = {
  "prompt": "Describe Ada Lovelace.",
  "model": "Mistral-Nemo-Instruct-2407",
  "temperature": 0.0,
  "repetition_penalty": 1,
  "top_p": 0.8,
  "max_tokens": 150,
  "guided_json": {
      "type": "object",
      "properties": {
          "name": {
              "type": "string"
          },
          "age": {
              "type": "integer"
          },
          "is_alive": {
              "type": "boolean"
          },
          "height_in_meters": {
              "type": "number"
          },
          "names_of_children": {
              "type": "array",
              "items": {
                  "type": "string"
              }
          }
      },
      "required": [
          "name",
          "age",
          "is_alive",
          "height_in_meters",
          "names_of_children"
      ]
  }
}

response = requests.post(url+endpoints, json=data)
response.json()

I get the same AttributeError:

Traceback (most recent call last):
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/fastapi/routing.py", line 297, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/fastapi/routing.py", line 210, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 302, in create_completion
    generator = await openai_serving_completion.create_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 102, in create_completion
    await self._guided_decode_logits_processor(request, tokenizer))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 161, in _guided_decode_logits_processor
    return await get_guided_decoding_logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/__init__.py", line 21, in get_guided_decoding_logits_processor
    return await get_outlines_guided_decoding_logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_decoding.py", line 78, in get_outlines_guided_decoding_logits_processor
    return await loop.run_in_executor(global_thread_pool,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_decoding.py", line 165, in _get_logits_processor
    return CFGLogitsProcessor(guide, tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 165, in __init__
    super().__init__(CFGLogitsProcessor._get_guide(cfg, tokenizer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/outlines/caching.py", line 122, in wrapper
    result = cached_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 151, in _get_guide
    tokenizer = _adapt_tokenizer(tokenizer)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 186, in _adapt_tokenizer
    tokenizer.vocabulary = tokenizer.get_vocab()
                           ^^^^^^^^^^^^^^^^^^^
AttributeError: 'MistralTokenizer' object has no attribute 'get_vocab'
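
For context, the failure happens in vLLM's outlines adapter (`_adapt_tokenizer`), which assumes a Hugging Face-style tokenizer exposing `get_vocab()`; the Mistral tokenizer wrapper does not provide it here. A rough way to reproduce this without the server (a sketch; the helper name and keyword are as I understand vLLM 0.6.x and may differ in other versions):

```python
# Rough offline reproduction of the missing attribute (no server needed).
from vllm.transformers_utils.tokenizer import get_tokenizer

tok = get_tokenizer("mistralai/Mistral-Nemo-Instruct-2407", tokenizer_mode="mistral")
print(type(tok).__name__)          # MistralTokenizer
print(hasattr(tok, "get_vocab"))   # False on 0.6.0 -> outlines' _adapt_tokenizer fails
```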

Update: with `--guided-decoding-backend lm-format-enforcer`, I get a TypeError instead:

Traceback (most recent call last):
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/fastapi/routing.py", line 297, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/fastapi/routing.py", line 210, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 302, in create_completion
    generator = await openai_serving_completion.create_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 102, in create_completion
    await self._guided_decode_logits_processor(request, tokenizer))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 161, in _guided_decode_logits_processor
    return await get_guided_decoding_logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/__init__.py", line 26, in get_guided_decoding_logits_processor
    return await get_lm_format_enforcer_guided_decoding_logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 30, in get_lm_format_enforcer_guided_decoding_logits_processor
    tokenizer_data = _cached_build_vllm_token_enforcer_tokenizer_data(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 121, in _cached_build_vllm_token_enforcer_tokenizer_data
    return build_vllm_token_enforcer_tokenizer_data(tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/lmformatenforcer/integrations/vllm.py", line 40, in build_vllm_token_enforcer_tokenizer_data
    return build_token_enforcer_tokenizer_data(tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/lmformatenforcer/integrations/transformers.py", line 77, in build_token_enforcer_tokenizer_data
    regular_tokens = _build_regular_tokens_list(tokenizer)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path/to/venv/venv_happyvllm/lib/python3.11/site-packages/lmformatenforcer/integrations/transformers.py", line 57, in _build_regular_tokens_list
    token_0 = tokenizer.encode("0")[-1]
              ^^^^^^^^^^^^^^^^^^^^^
TypeError: Tekkenizer.encode() missing 2 required positional arguments: 'bos' and 'eos'
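
The cause looks similar: lm-format-enforcer calls the Hugging Face-style `tokenizer.encode("0")`, while mistral_common's Tekkenizer requires explicit `bos`/`eos` arguments. Roughly, an HF-style adapter over the raw tokenizer would need something like the following (a sketch only; the class and attribute names are illustrative, not a proposed fix):

```python
from typing import List

# Sketch: bridge the signature mismatch between the HF interface encode(text) -> List[int]
# and mistral_common's Tekkenizer.encode(text, bos, eos).
class HFStyleEncodeAdapter:
    def __init__(self, tekkenizer):
        self._tok = tekkenizer  # the raw mistral_common tokenizer

    def encode(self, text: str) -> List[int]:
        # lm-format-enforcer only needs plain token ids, so add no BOS/EOS here.
        return self._tok.encode(text, bos=False, eos=False)
```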

DarkLight1337 commented 1 week ago

Can you check whether #8364 solves this issue? If it still persists: @patrickvonplaten, is the Mistral tokenizer intended to work with guided decoding?

patrickvonplaten commented 1 week ago

At the moment we do not have support for guided decoding. I can take a look though to see if it's easy to implement. I think it should be simple to add a `get_vocab()` function to the MistralTokenizer.
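
For reference, a minimal sketch of what such a method could look like on the wrapper. The `self.tokenizer` attribute and the underlying `vocab()` call are assumptions about the mistral_common tokenizers, not the actual vLLM internals:

```python
from typing import Dict

# Minimal sketch, not the actual vLLM change: assumes the wrapper keeps the raw
# mistral_common tokenizer as self.tokenizer and that it exposes vocab() -> List[str].
def get_vocab(self) -> Dict[str, int]:
    # Mirror the Hugging Face interface that outlines' _adapt_tokenizer expects:
    # a mapping from token string to token id.
    return {token: token_id for token_id, token in enumerate(self.tokenizer.vocab())}
```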

DarkLight1337 commented 1 week ago

> At the moment we do not have support for guided decoding. I can take a look though to see if it's easy to implement. I think it should be simple to add a `get_vocab()` function to the MistralTokenizer.

get_vocab() was already added in the linked PR. So we just have to worry about guided decoding.

maxDavid40 commented 6 days ago

With the fix, I now get this error (same with `response_format`, `guided_json`, or `guided_choice`): RuntimeError('Cannot convert token ` �` (130971) to bytes: �') in vllm.model_executor.guided_decoding.outlines_logits_processors.BaseLogitsProcessor.

I was forced to run it under a debugger because the API only returns a 500 error with no extra information.
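
My reading of the failure mode (an assumption, not verified against the outlines internals): Tekken is a byte-level tokenizer, so a single token need not be valid UTF-8 on its own; decoding it yields the replacement character U+FFFD, which outlines' vocabulary conversion rejects with "Cannot convert token ... to bytes". A tiny illustration of that condition:

```python
# Illustration only: decoding an incomplete UTF-8 sequence produces U+FFFD,
# the kind of token string the outlines vocabulary conversion refuses.
token_bytes = b"\xe9"  # e.g. a lone lead byte of a multi-byte UTF-8 character
token_str = token_bytes.decode("utf-8", errors="replace")
print(repr(token_str))        # '\ufffd'
print("\ufffd" in token_str)  # True -> triggers the RuntimeError above
```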

patrickvonplaten commented 3 days ago

Also linking this issue here: https://github.com/vllm-project/vllm/issues/8429#issuecomment-2353336799