noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

VLLM Error: AttributeError: Can't pickle local object 'build_token_enforcer_tokenizer_data.<locals>.decode_fn' #124

Closed: accupham closed this issue 1 month ago

accupham commented 1 month ago

This sample code:

from openai import OpenAI
from pydantic import BaseModel
client = OpenAI(
    base_url="http://my-vllm-server.local:8000/v1",
    api_key="1234"
)

content = """
...

Please return your response in JSON format, following this schema:

{
  "originalValue": string,
  "modifiedValue": string,
  "key": string,
  "errorType": string,
  "explanation": string
}

"""

class ResultModel(BaseModel):
    originalValue: str
    modifiedValue: str
    key: str
    errorType: str
    explanation: str

completion = client.chat.completions.create(
  model="model",
  messages=[
    {"role": "user", "content": content}
  ],
  extra_body={
    "guided_json": ResultModel.model_json_schema(),
    "guided_decoding_backend": "lm-format-enforcer"
  }
)
completion

Results in this VLLM error:

INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-29 05:15:24 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-29 05:15:34 logger.py:36] Received request chat-671c30eb19c146ef8e86ec79e82237c2: prompt: '<|begin_of_text|><|start_header_id|>user<|end_header_id|> ... Please return your response in JSON format, following this schema:\n\n```\n{\n  "originalValue": string,\n  "modifiedValue": string,\n  "key": string,\n  "errorType": string,\n  "explanation": string\n}\n````

<snip>

INFO 07-29 05:15:34 async_llm_engine.py:173] Added request chat-671c30eb19c146ef8e86ec79e82237c2.
INFO 07-29 05:15:34 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
Traceback (most recent call last):
  File "/home/k/.asdf/installs/python/3.10.13/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/k/.asdf/installs/python/3.10.13/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'build_token_enforcer_tokenizer_data.<locals>.decode_fn'
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/k/.asdf/installs/python/3.10.13/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/k/.asdf/installs/python/3.10.13/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/k/.asdf/installs/python/3.10.13/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/k/.asdf/installs/python/3.10.13/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'build_token_enforcer_tokenizer_data.<locals>.decode_fn'
AttributeError: Can't pickle local object 'build_token_enforcer_tokenizer_data.<locals>.decode_fn'
/mnt/workspace1/vllm/vllm/distributed/parallel_state.py:425: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:1524.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
noamgat commented 1 month ago

How are you running your model? Specifically - tensor / data / model parallelism? Multiple copies?

accupham commented 1 month ago

With pipeline parallelism; a single copy, I think. Here is the launch command:

python -m vllm.entrypoints.openai.api_server \
--model /mnt/workspace1/model/Meta-Llama-3.1-70B-Instruct-AWQ-INT4/ \
--pipeline-parallel-size 4 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.99 \
--max-model-len 8192 \
--enforce-eager \
--served-model-name model \
--quantization marlin   \
--guided-decoding-backend lm-format-enforcer
accupham commented 1 month ago

Just tried switching --pipeline-parallel-size 4 to --tensor-parallel-size 4, and it worked as expected.

There must be a bug between lm-format-enforcer and pipeline parallelism, which probably hadn't been implemented in vLLM yet when the PR adding lm-format-enforcer support was submitted.
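
For reference, the working invocation is the same command as above with only the parallelism flag swapped:

python -m vllm.entrypoints.openai.api_server \
--model /mnt/workspace1/model/Meta-Llama-3.1-70B-Instruct-AWQ-INT4/ \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.99 \
--max-model-len 8192 \
--enforce-eager \
--served-model-name model \
--quantization marlin \
--guided-decoding-backend lm-format-enforcer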

noamgat commented 1 month ago

I assume that vLLM copies the sampler object to the final process in the pipeline-parallel stack, and that causes it to try to serialize the object between processes, which it currently can't do. I'll look at it. (Generally speaking, tensor parallelism should be used if it works for you; pipeline parallelism is mainly for multi-node setups.)
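
As a minimal illustration of the pickling failure (not the library's actual code, just a stand-in that mirrors the nested-function structure named in the traceback):

import pickle

def build_tokenizer_data():
    # Nested function: pickle stores functions by their import path, and a
    # '<locals>' function has no such path, so serialization fails.
    def decode_fn(tokens):
        return " ".join(str(t) for t in tokens)
    return decode_fn

fn = build_tokenizer_data()
pickle.dumps(fn)
# AttributeError: Can't pickle local object 'build_tokenizer_data.<locals>.decode_fn'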

On Mon, Jul 29, 2024 at 4:24 PM Kevin Pham @.***> wrote:

Just tried switching --pipeline-parallel-size 4 to --tensor-parallel-size 4, and it worked as expected.

There must be a bug with lm-format-enforcer and pipeline parallelism, which probably wasn't implemented yet in VLLM at the time you submitted a PR for lm-format-enforcer support.

— Reply to this email directly, view it on GitHub https://github.com/noamgat/lm-format-enforcer/issues/124#issuecomment-2255942117, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKFA2HGZN4HEHZ2N7HNFETZOY7BVAVCNFSM6AAAAABLTQTNXWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENJVHE2DEMJRG4 . You are receiving this because you commented.Message ID: @.***>

robertgshaw2-neuralmagic commented 1 month ago

I am a core contributor to vLLM. I am currently working on multiprocessing in our OpenAI server, which is going to deliver a 20% performance gain:

https://github.com/vllm-project/vllm/pull/6883/files#diff-190c665c438d34a7190da9a4d9bc1ed24bed8b13ee1b3f20c6da5c8aa52b0f3b

I am currently blocked from using guided decoding alongside this feature because these objects cannot be pickled. The source of the issue is that Python cannot pickle local functions. Would it be possible to use functools.partial instead? Those objects can be pickled.
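
A small sketch of the suggested shape (hypothetical names, not the actual lm-format-enforcer code): a module-level function wrapped in functools.partial pickles cleanly, because pickle can resolve it by its qualified name.

import pickle
from functools import partial

# Module-level function: resolvable by import path, so pickle can handle it.
def decode_tokens(vocab, tokens):
    return " ".join(vocab.get(t, "?") for t in tokens)

# Bind the per-tokenizer state as an argument instead of closing over it.
decode_fn = partial(decode_tokens, {0: "hello", 1: "world"})

data = pickle.dumps(decode_fn)   # works
restored = pickle.loads(data)
print(restored([0, 1]))          # -> "hello world"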

noamgat commented 1 month ago

That sounds like a good plan. I'll try to do it in the next few days.

noamgat commented 1 month ago

I modified the master branch to use functools.partial instead. Can you check whether it works after installing LMFE from the main branch? If so, I will deploy a version with the fix.
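
For anyone wanting to verify before a release is cut, installing straight from the repository should pick up the change (a standard pip-from-git install; adjust the branch if needed):

pip install "git+https://github.com/noamgat/lm-format-enforcer.git"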

noamgat commented 1 month ago

This should be solved in v0.10.6. Closing the issue; please reopen if it persists.