noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Deploying a vLLM server with LMFE support #65

Closed. jloganolson closed this issue 5 months ago.

jloganolson commented 8 months ago

Are there any examples of using lm-format-enforcer as part of a backend (akin to outlines)?

noamgat commented 8 months ago

This is a great feature request! It is possible and can probably use about 95% of the code that outlines used to achieve this. If someone wants to go ahead and create a PR I'll happily review it. I'm leaving it open here as a feature request to see how many votes it receives.

br3no commented 7 months ago

@noamgat I'm working on this right now and I have a question.

I have extended the vLLM server API as in the outlines implementation, allowing me to issue e.g. the following request:

import requests

response = requests.post("http://host:port/generate", json={
    "prompt": "The best language for type-safe systems programming is ",
    "regex": r"(Python|Java|C|C\+\+|C#|JavaScript|PHP|Swift|Go|Ruby|TypeScript|Kotlin|Rust)",
    "max_tokens": 10,
})

When this call arrives at the vLLM server, I need to build the corresponding LogitsProcessor, e.g. like this:

build_vllm_logits_processor(llm.tokenizer.tokenizer, RegexParser(regex_string))

This line is taking over 35s for the example request above. Is this normal/expected? Is there a way to cache at least the ...TokenizerData?

br3no commented 7 months ago

The same thing happens with the JsonSchemaParser.

noamgat commented 7 months ago

Yes, it's possible to cache the tokenizer data. The vLLM integration has a build_vllm_token_enforcer_tokenizer_data() function, and its result can be passed as the first parameter to build_vllm_logits_processor():

def build_vllm_logits_processor(llm: Union[vllm.LLM, PreTrainedTokenizerBase, TokenEnforcerTokenizerData], 
                                character_level_parser: CharacterLevelParser, 
                                analyze: bool=False) -> VLLMLogitsProcessor:

It would make perfect sense to use this option in server mode, since in most (all?) cases you would only be using a single model, and therefore a single tokenizer.
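
For reference, here is a minimal sketch of that caching pattern in a server setting. The module paths and the RegexParser import are assumed from the lm-format-enforcer vLLM integration; double-check them against the current README:

import vllm
from lmformatenforcer import RegexParser
from lmformatenforcer.integrations.vllm import (
    build_vllm_logits_processor,
    build_vllm_token_enforcer_tokenizer_data,
)

llm = vllm.LLM(model="llama-model")  # the single model the server is serving

# Build the tokenizer data once at startup; this is the expensive step.
tokenizer_data = build_vllm_token_enforcer_tokenizer_data(llm)

def logits_processor_for(regex_string: str):
    # Per request, only the parser is rebuilt; the cached tokenizer data
    # is passed as the first argument instead of the LLM / tokenizer.
    return build_vllm_logits_processor(tokenizer_data, RegexParser(regex_string))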

2533245542 commented 7 months ago

When adding vLLM support, could you also cover the OpenAI API scenario?

For example, a typical use case I use for vllm is starting a backend with

python -m vllm.entrypoints.openai.api_server --model llama-model

Then in a piece of python code, I query the backend with

import openai

openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"
response = openai.ChatCompletion.create(
    model="llama-model",
    messages=messages,  # the chat messages built elsewhere
)

Can this scenario be integrated into lm-format-enforcer?

I am thinking the output format could maybe be specified when setting up the backend, e.g.

python -m vllm.entrypoints.openai.api_server --model llama-model --template generation_template.json

Of course, I am not an expert and cannot implement this myself, so I'm just offering suggestions here.

noamgat commented 7 months ago

@br3no , the latest versions greatly improved the tokenizer cache build time. Can you check if there is a significant improvement on your side?

noamgat commented 5 months ago

https://github.com/vllm-project/vllm/pull/3868 - If this feature interests you, please show your support on this vLLM PR, which adds lm-format-enforcer support to vLLM.

noamgat commented 5 months ago

vLLM 0.4.1 was released with support for LMFE; see this project's README for instructions.
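
For the OpenAI API scenario discussed above, here is a minimal sketch of how a request could look, assuming the guided decoding parameters (guided_regex, guided_decoding_backend) that vLLM >= 0.4.1 exposes through its OpenAI-compatible server; consult the vLLM and LMFE READMEs for the exact parameter names:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="llama-model",  # the model the server was started with
    messages=[{"role": "user", "content": "The best language for type-safe systems programming is"}],
    # vLLM-specific fields are passed through to the request body via extra_body.
    extra_body={
        "guided_regex": r"(Python|Java|C|C\+\+|C#|JavaScript|PHP|Swift|Go|Ruby|TypeScript|Kotlin|Rust)",
        "guided_decoding_backend": "lm-format-enforcer",
    },
)
print(response.choices[0].message.content)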