vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: make _init_tokenizer optional and support initiate LLMEngine without tokenizer #3647

Closed: GeauxEric closed this issue 6 months ago

GeauxEric commented 7 months ago

🚀 The feature, motivation and pitch

Currently the generate method supports inference based on prompt_token_ids:

    def generate(
        self,
        prompts: Optional[Union[str, List[str]]] = None,
        sampling_params: Optional[SamplingParams] = None,
        prompt_token_ids: Optional[List[List[int]]] = None,
        use_tqdm: bool = True,
        lora_request: Optional[LoRARequest] = None,
    ) -> List[RequestOutput]:

That means the tokenizer should, in principle, be optional for the LLM engine.
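
For instance, under the signature above a request can be submitted entirely as pre-tokenized IDs. This is a minimal sketch; the model name and token IDs are illustrative, and the engine as currently written still loads a tokenizer at construction time:

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # illustrative model choice
    params = SamplingParams(temperature=0.0, max_tokens=16)

    # Token IDs produced by an external tokenizer service (example values only).
    prompt_token_ids = [[2, 100, 524, 10], [2, 318, 16]]

    # No string prompts are passed; generation runs purely from the IDs.
    outputs = llm.generate(prompts=None,
                           sampling_params=params,
                           prompt_token_ids=prompt_token_ids)
    for out in outputs:
        print(out.outputs[0].token_ids)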

However, initializing an LLM engine always calls _init_tokenizer, which effectively makes the tokenizer required: the engine cannot be constructed without a valid tokenizer.

In our application, we would love to use vLLM's engine for inference but keep tokenization as a separate service.
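
Concretely, the ask could look like a constructor flag that skips tokenizer setup. The sketch below is hypothetical and only illustrates the shape of the request; the flag name, the toy class, and the _init_tokenizer stand-in are not vLLM's actual code:

    from typing import Optional


    class MiniEngine:
        """Toy stand-in for LLMEngine, used only to illustrate the request."""

        def __init__(self, model: str, skip_tokenizer_init: bool = False) -> None:
            self.model = model
            self.tokenizer: Optional[object] = None
            if not skip_tokenizer_init:
                # Today vLLM always runs this step; the request is to let
                # callers opt out and feed token IDs from an external service.
                self.tokenizer = self._init_tokenizer()

        def _init_tokenizer(self) -> object:
            # Stand-in for loading a real tokenizer (e.g. from Hugging Face).
            return object()


    engine = MiniEngine(model="some-model", skip_tokenizer_init=True)
    assert engine.tokenizer is None  # engine comes up without a tokenizer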

Alternatives

No response

Additional context

No response

simon-mo commented 7 months ago

I think the main blocker is that the tokenizer is also used during decode. See #3635
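
For context, the output text that the engine returns is produced by detokenizing the generated token IDs, so skipping the tokenizer entirely would mean callers only ever receive raw IDs. A rough illustration (not vLLM internals; the model choice and IDs are examples only):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model

    generated_token_ids = [15496, 11, 995]  # example IDs only
    text = tokenizer.decode(generated_token_ids, skip_special_tokens=True)
    print(text)  # without a tokenizer, a caller could only see the raw IDs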