Closed: GeauxEric closed this issue 6 months ago
🚀 The feature, motivation and pitch

Currently, the generate method supports inference based on prompt_token_ids:

def generate(
    self,
    prompts: Optional[Union[str, List[str]]] = None,
    sampling_params: Optional[SamplingParams] = None,
    prompt_token_ids: Optional[List[List[int]]] = None,
    use_tqdm: bool = True,
    lora_request: Optional[LoRARequest] = None,
) -> List[RequestOutput]:

That means the tokenizer is optional to the LLM engine. However, initializing an LLM engine always calls _init_tokenizer, which effectively makes the tokenizer required: the engine cannot be initialized without a valid tokenizer argument.

In our application, we would love to use the LLM's powerful engine for inference, but we want to keep the tokenizer as a separate service.
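For concreteness, here is a minimal sketch of the token-ID-only call path described above; the model name and token IDs are placeholders, and the LLM constructor currently still loads a tokenizer internally:

from vllm import LLM, SamplingParams

# Constructing the engine still initializes a tokenizer under the hood,
# which is exactly what this issue would like to make optional.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.0, max_tokens=16)

# No prompt strings are passed; the request is driven purely by token IDs,
# so the input side needs no tokenization inside the engine.
outputs = llm.generate(
    prompt_token_ids=[[2, 100, 200, 300]],  # placeholder token IDs
    sampling_params=sampling_params,
)

for out in outputs:
    # Generated token IDs are available directly on the output; the decoded
    # text is produced by the engine's own tokenizer during detokenization.
    print(out.outputs[0].token_ids)
    print(out.outputs[0].text)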
Alternatives

No response

Additional context

No response

I think the main blocker is that the tokenizer is also used during decode. See #3635.
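To make the request concrete, a rough sketch of the desired separation follows. The skip_tokenizer_init flag is hypothetical (a placeholder name for the requested knob, not an argument taken from this issue), and a Hugging Face tokenizer stands in for the external tokenizer service:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Hypothetical flag: the engine would skip loading its own tokenizer entirely.
llm = LLM(model="facebook/opt-125m", skip_tokenizer_init=True)

# The separate tokenizer service handles the input side...
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
prompt_ids = tokenizer("Hello, world!")["input_ids"]

outputs = llm.generate(
    prompt_token_ids=[prompt_ids],
    sampling_params=SamplingParams(max_tokens=16),
)

# ...and the output side as well, since (per the comment above) the engine
# would otherwise need its own tokenizer to produce text during decode.
generated_text = tokenizer.decode(list(outputs[0].outputs[0].token_ids))
print(generated_text)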