noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License
1.42k stars 65 forks

When using AsyncLLM engine, llm.get_tokenizer is not found #27

Closed jpeig closed 5 months ago

jpeig commented 10 months ago

  line 31, in build_vllm_logits_processor
    regular_tokens = build_regular_tokens_list(tokenizer)
AttributeError: 'AsyncLLMEngine' object has no attribute 'get_tokenizer'

Instead of passing the entire LLM object, I would suggest just passing the tokenizer. This ensures compatibility with the different variants of the vLLM engine.

So:

build_vllm_logits_processor(tokenizer, parser)

noamgat commented 10 months ago

Thanks for the report! Indeed, build_vllm_logits_processor() expects a vllm.LLM as its parameter. In the async use case, how would you pass the tokenizer?

jpeig commented 10 months ago

vLLM has a get_tokenizer() function. It's used in vLLM's api_server.py. @noamgat

sanixa commented 10 months ago

Hi @jpeig,

Have you been able to use AsyncLLMEngine with lm-format-enforcer successfully? I have modified the code from FastChat and the build_vllm_logits_processor function, but I get a weird result.

The model worker keeps generating continuously and never returns anything, and there are no error messages. Sample output:

INFO 12-05 11:51:12 llm_engine.py:624] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 12-05 11:51:18 llm_engine.py:624] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 12-05 11:51:23 llm_engine.py:624] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%

Here is my full code

jpeig commented 10 months ago

Yes - got it to work. Just pass the tokenizer to the following modified function:

def build_vllm_logits_processor(tokenizer, character_level_parser: CharacterLevelParser, analyze: bool = False) -> VLLMLogitsProcessor:
    """Build the logits processor that vLLM will use to filter the tokens generated by the model. The result
    can be passed in the logits_processors list of the SamplingParams that is sent to vLLM's generate call."""
    regular_tokens = build_regular_tokens_list(tokenizer)
    token_enforcer = TokenEnforcer(regular_tokens, character_level_parser, tokenizer.decode, tokenizer.eos_token_id)
    return VLLMLogitsProcessor(token_enforcer, analyze)

To get the tokenizer in your code:

import asyncio

from vllm.transformers_utils.tokenizer import get_tokenizer
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# `args` is the argparse namespace produced by your server's CLI parser
engine_args = AsyncEngineArgs.from_cli_args(args)
engine = AsyncLLMEngine.from_engine_args(engine_args)
engine_model_config = asyncio.run(engine.get_model_config())

tokenizer = get_tokenizer(
    engine_model_config.tokenizer,
    tokenizer_mode=engine_model_config.tokenizer_mode,
    trust_remote_code=engine_model_config.trust_remote_code)
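
And to wire the processor into an actual request, something like the following (untested sketch; it assumes vLLM's SamplingParams accepts a logits_processors list, and my_schema, the prompt handling and the generation parameters are placeholders):

import uuid

from lmformatenforcer import JsonSchemaParser
from vllm import SamplingParams

# my_schema is a placeholder for your JSON schema dict
parser = JsonSchemaParser(my_schema)
logits_processor = build_vllm_logits_processor(tokenizer, parser)

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=256,
    logits_processors=[logits_processor],  # applied by vLLM when scoring each new token
)

async def generate_enforced(prompt: str) -> str:
    request_id = str(uuid.uuid4())
    final_output = None
    # AsyncLLMEngine.generate is an async generator that yields RequestOutput objects
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return final_output.outputs[0].text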

sanixa commented 10 months ago

Thanks @jpeig, it works when I load the model on a single GPU.

With two GPUs, however, the model output does not follow the regex and keeps generating tokens until the upper limit.

noamgat commented 10 months ago

I modified the interface to accept either the LLM or the tokenizer, so you won't need a custom build_vllm_logits_processor anymore. Regarding dual GPU - I can't reproduce it on my side. Can you put a breakpoint inside VLLMLogitsProcessor.__call__() and check whether it even gets called in that case?
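
If attaching a debugger to the worker processes is awkward, a quick alternative (just a diagnostic sketch, not part of LMFE) is to wrap the processor in a small callable that prints every invocation:

class LoggingLogitsProcessor:
    """Wraps a vLLM logits processor and prints each time it is invoked."""

    def __init__(self, inner):
        self.inner = inner
        self.num_calls = 0

    def __call__(self, token_ids, logits):
        self.num_calls += 1
        print(f"logits processor call #{self.num_calls}, {len(token_ids)} generated tokens so far")
        return self.inner(token_ids, logits)

# Wrap the processor before putting it into SamplingParams:
# logits_processor = LoggingLogitsProcessor(build_vllm_logits_processor(tokenizer, parser))

If nothing is printed in the dual-GPU case, the processor is not reaching the workers at all.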

sanixa commented 10 months ago

Sure, the following test is based on lm-format-enforcer==0.7.2

Breakpoint inside VLLMLogitsProcessor.__call__(): it does actually break inside VLLMLogitsProcessor.__call__(). [Screenshot 2023-12-07 095609]

Dual GPU problem - dual case

Initialization:
INFO 12-07 09:57:29 llm_engine.py:72] Initializing an LLM engine with config: model='/home/ccoe/ooba/text-generation-webui-1127/models/TheBloke_Mistral-7B-Instruct-v0.1-AWQ/', tokenizer='/home/ccoe/ooba/text-generation-webui-1127/models/TheBloke_Mistral-7B-Instruct-v0.1-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, seed=0)

Request:
INFO 12-07 09:58:52 async_llm_engine.py:370] Received request 9423bc0ceec34735a2889ddc60800684: prompt: "[INST] Please give me information about Michael Jordan. You MUST answer using the following json schema: [/INST]{'title': 'AnswerFormat', 'type': 'object', 'properties': {'first_name': {'title': 'First Name', 'type': 'string'}, 'last_name': {'title': 'Last Name', 'type': 'string'}, 'year_of_birth': {'title': 'Year Of Birth', 'type': 'integer'}, 'num_seasons_in_nba': {'title': 'Num Seasons In Nba', 'type': 'integer'}}, 'required': ['first_name', 'last_name', 'year_of_birth', 'num_seasons_in_nba']}", sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.01, top_p=0.9, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], ignore_eos=False, max_tokens=64, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.

Output: [Screenshot 2023-12-07 100013]

Dual GPU problem - single case

Initialization:
INFO 12-07 10:02:01 llm_engine.py:72] Initializing an LLM engine with config: model='/home/ccoe/ooba/text-generation-webui-1127/models/TheBloke_Mistral-7B-Instruct-v0.1-AWQ/', tokenizer='/home/ccoe/ooba/text-generation-webui-1127/models/TheBloke_Mistral-7B-Instruct-v0.1-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)

Request:
INFO 12-07 10:02:16 async_llm_engine.py:370] Received request 990e6c7b83ad46c8bf413426d39885e3: prompt: "[INST] Please give me information about Michael Jordan. You MUST answer using the following json schema: [/INST]{'title': 'AnswerFormat', 'type': 'object', 'properties': {'first_name': {'title': 'First Name', 'type': 'string'}, 'last_name': {'title': 'Last Name', 'type': 'string'}, 'year_of_birth': {'title': 'Year Of Birth', 'type': 'integer'}, 'num_seasons_in_nba': {'title': 'Num Seasons In Nba', 'type': 'integer'}}, 'required': ['first_name', 'last_name', 'year_of_birth', 'num_seasons_in_nba']}", sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.01, top_p=0.9, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], ignore_eos=False, max_tokens=64, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.

Output: [Screenshot 2023-12-07 101303]

noamgat commented 10 months ago

This is probably due to the SamplingParams object being serialized/deserialized during inter-process communication (via Ray, presumably), and the logits processor, being a stateful function, does not get serialized correctly.
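
As a toy illustration of the problem (generic Python, not LMFE code): once a stateful callable is pickled and shipped to a worker, the worker operates on an independent copy, so the state the driver and the workers accumulate can silently diverge.

import pickle

class StatefulProcessor:
    """Stand-in for a logits processor that tracks parsing state across calls."""

    def __init__(self):
        self.seen_tokens = []

    def __call__(self, token_id):
        self.seen_tokens.append(token_id)
        return len(self.seen_tokens)

original = StatefulProcessor()
original(1)                                         # driver-side copy has seen token 1

worker_copy = pickle.loads(pickle.dumps(original))  # roughly what happens per worker
worker_copy(2)

print(original.seen_tokens)     # [1]    - never sees token 2
print(worker_copy.seen_tokens)  # [1, 2] - diverges independently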

In order to solve it, we would need to be able to pass "logits processor instructions" in the network request to the vLLM server. I proposed something similar for huggingface's text-generation-inference in https://github.com/huggingface/text-generation-inference/pull/1274, but haven't gotten a response from the team yet, so I didn't proceed with it.

If the vLLM team were interested in adopting a similar solution, it would allow LMFE to also work with server / multi-GPU deployments.

shubhra2 commented 5 months ago

Any progress on this? Or any workarounds to make it work?

noamgat commented 5 months ago

v0.9.4 was just released, which should address this issue. Can you please try it?

noamgat commented 5 months ago

This should have been resolved in v0.9.4; please reopen if the issue persists.