hafezmg48 opened this issue 1 month ago (status: Open)
I found a similar discussion regarding the issue of repeated text generation when using Vllm as the LLM engine. You can check out the discussion titled "HuggingfaceLLM Meta Llama3.1 8B generation doesn't stop" which might provide some insights [1].
To address the issue of repeated text generation with Vllm, you can use the messages_to_prompt function to format your input messages properly. This function helps structure the input to the model, which can reduce repetition. Here's an example of how you can implement it:
from llama_index.llms.vllm import Vllm

def messages_to_prompt(messages):
    prompt = "\n".join([str(x) for x in messages])
    return f"<s>[INST] {prompt} [/INST] </s>\n"

llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tensor_parallel_size=4,
    max_new_tokens=256,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
    messages_to_prompt=messages_to_prompt,
)

# Example usage
messages = ["What is a black hole?", "Explain it in simple terms."]
prompt = messages_to_prompt(messages)
response = llm.complete(prompt)
print(response.text)
Additionally, you can adjust the presence_penalty and frequency_penalty parameters to penalize repeated tokens, which might help reduce repetition:
llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tensor_parallel_size=4,
    max_new_tokens=256,
    presence_penalty=0.5,   # Adjust this value as needed
    frequency_penalty=0.5,  # Adjust this value as needed
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
    messages_to_prompt=messages_to_prompt,
)
These penalties can be fine-tuned based on your specific use case to achieve the desired output quality [2].
@dosu I believe your response is incorrect. Passing the messages_to_prompt argument does not automatically switch the method from completion to chat, and by setting breakpoints I confirmed that even after supplying this argument, messages_to_prompt is never called.
Also, the example solution that manually feeds the user question into messages_to_prompt is not appropriate: llama-index has its own way of transforming the query internally, which is why it accepts messages_to_prompt as an argument of the LLM engine in the first place. I need a human to guide me on this matter.
This is a tough nut to crack, could really use your help on this, @logan-markewich.
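For anyone who wants to reproduce the check without a debugger: wrapping both hooks with logging shows whether they ever fire while the query engine runs. This is only a sketch; the model id and the pass-through templates below are placeholders, not code from my project or from the integration.

import logging

from llama_index.llms.vllm import Vllm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prompt_hooks")


def messages_to_prompt(messages):
    logger.info("messages_to_prompt called with %d message(s)", len(messages))
    return "\n".join(str(m) for m in messages)  # placeholder, not a real template


def completion_to_prompt(completion):
    logger.info("completion_to_prompt called")
    return completion  # placeholder, not a real template


llm = Vllm(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model id
    max_new_tokens=256,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

# If neither log line shows up while the query engine runs, the hooks are
# being ignored, which matches what the breakpoints showed.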
Yea, vllm is not using completion_to_prompt
Feel free to make a PR; the source code is here: https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-vllm/llama_index/llms/vllm/base.py
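In the meantime, one possible user-side workaround is to subclass Vllm and apply the hook yourself, so that query engines, which call complete() internally, still receive a formatted prompt. This is a rough sketch of the idea, assuming the standard llama-index complete() signature, not the upstream fix:

from typing import Any

from llama_index.core.llms import CompletionResponse
from llama_index.llms.vllm import Vllm


class PatchedVllm(Vllm):
    """Hypothetical subclass that forces completion_to_prompt to run."""

    def complete(
        self, prompt: str, formatted: bool = False, **kwargs: Any
    ) -> CompletionResponse:
        if not formatted:
            # Apply the user-supplied template before handing the prompt to vLLM.
            prompt = self.completion_to_prompt(prompt)
        return super().complete(prompt, **kwargs)

The real fix would presumably live in base.py itself; the subclass just shows the shape of applying the hook before generation.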
Question Validation
Question
I am creating a simple RAG that answers some questions. I used Huggingface for the embedding, just like in the examples, but I am using Vllm as the LLM engine. The problem is that when I prompt a question, the response is a nonstop repetition of the same text, when it should simply have generated an eos_token and stopped.
The embedding model is bge_large_en using HuggingfaceEmbedding(). The LLM model is llama3.1-8B-instruct using Vllm().
I have used the HuggingFaceLLM engine and it works fine, but when using the Vllm engine it has this issue. Please find the part of the code defining the LLM engine below:
I comment out one engine or the other to test and compare them. As I said, the HuggingfaceLLM works fine, but vllm generates output like the following example:
It keeps repeating until it reaches the end of token generation.
Note that in HuggingFaceLLM, I have activated is_chat, which makes it use the messages_to_prompt function. But in this Vllm I could not find any argument that lets me activate messages_to_prompt, so it just uses the completion_to_prompt function. I think this LLM using completion instead of chat is part of the problem. I tried to write the messages_to_prompt as faithfully as possible to the prompt formatting (roughly the format sketched at the end of this post), but I am not sure whether my approach is generally correct. I would appreciate any guidance. Thanks. Here is also the code that creates the query engine and does the prompting:
P.S. Yes, I tried reviewing the documentation and online forums, which led me to the understanding that this issue might be caused by chat not being used, but I am not sure how correct that idea is. Furthermore, I am not sure how to force vllm to use chat, and I am also not sure whether the prompt formatting would work properly in cooperation with llama-index's internal query prompts. Thanks again for any help.
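For reference, the kind of formatting I am aiming for, based on the published Llama 3.1 chat template, is roughly the sketch below (simplified, and not the exact code from my project):

def messages_to_prompt(messages):
    # Rough sketch following the Llama 3.1 chat template; assumes
    # ChatMessage-like objects with .role and .content attributes.
    prompt = "<|begin_of_text|>"
    for message in messages:
        role = getattr(message, "role", "user")
        role = getattr(role, "value", role)  # MessageRole enum -> "user"
        content = getattr(message, "content", str(message))
        prompt += (
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    # Cue the assistant turn so the model answers and stops at its own <|eot_id|>.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt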