run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: using Vllm as llm engine not working properly (generates repeated text) #16359

Open hafezmg48 opened 1 month ago

hafezmg48 commented 1 month ago

Question

I am creating a simple RAG pipeline that answers questions. I use Hugging Face for the embedding, just like in the examples, but Vllm as the LLM engine. The problem is that when I ask a question, the response is a nonstop repetition of the same text, when the model should simply have generated an eos_token and stopped.

The embedding model is bge-large-en, loaded with HuggingFaceEmbedding(). The LLM is Llama 3.1-8B-Instruct, loaded with Vllm().

I have used the HuggingFaceLLM engine and it works fine, but with the Vllm engine I get this issue. Please find the part of the code that defines the LLM engine below:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import torch
from llama_index.core import Settings
from transformers import AutoTokenizer

# load embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", device="cuda")

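# Llama 3.1 instruct prompt format, shown here for reference: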
'''
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'''
def messages_to_prompt(messages):
    prompt = ""
    system_exist = False
    for message in messages:
        if message.role == 'system':
            prompt += f"<|start_header_id|>system<|end_header_id|>\n\n{message.content}<|eot_id|>"
            system_exist = True
        elif message.role == 'user':
            prompt += f"<|start_header_id|>user<|end_header_id|>\n\n{message.content}<|eot_id|>"
        elif message.role == 'assistant':
            prompt += f"<|start_header_id|>assistant<|end_header_id|>\n\n{message.content}<|eot_id|>"

    # ensure we start with a system prompt, insert blank if needed
    if not system_exist:
        prompt = "<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|>" + prompt

    prompt = "<|begin_of_text|>" + prompt
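    # note: this does not append a trailing <|start_header_id|>assistant<|end_header_id|> generation header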
    return prompt

def completion_to_prompt(completion):
    return f"{completion}"
    # return f"<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{completion}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# tokenizer = AutoTokenizer.from_pretrained(model_path)
# from llama_index.llms.huggingface import HuggingFaceLLM
# Settings.llm = HuggingFaceLLM(
#     model_name=model_path,
#     tokenizer_name=model_path,
#     context_window=4096,
#     max_new_tokens=2048,
#     # model_kwargs={"quantization_config": quantization_config},
#     model_kwargs={"torch_dtype" :torch.bfloat16},
#     generate_kwargs={"do_sample": True, "temperature": 0.5, "top_k": 25, "top_p": 0.9,  "pad_token_id": tokenizer.eos_token_id},
#     messages_to_prompt=messages_to_prompt,
#     completion_to_prompt=completion_to_prompt,
#     # system_prompt="You are a helpful assistant that answers user's questions based on your knowledge",
#     is_chat_model=True,
#     device_map="cuda",
# )

from llama_index.llms.vllm import Vllm
tokenizer = AutoTokenizer.from_pretrained(model_path)
Settings.llm = Vllm(
    model=model_path,
    dtype="float16",
    # tensor_parallel_size=1,
    temperature=0.5,
    top_p=0.9,
    top_k=25,
    max_new_tokens=2048,
    stop=[tokenizer.eos_token],
    # messages_to_prompt=messages_to_prompt,
    # completion_to_prompt=completion_to_prompt,
    system_prompt="You are a helpful assistant",
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.65,
        "max_model_len": 4096,
    },
)

I comment out one engine or the other to test and compare them. As I said, HuggingFaceLLM works fine, but Vllm generates output like the following example:

1
Explanation: The question is asking about the process for determining the offset angle for ZOD in LOS conditions for RMa-AV, and the correct answer is option_1, which is Specular reflection on the building roof. This is because in LOS conditions, the direct path is the dominant path, and the offset angle for ZOD is determined by the specular reflection on the building roof. 

{
  "answer": 1,
  "explanation": "The question is asking about the process for determining the offset angle for ZOD in LOS conditions for RMa-AV, and the correct answer is option_1, which is Specular reflection on the building roof. This is because in LOS conditions, the direct path is the dominant path, and the offset angle for ZOD is determined by the specular reflection on the building roof."
}  {
  "answer": 1,
  "explanation": "The question is asking about the process for determining the offset angle for ZOD in LOS conditions for RMa-AV, and the correct answer is option_1, which is Specular reflection on the building roof. This is because in LOS conditions, the direct path is the dominant path, and the offset angle for ZOD is determined by the specular reflection on the building roof."
}  {
  "answer": 1,
  "explanation": "The question is asking about the process for determining the offset
[... the same JSON block repeats back-to-back, the final repetition cut off mid-sentence ...]

It keeps repeating until it reaches the end of token generation.

Note that with HuggingFaceLLM I set is_chat_model=True, which makes it use the messages_to_prompt function. But with Vllm I could not find any argument that enables messages_to_prompt, so it only uses the completion_to_prompt function. I think the fact that this LLM uses completion instead of chat is part of the problem. I tried to write messages_to_prompt as faithfully as possible to the prompt format, but I am not sure whether my approach is generally correct. I would appreciate any guidance. Thanks.
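A minimal sketch of one possible mitigation, assuming the Vllm wrapper forwards stop strings to vLLM's SamplingParams: list both Llama 3.1 end markers explicitly instead of relying on tokenizer.eos_token alone, since it is not obvious which token that resolves to for this checkpoint.

# Sketch only: stop explicitly on Llama 3.1's end-of-turn and end-of-text
# markers (assumption: the wrapper passes `stop` through to vLLM's
# SamplingParams unchanged).
Settings.llm = Vllm(
    model=model_path,
    dtype="float16",
    temperature=0.5,
    top_p=0.9,
    top_k=25,
    max_new_tokens=2048,
    stop=["<|eot_id|>", "<|end_of_text|>"],
    system_prompt="You are a helpful assistant",
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.65,
        "max_model_len": 4096,
    },
)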

Here is also the part of the code that creates the query engine and runs the prompt:


from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Function to create the prompt
def create_prompt(question_data):
    question = question_data["question"]
    options = [question_data[f"option_{i}"] for i in range(1, 5)]

    options_str = ""
    for indx, option in enumerate(options):
        options_str += f"\noption_{indx+1}: " + option
    prompt = (
        f"Answer the following questions by responding in json format with answer object being only the number of option and explanation object being the explanation of answer.\n"
        f"Given the following question and options, what is the correct answer?\n"
        f"Question: {question}\n"
        f"Options: {options_str}\n"
        f"Provide the answer and an explanation in the json format."
    )

    return prompt

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="compact")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

question_key = f"question_{1}"
prompt = create_prompt(quiz[question_key])
response = query_engine.query(prompt)

P.S. Yes, I have reviewed the documentation and online forums, which led me to believe that this issue might be caused by chat mode not being used, but I am not sure how correct that idea is. Furthermore, I am not sure how to force Vllm to use chat. I am also not sure whether my prompt formatting works properly in combination with llama-index's internal query prompts. Thanks again for any help.
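As a sanity check of the hand-written template above, here is a minimal sketch (plain Hugging Face tokenizer API, independent of llama-index) that prints the prompt produced by the model's own chat template, so the two can be compared, including the trailing assistant header added before generation:

# Sketch only: build a reference prompt with the model's own chat template.
from transformers import AutoTokenizer

ref_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
reference_prompt = ref_tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    tokenize=False,
    add_generation_prompt=True,  # appends the trailing assistant header
)
print(reference_prompt)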

dosubot[bot] commented 1 month ago

I found a similar discussion regarding the issue of repeated text generation when using Vllm as the LLM engine. You can check out the discussion titled "HuggingfaceLLM Meta Llama3.1 8B generation doesn't stop" which might provide some insights [1].

To address the issue of repeated text generation with Vllm, you can use the messages_to_prompt function to format your input messages properly. This function helps in structuring the input to the model, which can reduce repetition. Here's an example of how you can implement it:

from llama_index.llms.vllm import Vllm

def messages_to_prompt(messages):
    prompt = "\n".join([str(x) for x in messages])
    return f"<s>[INST] {prompt} [/INST] </s>\n"

llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tensor_parallel_size=4,
    max_new_tokens=256,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
    messages_to_prompt=messages_to_prompt,
)

# Example usage
messages = ["What is a black hole?", "Explain it in simple terms."]
prompt = messages_to_prompt(messages)
response = llm.complete(prompt)
print(response.text)

Additionally, you can adjust the presence_penalty and frequency_penalty parameters to penalize repeated tokens, which might help in reducing repetition:

llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tensor_parallel_size=4,
    max_new_tokens=256,
    presence_penalty=0.5,  # Adjust this value as needed
    frequency_penalty=0.5,  # Adjust this value as needed
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
    messages_to_prompt=messages_to_prompt,
)

These penalties can be fine-tuned based on your specific use case to achieve the desired output quality [2].

To continue talking to Dosu, mention @dosu.

hafezmg48 commented 1 month ago

@dosu I believe your response is incorrect. Setting the messages_to_prompt argument does not automatically switch the method from completion to chat, and by setting breakpoints I confirmed that even after passing this argument, messages_to_prompt is never called.

Also, the example solution that manually feeds the user question into messages_to_prompt is not right, because llama-index has its own way of internally transforming the query; that is why the LLM engine accepts messages_to_prompt as an argument in the first place. I need a human to guide me on this matter.

dosubot[bot] commented 1 month ago

This is a tough nut to crack, could really use your help on this, @logan-markewich.

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

Yeah, vllm is not using completion_to_prompt.

Feel free to make a PR; the source code is here: https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-vllm/llama_index/llms/vllm/base.py
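A possible alternative, sketched here under the assumption that running vLLM's OpenAI-compatible server is an option: let the server apply the model's own chat template and connect through the OpenAILike integration (llama-index-llms-openai-like) with chat mode enabled.

# Sketch only: assumes a vLLM OpenAI-compatible server was started with e.g.
#   vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096
# so that vLLM applies the model's chat template server-side.
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike

Settings.llm = OpenAILike(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="none",          # vLLM does not require a real key by default
    is_chat_model=True,      # use the chat endpoint rather than completions
    context_window=4096,
    max_tokens=2048,
)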