run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Issue with Answer Refinement in gpt-3.5-turbo Chat Model within RAG Framework #1335

Closed nathanmoura closed 11 months ago

nathanmoura commented 1 year ago

I am currently using the langchain library with ChatOpenAI(model='gpt-3.5-turbo') in a Retrieval Augmented Generation (RAG) setup built on llama-index's GPTSimpleVectorIndex.

However, when I execute the query function, the language model repeatedly responds with: "The original answer remains relevant and does not require refinement based on the additional context provided."

Interestingly, when I switch to the OpenAI(model='text-davinci-003') model, it functions as expected, successfully retrieving information and providing appropriate answers.

It appears that the gpt-3.5-turbo chat model might be unable or unwilling to refine answers generated in previous iterations.
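For reference, the setup is roughly the following (a simplified sketch against the 0.4/0.5-era llama-index API; the data directory, temperature, and exact constructor signatures here are illustrative and may differ from my actual code):

from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # placeholder data directory

# Wrap the langchain chat model so llama-index uses it for response synthesis
llm_predictor = LLMPredictor(llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0))
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)

# Swapping in the completion model is what makes it behave as expected:
# llm_predictor = LLMPredictor(llm=OpenAI(model_name="text-davinci-003"))

# With similarity_top_k=3 the answer is synthesized from the first chunk and
# then refined once per additional chunk -- that refine step is where the
# "does not require refinement" responses show up
response = index.query("my query", similarity_top_k=3)
print(response)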

EDIT: Upon further investigation, by setting the logging module to DEBUG level, I noticed that the chat model does provide accurate answers during the intermediate iterations. The issue occurs only when the final prompt is as follows:

Human: We have the opportunity to refine the above answer (only if needed) with some more context below. [context chunk here] Given the new context, refine the original answer to better answer the question. If the context isn't useful, output the original answer again.

In this case, the model consistently states that no refinement is needed, instead of outputting the original answer again.
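For anyone who wants to reproduce this, setting the logging module to DEBUG level is just the standard Python setup (nothing llama-index specific), and it prints every intermediate prompt and response to stdout:

import logging
import sys

# Show all llama-index/langchain debug output, including the prompt and
# response of each intermediate refine step
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)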

Does anyone have a fix for this?

harishgoli9 commented 1 year ago

Have the exact same problem. Would love to see how the rest of you are dealing with it.

logan-markewich commented 1 year ago

@harishgoli9 @NathanMoura

The issue here is with the refine template. GPT-3.5 was recently "updated" and struggles with this process.

I've been working on a new refine prompt. Maybe try it out and let me know if it helps! I'm trying to see if it helps enough people to make a PR changing the default:

from langchain.prompts.chat import (
    AIMessagePromptTemplate,
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

from llama_index.prompts.prompts import RefinePrompt

# Refine prompt: replay the original query and the existing answer as chat
# history, then ask for an update in a follow-up human message
CHAT_REFINE_PROMPT_TMPL_MSGS = [
    HumanMessagePromptTemplate.from_template("{query_str}"),
    AIMessagePromptTemplate.from_template("{existing_answer}"),
    HumanMessagePromptTemplate.from_template(
        "I have more context below which can be used "
        "(only if needed) to update your previous answer.\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
        "Given the new context, update the previous answer to better "
        "answer my previous query."
        "If the previous answer remains the same, repeat it verbatim. "
        "Never reference the new context or my previous query directly.",
    ),
]

CHAT_REFINE_PROMPT_LC = ChatPromptTemplate.from_messages(CHAT_REFINE_PROMPT_TMPL_MSGS)
CHAT_REFINE_PROMPT = RefinePrompt.from_langchain_prompt(CHAT_REFINE_PROMPT_LC)
...
index.query("my query", similarity_top_k=3, refine_template=CHAT_REFINE_PROMPT)

harishgoli9 commented 1 year ago

I just switched to the gpt-4 model but would love to go back to gpt-3.5-turbo when usage ramps up.

logan-markewich commented 11 months ago

We also merged a recent change to the refine prompt for gpt-3.5. Going to close this issue for now, feel free to reach out on discord though!

testliopavel commented 11 months ago

@logan-markewich it still looks like this issue is present for me with v0.7.15. It takes 10-15 refine iterations for gpt-35-turbo (through the Langchain flow for Azure OpenAI) to generate an answer. Adding any kind of query preamble/prompt template pushes the wait for an answer up to 300 seconds.

I understand that part of the issue is on the Azure OpenAI side, but I can also see while debugging that a useful answer is often already available from the first response. How could this be resolved? Is there any way to prevent it from occurring?

Just a while (2 weeks) ago it was working blazingly fast...
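To illustrate what I mean (a rough sketch against the v0.7 query-engine API, assuming index is an already-built VectorStoreIndex backed by the Azure gpt-35-turbo deployment; that setup is omitted here): each retrieved chunk beyond the first triggers an extra refine round-trip, so the wait grows with similarity_top_k, and the debug log shows whether the first response already contained the answer.

import logging
import sys
import time

# Assumes `index` is an existing VectorStoreIndex whose service context points
# at the Azure gpt-35-turbo deployment (construction omitted)
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)  # shows each refine round-trip

query_engine = index.as_query_engine(similarity_top_k=3)

start = time.time()
response = query_engine.query("my query")
print(response)
print(f"elapsed: {time.time() - start:.1f}s")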