run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Help needed with CondenseQuestionChatEngine in llama index #12918

Open pathucr7 opened 2 months ago

pathucr7 commented 2 months ago

Question Validation

Question

I am trying to create a RAG chatbot that answers only from the content I provide through PDFs and never uses its own knowledge to answer or to add information that is outside of that context. At the moment I am facing three issues:

1) The answer length is quite small.
2) The follow-up question that it creates is totally off the mark.
3) Model overload issues.

As I am creating a RAG chatbot for e-book summarization, the answers are meant to be lengthy. I have used HuggingFace both for the embeddings and for the LLM. There was very little documentation to refer to for CondenseQuestionChatEngine, so I had a lot to learn on my own. I thought I had solved it, then realized it was running on the default LLM, which is OpenAI. After fixing that, all the prompts I created for the condense question started to fail. If I run the question through the query engine and print the result, I get the right answers, but when the chat engine tries to make sense of it using the chat history, it goes completely off the grid and asks the wrong question altogether, and the way the bot is set up it then replies with "I don't know". Here is my code for reference:

```python
import streamlit as st
from llama_index.core import Settings
import logging
import sys
import os.path

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core.service_context import set_global_service_context
from llama_index.llms.llama_cpp import LlamaCPP
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint
from llama_cpp import Llama
from langchain.llms import HuggingFaceHub
from llama_index.core.prompts.chat_prompts import ChatPromptTemplate, ChatMessage, MessageRole
from llama_index.core.chat_engine import CondenseQuestionChatEngine, ContextChatEngine
from llama_index.legacy.prompts import ChatPromptTemplate  # note: shadows the ChatPromptTemplate imported above
from llama_index.core.base.llms.types import ChatMessage, MessageRole

try:
    from llama_index import (
        VectorStoreIndex, ServiceContext, Document, SimpleDirectoryReader,
        StorageContext, load_index_from_storage,
    )
except ImportError:
    from llama_index.core import (
        VectorStoreIndex, ServiceContext, Document, SimpleDirectoryReader,
        StorageContext, load_index_from_storage,
    )

from llama_index.llms.huggingface import (
    HuggingFaceInferenceAPI,
    HuggingFaceLLM,
)
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
from huggingface_hub import login

login("MY_HUGGINGFACE_API_KEY")

from transformers import AutoTokenizer

st.set_page_config(
    page_title="Chat with the Streamlit docs, powered by LlamaIndex",
    page_icon="🦙",
    layout="centered",
    initial_sidebar_state="auto",
    menu_items=None,
)

st.title("Chat with the Streamlit docs, powered by LlamaIndex 💬🦙")
st.info("Check out the full tutorial to build this app in our blog post", icon="📃")

from llama_index.core import PromptTemplate

prompt_template = """### System: Use the following pieces of information to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer. Only return the helpful answer below and nothing else. Helpful answer: """

if "messages" not in st.session_state.keys(): # Initialize the chat messages history st.session_state.messages = [ {"role": "assistant", "content": "Ask me a question about the E-books!"} ]

PERSIST_DIR = "./storage"

@st.cache_resource(show_spinner=False) def load_data(): with st.spinner(text="Loading and indexing the E-books – hang tight! This should take 1-2 minutes."):

reader = SimpleDirectoryReader(input_dir="./data", recursive=True)

    # docs = reader.load_data()
    llm = HuggingFaceInferenceAPI(
            generate_kwargs={"temperature": 0.0},
            model_name="meta-llama/Llama-2-70b-chat-hf",
    )
    model_name = "BAAI/bge-large-en"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    embed_model = HuggingFaceBgeEmbeddings(
            model_name=model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs
    )
    service_context=ServiceContext.from_defaults(
                chunk_size=1000,
                chunk_overlap=100,
                embed_model=embed_model,
                llm=llm
    )
    set_global_service_context(service_context)
    if not os.path.exists(PERSIST_DIR):  
        reader = SimpleDirectoryReader(input_dir="./data", recursive=True)
        Settings.llm = llm
        Settings.embed_model = embed_model
        docs = reader.load_data()
        index = VectorStoreIndex.from_documents(documents=docs, service_context=service_context)
        index.storage_context.persist(persist_dir=PERSIST_DIR)
        st.write("LoadEmbedding>>>", index)
        return index
    else:
        # load the existing index
        Settings.llm = llm
        Settings.embed_model = embed_model
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)
        st.write("StoredEmbedding>>>", index)
        return index

index = load_data()

def generate_text(prompt):
    ...  # stub; the body was not included in this snippet and it is not called below

question = "tell me a story with a lesson?"

qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given only the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

refine_prompt_str = (
    "We have the opportunity to refine the original answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question: {query_str}. "
    "If the context isn't useful, output the original answer again.\n"
    "Original Answer: {existing_answer}"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=prompt_template,
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=qa_prompt_str,
    ),
]
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)

# Refine Prompt
chat_refine_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "If the context isn't helpful, just say I don't know. "
            "Don't add any information into the answer that is not available in the context."
        ),
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "New Context: {context_msg}\n"
            "Query: {query_str}\n"
            "Original Answer: {existing_answer}\n"
            "New Answer: "
        ),
    ),
]
refine_template = ChatPromptTemplate(chat_refine_msgs)

custom_prompt = PromptTemplate(
    """\
Given a conversation (between Human and Assistant) and a follow up message from Human, \
rewrite the message to be a standalone question that captures all relevant context \
from the conversation.

{chat_history}
{question}
"""
)

# list of `ChatMessage` objects
custom_chat_history = [
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "Hello assistant, we are having an insightful discussion about the given content, "
            "and you are helping me understand the content by answering, summarizing and "
            "explaining it without changing its true meaning."
        ),
    ),
    ChatMessage(role=MessageRole.ASSISTANT, content="Okay, sounds good."),
]

if "chat_engine" not in st.session_state.keys():  # Initialize the chat engine
    query_engine = index.as_query_engine(
        text_qa_template=text_qa_template,
        refine_template=refine_template,
        llm=Settings.llm,
    )
    chat_engine = CondenseQuestionChatEngine.from_defaults(
        query_engine=query_engine,
        condense_question_prompt=custom_prompt,
        chat_history=custom_chat_history,
        verbose=True,
    )
    st.session_state.chat_engine = chat_engine

if prompt := st.chat_input("Your question"):  # Prompt for user input and save to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

for message in st.session_state.messages:  # Display the prior chat messages
    with st.chat_message(message["role"]):
        st.write(message["content"])

# If the last message is not from the assistant, generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = st.session_state.chat_engine.chat(prompt)
            # response = st.session_state.chat_engine.stream_chat(prompt, chat_engine.chat_history)
            st.write(response.response)
            message = {"role": "assistant", "content": response.response}
            st.session_state.messages.append(message)  # Add response to message history
```

I am now sharing some examples:

1) Here I asked about Mumbai and about a story; knowledge about both of them was provided by me in the form of PDFs.

![image](https://github.com/run-llama/llama_index/assets/42732613/58de5231-d951-435e-96bf-5ac41bd8a701)

As you can see, it answered the first and second questions correctly. But for the third question it was querying a wrong question altogether. It was querying with this:

![image](https://github.com/run-llama/llama_index/assets/42732613/d5648210-adb2-412d-b38b-a89000521c74)

I don't know why it is spamming the queries as if they are in some kind of loop. I know there are a lot of flaws in this code, but please understand that I am new to AI/ML, so kindly help! I think the project is near completion and only needs a few adjustments; I am just confused about what I should do. Your help would be much appreciated.
dosubot[bot] commented 2 months ago

To address the issues you're encountering with your RAG chatbot, here are targeted solutions:

  1. For the short answer length, explore the max_length, min_length, and length_penalty parameters of the HuggingFaceInferenceAPI. Adjusting these can help control the verbosity of the model's responses. More details on these parameters can be found in the HuggingFace documentation (see the sketch after this list).

  2. To improve follow-up question accuracy, refine the condensing prompt or employ semantic search techniques. This involves enhancing the chatbot's ability to understand and generate relevant follow-up questions by leveraging the context more effectively (a condense-prompt sketch is included further below).

  3. To prevent model overload, consider implementing caching for frequent queries and using a load balancer or queue system to manage request loads. Additionally, optimizing the retrieval process by limiting the context considered for each query can help manage the model's workload (a retrieval-limiting sketch is included at the end of this comment).

  4. Enhancing contextual understanding with HuggingFace Embeddings could significantly improve the chatbot's performance. By utilizing sentence or document embeddings, the chatbot can achieve a better semantic understanding of both the queries and the e-book content, leading to more accurate responses.
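For point 1, here is a minimal sketch of what budgeting for longer answers might look like with the `HuggingFaceInferenceAPI` used in the snippet above. Treat the parameter names as assumptions to verify against your installed `llama-index-llms-huggingface` version: `num_output` is the hint LlamaIndex uses when sizing completions, and whether the keys inside `generate_kwargs` are forwarded to the endpoint depends on the version and the model's task.

```python
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

# Sketch (assumptions noted above): request longer completions from the HF Inference API.
llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Llama-2-70b-chat-hf",
    num_output=1024,  # hint LlamaIndex to budget roughly 1024 tokens for each answer
    generate_kwargs={
        "temperature": 0.0,
        "max_new_tokens": 1024,  # assumption: this key is honoured by your version/endpoint
    },
)
```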

Regarding the issue of the chatbot spamming queries as if in a loop, it's crucial to review the CondenseQuestionChatEngine logic to ensure there's no repetitive loop in response generation.
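On point 2 and the looping behaviour, one thing worth checking is that the custom condense prompt in the snippet drops the section labels, so the chat history and the follow-up message run together when the question is rewritten. Below is a sketch of a more explicit prompt; the `<Chat History>` / `<Follow Up Message>` / `<Standalone question>` labels mirror the default condense template in recent LlamaIndex versions, so verify them against your installed version.

```python
from llama_index.core import PromptTemplate
from llama_index.core.chat_engine import CondenseQuestionChatEngine

# Sketch: label each part of the input so the condensing LLM can tell the chat
# history apart from the new follow-up message when rewriting it.
custom_prompt = PromptTemplate(
    """\
Given a conversation (between Human and Assistant) and a follow up message from Human, \
rewrite the message to be a standalone question that captures all relevant context \
from the conversation.

<Chat History>
{chat_history}

<Follow Up Message>
{question}

<Standalone question>
"""
)

# query_engine and custom_chat_history are the objects built in the snippet above.
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    condense_question_prompt=custom_prompt,
    chat_history=custom_chat_history,
    verbose=True,  # prints each condensed question, which makes drift easy to spot
)
```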

These steps should help you refine your chatbot's performance and address the specific challenges you've outlined.
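Finally, on point 3, a small sketch of limiting how much context is retrieved per query. `similarity_top_k` and `response_mode` are standard `as_query_engine` options; the specific values here are only illustrative, and `index`, `text_qa_template`, and `refine_template` come from the snippet above.

```python
# Sketch: cap the number of retrieved chunks so each LLM call stays small.
query_engine = index.as_query_engine(
    similarity_top_k=2,        # send at most two chunks of context per query
    response_mode="compact",   # pack retrieved chunks into as few LLM calls as possible
    text_qa_template=text_qa_template,
    refine_template=refine_template,
)
```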

