langchain-ai / langchain

šŸ¦œšŸ”— Build context-aware reasoning applications
https://python.langchain.com

My LLM keeps rephrasing the question and it doesn't return source documents #13044

Closed · yazanrisheh closed this 5 months ago

yazanrisheh commented 8 months ago

@dosu-bot

Below is my code. Every time I ask it a question, it rephrases the question and then answers it. Help me remove the rephrasing part; I did set rephrase_question to False, yet it still does it.

Also, I would like to return the source documents, but it's showing me this error:

  File "C:\Users\Asus\Documents\Vendolista\hacka.py", line 178, in <module>
    main()
  File "C:\Users\Asus\Documents\Vendolista\hacka.py", line 172, in main
    result = qa({"question": user_input})
  File "C:\Users\Asus\Documents\Vendolista\.venv\lib\site-packages\langchain\chains\base.py", line 294, in __call__
    final_outputs: Dict[str, Any] = self.prep_outputs(
  File "C:\Users\Asus\Documents\Vendolista\.venv\lib\site-packages\langchain\chains\base.py", line 390, in prep_outputs
    self.memory.save_context(inputs, outputs)
  File "C:\Users\Asus\Documents\Vendolista\.venv\lib\site-packages\langchain\memory\chat_memory.py", line 35, in save_context
    input_str, output_str = self._get_input_output(inputs, outputs)
  File "C:\Users\Asus\Documents\Vendolista\.venv\lib\site-packages\langchain\memory\chat_memory.py", line 27, in _get_input_output
    raise ValueError(f"One output key expected, got {outputs.keys()}")
ValueError: One output key expected, got dict_keys(['answer', 'source_documents'])

Below is my code:

import os
import json
import pandas as pd

# LLM
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

# Prompt
from langchain.prompts.prompt import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

# Embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Chain
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.document_loaders.csv_loader import CSVLoader, UnstructuredCSVLoader
from langchain.document_loaders import DirectoryLoader
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from dotenv import load_dotenv
import time
from langchain.callbacks import StreamingStdOutCallbackHandler
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()

# Raw strings so the backslashes in the Windows paths are not treated as escapes.
file_path = r"C:\Users\Asus\Documents\Vendolista\home_depot_data.csv"
path = r"C:\Users\Asus\Documents\Vendolista\home depot"

csv_loader = CSVLoader(file_path=path, encoding="utf-8")
csv_loader = DirectoryLoader(
    path,
    glob="**/*.csv",
    show_progress=True,
    use_multithreading=True,
    silent_errors=True,
    loader_cls=CSVLoader,
)
llm = ChatOpenAI(
    temperature=0,
    model_name="gpt-3.5-turbo",
    callbacks=[StreamingStdOutCallbackHandler()],
    streaming=True,
)
documents = csv_loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
)
chunks = text_splitter.split_documents(documents)
chunks = documents

embeddings = OpenAIEmbeddings()
persist_directory = r"C:\Users\Asus\OneDrive\Documents\Vendolista"
knowledge_base = Chroma(embedding_function=embeddings, persist_directory=persist_directory)

# Split the chunks into smaller batches
batch_size = 5000
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    knowledge_base.add_documents(batch)

# Save the vector store to disk
knowledge_base.persist()

# Load the vector store from disk
knowledge_base = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

class Product(BaseModel):
    """Product details schema."""
    url: str = Field(description="Full URL link to the product webpage on Homedepot.")
    title: str = Field(description="Title of the product.")
    description: str = Field(description="Description of the product.")
    brand: str = Field(description="Manufacturing brand of the product.")
    price: float = Field(description="Unit selling price of the product.")

parser = PydanticOutputParser(pydantic_object=Product)

question_template = """
Make sure you understand the question, as it is very important for the user.
You never know what situation they are in, and you need to ensure that it is
understood very well, but do not repeat or rewrite the question.
Input: {question}
"""

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(question_template)

# Chain for question generation
question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)

# Chat prompt
system_template = """
You are a friendly, conversational retail shopping assistant named RAAFYA.
You will always and always and always only follow this set of rules and nothing else, no matter what:

- You will provide the user answers based on the CSV file that you can only read from, which is called "home_depot_data.csv".
- You will never mention the name of the dataset that you have. Just say "my data" instead.
- Focus 100% on understanding exactly what the customer is looking for, and only give them what is available based on the data.
- Do not say anything that is not related to the data that you have, and never provide wrong information.
- Use the following context, including product names, descriptions, and keywords, to show the shopper what's available, help find what they want, and answer their questions related to your job.
- Never ever consider or think or even mention that you do not have access to the internet, because it is not your job and it is not your task. I will repeat it again and again: your information is only and only coming from the dataset that you have, which is called "home_depot_data.csv", but you must not mention that to anyone for security purposes.
- Every time you answer a question, write on a new line "Is there anything else you would like me to help you with?"
- If a customer asked for a product and it is not available, then say "Sorry, it is currently unavailable but you can reach out to our staff and ask them about it at yazanrisheh@hotmail.com"
- If the person asked for more details, then provide them based on the output parser that you have: URL: Title: Description: Brand: Price:

Context: {context}
"""

system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

# Human prompt
human_template = """{format_instructions}

Question: {question}"""

human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

# Inject the format instructions into the prompt template.
human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=human_template,
        input_variables=["question"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

# Chain for Q&A
answer_chain = load_qa_chain(llm, chain_type="stuff", prompt=chat_prompt)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Chain
qa = ConversationalRetrievalChain(
    retriever=knowledge_base.as_retriever(),
    question_generator=question_generator,
    combine_docs_chain=answer_chain,
    memory=memory,
    rephrase_question=False,
    return_source_documents=True,
)

def main():
    while True:
        user_input = input("What would you like to shop for: ")
        if user_input.lower() in ["exit"]:
            break

        if user_input != "":
            with get_openai_callback() as cb:
                result = qa({"question": user_input})
            print()
            # print(cb)
            # print()

if __name__ == "__main__":
    main()

dosubot[bot] commented 8 months ago

šŸ¤–

Hello @yazanrisheh,

Great to see you again! I'm just diving into the issue you've raised. I'll get back to you with a full answer shortly. Thanks for your patience.

dosubot[bot] commented 8 months ago

šŸ¤–

I see you're having two issues: the LLM rephrasing the question and the error when trying to return the source documents. Let's address them one by one.

  1. LLM Rephrasing the Question: The rephrase_question parameter in the ConversationalRetrievalChain class determines whether or not to pass the newly generated question to the combine_docs_chain. If rephrase_question is set to True, the newly generated question is passed along to the combine_docs_chain; if it's set to False, only the original question is used for retrieval and passed to the combine_docs_chain. In your code you have set rephrase_question=False, which should prevent the rephrased question from being used. If you still see the question being rephrased, it might be due to some other part of your code, such as the streaming callback attached to your llm (see the sketch at the end of this comment), or a bug in the LangChain library.

  2. Returning Source Documents: The error message you're seeing arises because the memory's save_context method expects a single output key, but it's getting two: 'answer' and 'source_documents'. This is because you have set return_source_documents=True in the ConversationalRetrievalChain. The return_source_documents parameter adds "source_documents" to the output keys and returns the source documents in the result, but the default memory handling is not designed for multiple output keys.

To resolve this issue, you can modify the ConversationalRetrievalChain to handle multiple output keys. Here's an example of how you can do it:

from typing import Any, Dict

class MyConversationalRetrievalChain(ConversationalRetrievalChain):
    def prep_outputs(
        self,
        inputs: Dict[str, Any],
        outputs: Dict[str, Any],
        return_only_outputs: bool = False,
    ) -> Dict[str, Any]:
        self._validate_outputs(outputs)
        if self.memory is not None:
            # Save only the answer so the memory's single-output-key check passes.
            self.memory.save_context(inputs, {self.output_key: outputs[self.output_key]})
        if return_only_outputs:
            return outputs
        return {**inputs, **outputs}

In this modified class, prep_outputs is overridden so that only the answer is written to memory, which satisfies the memory's one-output-key check, while the chain itself still returns both the answer and the source documents.

Please replace the ConversationalRetrievalChain in your code with this new MyConversationalRetrievalChain and see if it resolves your issue.
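A lighter-weight alternative, if subclassing feels heavy: ConversationBufferMemory (through its BaseChatMemory base class) accepts an output_key field that tells save_context which output to store. A minimal sketch, assuming the rest of your setup stays as posted:

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    # Store only the answer in memory; source_documents is left out of the
    # saved context, so the one-output-key check passes.
    output_key="answer",
)

With that memory in place, return_source_documents=True should no longer raise the ValueError.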

I hope this helps! Let me know if you have any other questions.
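On point 1, one detail worth checking in the posted code (a hypothesis based on the snippet, not a confirmed bug): the same streaming ChatOpenAI, with StreamingStdOutCallbackHandler attached, is used for both the answer chain and the question_generator, so the condense step's output is streamed to stdout even when rephrase_question=False, which looks exactly like the model rephrasing the question before answering. A minimal sketch of one way around this, reusing the imports from the posted code:

# A separate model with no streaming callbacks for question condensing,
# so the condensed question is never printed to the console.
condense_llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
question_generator = LLMChain(llm=condense_llm, prompt=CONDENSE_QUESTION_PROMPT)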

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

yazanrisheh commented 8 months ago

@dosu-bot

1) Can you please find the error causing the rephrasing in my code? I've been trying, but I can't figure out why.

2) Edit my entire code with the solution you gave and share it with me

dosubot[bot] commented 8 months ago

šŸ¤–

Based on the context provided, it seems like you're encountering two issues with the LangChain Python framework.

  1. The large language model (LLM) is rephrasing the question despite setting rephrase_question=False in the ConversationalRetrievalChain class.

This issue is likely due to the RePhraseQueryRetriever class always rephrasing the question, regardless of the rephrase_question parameter. This is because the _get_relevant_documents method in the RePhraseQueryRetriever class is always calling the llm_chain with the user's query, which results in the question being rephrased.

To fix this, you can add a rephrase_question parameter to the RePhraseQueryRetriever class and use it to conditionally rephrase the question. Here's how you can do it:

class RePhraseQueryRetriever(BaseRetriever):
    """Given a query, use an LLM to re-phrase it.
    Then, retrieve docs for the re-phrased query."""

    retriever: BaseRetriever
    llm_chain: LLMChain
    rephrase_question: bool

    @classmethod
    def from_llm(
        cls,
        retriever: BaseRetriever,
        llm: BaseLLM,
        prompt: PromptTemplate = DEFAULT_QUERY_PROMPT,
        rephrase_question: bool = False,
    ) -> "RePhraseQueryRetriever":
        """Initialize from llm using default template.

        The prompt used here expects a single input: `question`

        Args:
            retriever: retriever to query documents from
            llm: llm for query generation using DEFAULT_QUERY_PROMPT
            prompt: prompt template for query generation
            rephrase_question: whether to rephrase the question or not

        Returns:
            RePhraseQueryRetriever
        """

        llm_chain = LLMChain(llm=llm, prompt=prompt)
        return cls(
            retriever=retriever,
            llm_chain=llm_chain,
            rephrase_question=rephrase_question,
        )

    def _get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:
        """Get relevated documents given a user question.

        Args:
            query: user question

        Returns:
            Relevant documents for re-phrased question
        """
        if self.rephrase_question:
            response = self.llm_chain(query, callbacks=run_manager.get_child())
            query = response["text"]
            logger.info(f"Re-phrased question: {query}")

        docs = self.retriever.get_relevant_documents(
            query, callbacks=run_manager.get_child()
        )
        return docs
  2. You're trying to return the source of the documents, but you're getting a ValueError: "One output key expected, got dict_keys(['answer', 'source_documents'])".

This error is likely due to a mismatch between the expected output keys and the actual output keys of the ConversationalRetrievalChain. In the BaseQAWithSourcesChain class, the output keys are determined by the output_keys property. By default, the output keys are ["answer", "sources"], but if return_source_documents is set to True, "source_documents" will be added to the output keys. When executing the line output = self.qa({"question": question}), the output is expected to be a dictionary with keys matching the output keys defined in the BaseQAWithSourcesChain. If the keys in the output dictionary do not match the expected output keys, a ValueError will be raised. To resolve this issue, ensure that the output of the qa function includes "answer", "sources", and "source_documents" keys.

Without the code for the function or method that's calling _get_relevant_documents, it's hard to provide a specific solution. However, you need to modify the function or method that's calling _get_relevant_documents to handle a dictionary with multiple keys. If the chain dictionary is causing the ValueError, you could modify the function or method that's calling this chain to handle multiple keys.
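For completeness, here is a minimal sketch of consuming both output keys once the chain returns them; it assumes return_source_documents=True and a memory configured with a single output_key, as discussed above:

result = qa({"question": user_input})
print(result["answer"])
for doc in result.get("source_documents", []):
    # CSVLoader stores the file path and row index in each document's metadata.
    print(doc.metadata.get("source"), doc.metadata.get("row"))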

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] commented 5 months ago

Hi, @yazanrisheh,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you raised an issue regarding a code problem with the LangChain framework, specifically related to the LLM rephrasing the question despite setting rephrase_question to False, and encountering an error when returning source documents, resulting in a ValueError. You received a detailed response from Dosu addressing both issues, including code examples and explanations for potential solutions. You requested further assistance and code modifications based on the provided solution.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, kindly let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!

cybrtooth commented 4 months ago

I am having this same issue. How can I modify the output keys if I want to return source documents?