langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

How to make a chatbot respond based on custom data and on data from the internet? #8816

Closed: venturaEffect closed this issue 9 months ago

venturaEffect commented 1 year ago

Issue you'd like to raise.

I've updated the code, but strangely it doesn't find a good response. When I print(response["answer"]), I'm told there is no text to answer the query, even though it fetches information from the internet and the Document list looks well structured. Here is the code:

```python
from googlesearch import search
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import (
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredPowerPointLoader,
)
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())

google_api_key = os.environ.get("GOOGLE_API_KEY")
google_cse_id = os.environ.get("GOOGLE_CSE_ID")

openai.api_key = os.environ['OPENAI_API_KEY']

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

os.environ["GOOGLE_API_KEY"] = google_api_key
os.environ["GOOGLE_CSE_ID"] = google_cse_id

folder_path_docx = "DB\\DB VARIADO\\DOCS"
folder_path_txt = "DB\\BLOG-POSTS"
folder_path_pptx_1 = "DB\\PPT JUNIO"
folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

embedding = OpenAIEmbeddings()

embeddings_content = []
for one_loaded_content in loaded_content:
    embedding_content = embedding.embed_query(one_loaded_content.page_content)
    embeddings_content.append(embedding_content)

db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

search = GoogleSearchAPIWrapper()

def custom_search(query):
    max_results = 3

    internet_results = search.results(query, max_results)
    internet_documents = [
        Document(page_content=result["snippet"], metadata={"source": result["link"]})
        for result in internet_results
    ]
    return internet_documents

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    return_generated_question=True,
)

history = []

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    small = loaded_content[:3]

    combined_results = small + internet_documents
    print(combined_results)

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))
```

Can anyone help me make it work? Appreciated!

Suggestion:

I would like the chatbot to give responses based not just on the custom data but also on what it retrieves from the internet. With what I've done so far, it doesn't work.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hello,

From your code, it seems like you're trying to combine results from your local documents and the internet to generate responses for your chatbot. However, it's not clear how the ConversationalRetrievalChain is handling the documents from the internet.

The ConversationalRetrievalChain in LangChain is designed to handle the retrieval of responses based on the provided documents in a conversational context. It takes in chat history and new questions, and then returns an answer to that question. The algorithm for this chain consists of three parts:

  1. Use the chat history and the new question to create a "standalone question". This is done so that this question can be passed into the retrieval step to fetch relevant documents. If only the new question was passed in, then relevant context may be lacking. If the whole conversation was passed into retrieval, there may be unnecessary information there that would distract from retrieval.

  2. This new question is passed to the retriever and relevant documents are returned.

  3. The retrieved documents are passed to an LLM along with either the new question (default behavior) or the original question and chat history to generate a final response.
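The three steps above can be sketched in plain Python. This is only an illustration of the flow, not LangChain's API: `condense_question`, `retrieve`, and `answer` below are hypothetical stand-ins for the chain's question generator, its retriever, and the final LLM call, with the LLM and embedding search replaced by trivial string logic.

```python
def condense_question(question, chat_history):
    # Step 1: fold the chat history into a standalone question.
    # A real chain does this with an LLM; here we just prepend prior turns.
    if not chat_history:
        return question
    context = " ".join(prev_q for prev_q, _prev_a in chat_history)
    return f"{context} {question}".strip()

def retrieve(standalone_question, corpus, k=3):
    # Step 2: fetch the k documents sharing the most words with the question
    # (a crude stand-in for embedding similarity search).
    words = set(standalone_question.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(words & set(doc.lower().split())))
    return scored[:k]

def answer(question, docs):
    # Step 3: hand the question plus retrieved documents to an LLM
    # (stubbed here as a string join).
    return f"Q: {question} | context: {'; '.join(docs)}"
```

The point of step 1 is that `retrieve` only ever sees one self-contained question, so follow-ups like "and what about Berlin?" still pull the right documents.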

Here is the relevant code:

```python
class ConversationalRetrievalChain(BaseConversationalRetrievalChain):
    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, Any]:
        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
        question = inputs["question"]
        get_chat_history = self.get_chat_history or _get_chat_history
        chat_history_str = get_chat_history(inputs["chat_history"])
        if chat_history_str:
            callbacks = _run_manager.get_child()
            new_question = self.question_generator.run(
                question=question, chat_history=chat_history_str, callbacks=callbacks
            )
        else:
            new_question = question
        docs = self._get_docs(new_question, inputs, run_manager=_run_manager)
        new_inputs = inputs.copy()
        if self.rephrase_question:
            new_inputs["question"] = new_question
        new_inputs["chat_history"] = chat_history_str
        answer = self.combine_docs_chain.run(
            input_documents=docs, callbacks=_run_manager.get_child(), **new_inputs
        )
        output: Dict[str, Any] = {self.output_key: answer}
        if self.return_source_documents:
            output["source_documents"] = docs
        if self.return_generated_question:
            output["generated_question"] = new_question
        return output
```

In your case, you might need to ensure that the documents retrieved from the internet are correctly processed and passed to the ConversationalRetrievalChain.

Also, it's important to note that the ConversationalRetrievalChain uses the combine_docs_chain.run method to generate the final response. This method takes in the retrieved documents and the new question or the original question and chat history. If the documents retrieved from the internet are not correctly processed or if they do not contain relevant information, the final response might not be as expected.
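As the `_call` source above shows, the documents come from `self._get_docs`, i.e. from the retriever the chain was built with; extra keys such as `"documents"` in the input dict are not consulted. One way around this, sketched here as plain Python rather than a LangChain API, is to merge the internet snippets with the locally retrieved documents yourself before they reach the combine step. The `Document` dataclass below is a minimal stand-in for `langchain.docstore.document.Document`, and `merge_documents` is a hypothetical helper, not part of the library:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Minimal stand-in for langchain.docstore.document.Document.
    page_content: str
    metadata: dict = field(default_factory=dict)

def merge_documents(local_docs, internet_docs, k=6):
    """Interleave local and internet documents, dropping duplicates by content."""
    merged, seen = [], set()
    # Interleave so neither source monopolizes the top-k slots.
    for pair in zip(local_docs, internet_docs):
        for doc in pair:
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                merged.append(doc)
    # Append leftovers from whichever list is longer.
    shorter_len = min(len(local_docs), len(internet_docs))
    longer = local_docs if len(local_docs) > len(internet_docs) else internet_docs
    for doc in longer[shorter_len:]:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged[:k]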

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

dosubot[bot] commented 9 months ago

Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you raised was about the code not being able to find a good response when querying custom data and information from the internet. I provided an answer explaining how the ConversationalRetrievalChain in LangChain is designed to handle the retrieval of responses based on provided documents in a conversational context. I also shared relevant code snippets and sources for further reference. However, both you and I reacted with a thumbs down to the comment.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!