langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

How to connect a Chatbot that has its own data but also has access to the internet for search? #8625

Closed: venturaEffect closed this issue 9 months ago

venturaEffect commented 1 year ago

Issue you'd like to raise.

Hi!

How can I make a chatbot that uses its own data but also accesses the internet to get more info (like recent updates)? I've tried and searched everywhere but can't make it work.

Here is the code:

`
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory
    from langchain.vectorstores import DocArrayInMemorySearch
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.document_loaders import (
        UnstructuredWordDocumentLoader,
        TextLoader,
        UnstructuredPowerPointLoader,
    )
    from langchain.tools import Tool
    from langchain.utilities import GoogleSearchAPIWrapper
    from langchain.chat_models import ChatOpenAI
    import os
    import openai
    import sys
    from dotenv import load_dotenv, find_dotenv

    sys.path.append('../..')

    _ = load_dotenv(find_dotenv())  # read local .env file

    google_api_key = os.environ.get("GOOGLE_API_KEY")
    google_cse_id = os.environ.get("GOOGLE_CSE_ID")

    openai.api_key = os.environ['OPENAI_API_KEY']

    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
    os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

    os.environ["GOOGLE_API_KEY"] = google_api_key
    os.environ["GOOGLE_CSE_ID"] = google_cse_id

    folder_path_docx = "DB\\DB VARIADO\\DOCS"
    folder_path_txt = " DB\\BLOG-POSTS"
    folder_path_pptx_1 = "DB\\PPT JUNIO"
    folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

    loaded_content = []

    for file in os.listdir(folder_path_docx):
        if file.endswith(".docx"):
            file_path = os.path.join(folder_path_docx, file)
            loader = UnstructuredWordDocumentLoader(file_path)
            docx = loader.load()
            loaded_content.extend(docx)

    for file in os.listdir(folder_path_txt):
        if file.endswith(".txt"):
            file_path = os.path.join(folder_path_txt, file)
            loader = TextLoader(file_path, encoding='utf-8')
            text = loader.load()
            loaded_content.extend(text)

    for file in os.listdir(folder_path_pptx_1):
        if file.endswith(".pptx"):
            file_path = os.path.join(folder_path_pptx_1, file)
            loader = UnstructuredPowerPointLoader(file_path)
            slides_1 = loader.load()
            loaded_content.extend(slides_1)

    for file in os.listdir(folder_path_pptx_2):
        if file.endswith(".pptx"):
            file_path = os.path.join(folder_path_pptx_2, file)
            loader = UnstructuredPowerPointLoader(file_path)
            slides_2 = loader.load()
            loaded_content.extend(slides_2)

    embedding = OpenAIEmbeddings()

    embeddings_content = []
    for one_loaded_content in loaded_content:
        embedding_content = embedding.embed_query(one_loaded_content.page_content)
        embeddings_content.append(embedding_content)
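    # Note: embeddings_content above is never used again;
    # DocArrayInMemorySearch.from_documents below computes its own
    # embeddings directly from loaded_content.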

    db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

    search = GoogleSearchAPIWrapper()

    def custom_search(query):
        max_results = 3

        internet_results = search.run(query)[:max_results]
        return internet_results

    chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name="gpt-4", temperature=0),
        chain_type="map_reduce",
        retriever=retriever,
        return_source_documents=True,
        return_generated_question=True,
    )

    history = []

    while True:
        query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

        internet_results = custom_search(query)

        combined_results = loaded_content + [internet_results]

        response = chain(
            {"question": query, "chat_history": history, "documents": combined_results})

        print(response["answer"])

        history.append(("system", query))
        history.append(("assistant", response["answer"])) 

`

This is the error message I get: "The document does not provide information on...". So it seems it doesn't have access to the internet, or something else is wrong (?)

I'd really appreciate your suggestions or help!


dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hi,

From your code, it seems like you're trying to combine the results from your local documents and the internet search into one list and then pass it to the ConversationalRetrievalChain. However, the ConversationalRetrievalChain expects the documents parameter to be a list of Document objects, not a list of strings or other data types.

The custom_search function returns a list of search results, but it's not clear what the format of these results is. If they're not Document objects, that could be causing the issue. You might need to convert these search results into Document objects before adding them to the combined_results list.

Here's an example of how you might do this:

def custom_search(query):
    max_results = 3

    internet_results = search.run(query)[:max_results]
    # Convert the search results into Document objects
    internet_documents = [Document(page_content=result) for result in internet_results]
    return internet_documents

# ...

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    combined_results = loaded_content + internet_documents

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))

In this code, the custom_search function has been modified to convert the search results into Document objects. Then, in the main loop, the internet_documents are added to the combined_results list instead of the raw internet_results.

Please note that this is just a suggestion and the actual solution might differ based on the exact format of your search results and the specific requirements of your application.

I hope this helps! Let me know if you have any other questions.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.


venturaEffect commented 1 year ago

I get "Dcoument" is not defined.

Added this: from langchain.docstore.document import Document

But it still seems it hasn't access to internet. Any other suggestion?

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hello,

From your code, it seems like you're trying to use the Document class from the langchain.docstore.document module. However, the error message "Document is not defined" suggests that Python can't find a reference to the Document class. This could be due to a few reasons:

  1. The Document class is not correctly imported. Make sure that the import statement from langchain.docstore.document import Document is at the top of your script and there are no typos.

  2. The Document class is not available in the version of LangChain you're using. Ensure you're using the latest version of LangChain, which is v0.0.250.

  3. There might be a conflict with another module or class named Document in your script. If you have another module or class with the same name, consider renaming it or using an alias in the import statement.

Regarding the issue of the Chatbot not accessing the internet, it's hard to pinpoint the exact issue without more information. However, I can suggest a few things:

  1. Make sure your environment has internet access. You can test this by running a simple Python script that makes a request to a website.

  2. Check if the GoogleSearchAPIWrapper is correctly set up and working. You can do this by running a separate script that only uses the GoogleSearchAPIWrapper to make a request; a minimal sketch follows this list.

  3. Ensure that the custom_search function is correctly implemented and is returning the expected results. You can test this by running the function separately with a test query and checking the output.
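For example, a minimal smoke test along those lines (assuming GOOGLE_API_KEY and GOOGLE_CSE_ID are already set in the environment):

from langchain.utilities import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()

# run() returns one concatenated string of result snippets;
# results() returns a list of dicts with "title", "link" and "snippet" keys.
print(search.run("LangChain"))
print(search.results("LangChain", 3))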

If you're still encountering issues, could you please provide more information about the error messages you're seeing, and any other relevant details about your setup? This will help me better understand the problem and provide a more accurate solution.

Best, Dosu


venturaEffect commented 1 year ago

Document is correctly imported, the LangChain upgrade is done, the GoogleSearchAPIWrapper works in a simple script... Everything seems fine, but when running this code it still doesn't get access to the internet:

`
    from googlesearch import search
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory
    from langchain.vectorstores import DocArrayInMemorySearch
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.document_loaders import (
        UnstructuredWordDocumentLoader,
        TextLoader,
        UnstructuredPowerPointLoader,
    )
    from langchain.tools import Tool
    from langchain.utilities import GoogleSearchAPIWrapper
    from langchain.chat_models import ChatOpenAI
    from langchain.docstore.document import Document
    import os
    import openai
    import sys
    from dotenv import load_dotenv, find_dotenv

    sys.path.append('../..')

    _ = load_dotenv(find_dotenv())  # read local .env file

    google_api_key = os.environ.get("GOOGLE_API_KEY")
    google_cse_id = os.environ.get("GOOGLE_CSE_ID")

    openai.api_key = os.environ['OPENAI_API_KEY']

    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
    os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

    os.environ["GOOGLE_API_KEY"] = google_api_key
    os.environ["GOOGLE_CSE_ID"] = google_cse_id

    folder_path_docx = "DB\\DB VARIADO\\DOCS"
    folder_path_txt = "DB\\-BLOG-POSTS"
    folder_path_pptx_1 = "DB\\PPT JUNIO"
    folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

    loaded_content = []

    for file in os.listdir(folder_path_docx):
        if file.endswith(".docx"):
            file_path = os.path.join(folder_path_docx, file)
            loader = UnstructuredWordDocumentLoader(file_path)
            docx = loader.load()
            loaded_content.extend(docx)

    for file in os.listdir(folder_path_txt):
        if file.endswith(".txt"):
            file_path = os.path.join(folder_path_txt, file)
            loader = TextLoader(file_path, encoding='utf-8')
            text = loader.load()
            loaded_content.extend(text)

    for file in os.listdir(folder_path_pptx_1):
        if file.endswith(".pptx"):
            file_path = os.path.join(folder_path_pptx_1, file)
            loader = UnstructuredPowerPointLoader(file_path)
            slides_1 = loader.load()
            loaded_content.extend(slides_1)

    for file in os.listdir(folder_path_pptx_2):
        if file.endswith(".pptx"):
            file_path = os.path.join(folder_path_pptx_2, file)
            loader = UnstructuredPowerPointLoader(file_path)
            slides_2 = loader.load()
            loaded_content.extend(slides_2)

    embedding = OpenAIEmbeddings()

    embeddings_content = []
    for one_loaded_content in loaded_content:
        embedding_content = embedding.embed_query(one_loaded_content.page_content)
        embeddings_content.append(embedding_content)

    db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

    search = GoogleSearchAPIWrapper()

    def custom_search(query):
        max_results = 3

        internet_results = search.run(query)[:max_results]

        internet_documents = [Document(page_content=result)
                              for result in internet_results]
        return internet_documents

    chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name="gpt-4", temperature=0),
        chain_type="map_reduce",
        retriever=retriever,
        return_source_documents=True,
        return_generated_question=True,
    )

    history = []

    while True:
        query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

        internet_documents = custom_search(query)

        combined_results = loaded_content + internet_documents

        response = chain(
            {"question": query, "chat_history": history, "documents": combined_results})

        print(response["answer"])

        history.append(("system", query))
        history.append(("assistant", response["answer"]))

`

What can I do to make it work? I appreciate any help.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

Hello,

Thank you for providing a detailed description of your issue. From the code you've shared, it seems like you've done a thorough job setting up your environment and implementing the necessary functions. However, to better assist you, I need a bit more information:

  1. Could you please provide the exact error message you're encountering? This will help me understand where the issue might be coming from.
  2. Have you been able to successfully make requests to the internet outside of the LangChain framework in your current environment? This could help us determine if the issue is with the LangChain framework or with the environment itself.
  3. Are you able to successfully retrieve and use data from your local files (DOCS, BLOG-POSTS, PPT JUNIO, DB VARIADO)?
  4. Have you tried running a minimal version of your code that only attempts to access the internet, without the other functionalities? This could help isolate the issue.

Your answers to these questions will help me provide a more accurate solution to your problem. Looking forward to your response.



venturaEffect commented 1 year ago
  1. The message I get when running the script is: "The text does not provide information on the..." (followed by whatever query I asked the chatbot).
  2. Yes, I could.
  3. Yes, it also works.
  4. Yes, I made another chatbot that just answers by searching the internet, and it works.

In conclusion, there is something wrong in my code that doesn't allow the chatbot to look at both the custom data and the internet. I need someone with the knowledge to work out a solution.

Here again is the code I have so far:

`
from googlesearch import search
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import (
    UnstructuredWordDocumentLoader,
    TextLoader,
    UnstructuredPowerPointLoader,
)
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.chat_models import ChatOpenAI
from langchain.docstore.document import Document
import os
import openai
import sys
from dotenv import load_dotenv, find_dotenv

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

google_api_key = os.environ.get("GOOGLE_API_KEY")
google_cse_id = os.environ.get("GOOGLE_CSE_ID")

openai.api_key = os.environ['OPENAI_API_KEY']

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

os.environ["GOOGLE_API_KEY"] = google_api_key
os.environ["GOOGLE_CSE_ID"] = google_cse_id

folder_path_docx = "DB\\DB VARIADO\\DOCS"
folder_path_txt = "DB\\BLOG-POSTS"
folder_path_pptx_1 = "DB\\PPT JUNIO"
folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

embedding = OpenAIEmbeddings()

embeddings_content = []
for one_loaded_content in loaded_content:
    embedding_content = embedding.embed_query(one_loaded_content.page_content)
    embeddings_content.append(embedding_content)

db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

search = GoogleSearchAPIWrapper()

def custom_search(query):
    max_results = 3

    internet_results = search.run(query)[:max_results]

    internet_documents = [Document(page_content=result)
                          for result in internet_results]
    return internet_documents

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    return_generated_question=True,
)

history = []

while True:
    query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

    internet_documents = custom_search(query)

    combined_results = loaded_content + internet_documents

    response = chain(
        {"question": query, "chat_history": history, "documents": combined_results})

    print(response["answer"])

    history.append(("system", query))
    history.append(("assistant", response["answer"]))

`

NOTE: There has to be a problem with the output of internet_documents. I tried several ways of making it match how loaded_content displays, but it still doesn't work. loaded_content shows like this: [Document(page_content="text...", metadata={"source": "path"}), ...]. I changed the code in custom_search to internet_documents = [Document(page_content=result) for result in internet_results], but just got [Document(page_content='E', metadata={}), ...], so it isn't getting the content properly from the internet.
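A likely explanation for that symptom, sketched with plain Python string semantics (search.run returns a single string, so slicing it yields characters rather than results):

# Standalone illustration (no LangChain needed) of why page_content came out as 'E':
snippets = "Example snippet text from Google"  # search.run() returns ONE string like this
first_three = snippets[:3]                     # slicing a string takes characters: "Exa"
print([c for c in first_three])                # ['E', 'x', 'a'] -- one char per "result"
# search.results(query, 3) returns a list of dicts instead, which is what the
# Document(page_content=...) conversion actually needs.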

Appreciate any help!

venturaEffect commented 1 year ago

I've updated the code, but strangely it doesn't find a good response. When I print(response["answer"]) I get that there is no text that answers the query I put in, even though it gets information from the internet and the Documents in the list seem well structured. Here is the code:

`

        from googlesearch import search
        from langchain.chains import ConversationalRetrievalChain
        from langchain.memory import ConversationBufferMemory
        from langchain.vectorstores import DocArrayInMemorySearch
        from langchain.embeddings.openai import OpenAIEmbeddings
        from langchain.document_loaders import (
            UnstructuredWordDocumentLoader,
            TextLoader,
            UnstructuredPowerPointLoader,
        )
        from langchain.tools import Tool
        from langchain.utilities import GoogleSearchAPIWrapper
        from langchain.chat_models import ChatOpenAI
        from langchain.docstore.document import Document
        import os
        import openai
        import sys
        from dotenv import load_dotenv, find_dotenv

        sys.path.append('../..')

        _ = load_dotenv(find_dotenv())  

        google_api_key = os.environ.get("GOOGLE_API_KEY")
        google_cse_id = os.environ.get("GOOGLE_CSE_ID")

        openai.api_key = os.environ['OPENAI_API_KEY']

        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
        os.environ["LANGCHAIN_API_KEY"] = os.environ['LANGCHAIN_API_KEY']

        os.environ["GOOGLE_API_KEY"] = google_api_key
        os.environ["GOOGLE_CSE_ID"] = google_cse_id

        folder_path_docx = "DB\\DB VARIADO\\DOCS"
        folder_path_txt = "DB\\BLOG-POSTS"
        folder_path_pptx_1 = "DB\\PPT JUNIO"
        folder_path_pptx_2 = "DB\\DB VARIADO\\PPTX"

        loaded_content = []

        for file in os.listdir(folder_path_docx):
            if file.endswith(".docx"):
                file_path = os.path.join(folder_path_docx, file)
                loader = UnstructuredWordDocumentLoader(file_path)
                docx = loader.load()
                loaded_content.extend(docx)

        for file in os.listdir(folder_path_txt):
            if file.endswith(".txt"):
                file_path = os.path.join(folder_path_txt, file)
                loader = TextLoader(file_path, encoding='utf-8')
                text = loader.load()
                loaded_content.extend(text)

        for file in os.listdir(folder_path_pptx_1):
            if file.endswith(".pptx"):
                file_path = os.path.join(folder_path_pptx_1, file)
                loader = UnstructuredPowerPointLoader(file_path)
                slides_1 = loader.load()
                loaded_content.extend(slides_1)

        for file in os.listdir(folder_path_pptx_2):
            if file.endswith(".pptx"):
                file_path = os.path.join(folder_path_pptx_2, file)
                loader = UnstructuredPowerPointLoader(file_path)
                slides_2 = loader.load()
                loaded_content.extend(slides_2)

        embedding = OpenAIEmbeddings()

        embeddings_content = []
        for one_loaded_content in loaded_content:
            embedding_content = embedding.embed_query(one_loaded_content.page_content)
            embeddings_content.append(embedding_content)

        db = DocArrayInMemorySearch.from_documents(loaded_content, embedding)
        retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

        search = GoogleSearchAPIWrapper()

        def custom_search(query):
            max_results = 3

            internet_results = search.results(query, max_results)
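            # (results() returns a list of dicts with "title", "link" and "snippet"
            # keys; run() would return one concatenated string instead.)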
            internet_documents = [Document(page_content=result["snippet"], metadata={
                                           "source": result["link"]}) for result in internet_results
                                  ]
            return internet_documents

        chain = ConversationalRetrievalChain.from_llm(
            llm=ChatOpenAI(model_name="gpt-4", temperature=0),
            chain_type="map_reduce",
            retriever=retriever,
            return_source_documents=True,
            return_generated_question=True,
        )

        history = []

        while True:
            query = input("Hola, soy Chatbot. ¿Qué te gustaría saber? ")

            internet_documents = custom_search(query)

            small = loaded_content[:3]

            combined_results = small + internet_documents
            print(combined_results)

            response = chain(
                {"question": query, "chat_history": history, "documents": combined_results})

            print(response["answer"])

            history.append(("system", query))
            history.append(("assistant", response["answer"]))

`

Can anyone help me to make it work? Appreciate it!
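A plausible root cause, offered as a sketch rather than a confirmed diagnosis: ConversationalRetrievalChain.from_llm builds a chain that fetches context only through its retriever, so the extra "documents" key in the input dict is silently ignored and the internet results never reach the model. (Separately, the chain's chat_history is normally a list of (question, answer) tuples rather than ("system", ...)/("assistant", ...) pairs.) One common pattern that does combine local data with live search is an agent with two tools, roughly like this; the tool names and descriptions are illustrative, and retriever and search refer to the objects defined above:

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# Wrap the local vector store in a question-answering chain.
local_qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

tools = [
    Tool(
        name="local-docs",
        func=local_qa.run,
        description="Answers questions from the user's own documents.",
    ),
    Tool(
        name="google-search",
        func=search.run,
        description="Searches the internet for current information.",
    ),
]

# The agent decides per question whether to consult the local documents,
# the internet, or both.
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
answer = agent.run("your question here")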

dosubot[bot] commented 9 months ago

Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you were seeking help to connect a chatbot with its own data to the internet for search. In the comments, there were suggestions and code examples provided by me to convert search results into Document objects and to check the setup and implementation. You made updates to the code but were still encountering issues with accessing the internet and getting a proper response.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

Laxmi-Pravaah commented 1 month ago

How do I connect a chatbot to the internet for real-time access, or to give up-to-date information, using a Gemini model instead of GPT-4? Please help me with resources.
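As a hedged pointer rather than a tested recipe: the tool/agent pattern sketched earlier in this thread is model-agnostic, so a Gemini chat model from the langchain-google-genai package can stand in for ChatOpenAI:

# pip install langchain-google-genai
from langchain_google_genai import ChatGoogleGenerativeAI

# Requires GOOGLE_API_KEY in the environment; "gemini-pro" is one available model name.
llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0)
# Then pass this llm to initialize_agent(...) exactly as in the earlier sketch.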