neo4j-advanced-rag multiple documents

priamai commented 8 months ago

Feature request

I would like to learn how to modify ingest.py to load multiple text files and include their source URLs.

Motivation

Currently only one file is indexed.

Your contribution

I know Neo4j and LangChain well, but not LangServe. I can help once I understand the concept. @efriis and @tomasonjo (original author)

dosubot[bot] commented 8 months ago

🤖

Hey @priamai, great to cross paths with you again! 🚀 Just diving into the issue you've raised. Hold tight, I'll be back with a full answer for you soon. 🕵️‍♂️🔍

tomasonjo commented 8 months ago

ChatGPT can help you there. You need to pass the source name as a parameter in the query and store it as a node property or a separate node.
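A minimal sketch of the "source as a query parameter" idea, assuming a hypothetical `Parent` label and illustrative property names (`id`, `text`, `source`); none of these are taken from the template's ingest.py. It only builds the Cypher text and the parameter map and does not connect to a database.

```python
from pathlib import Path

# Hypothetical Cypher: the Parent label and the property names are
# illustrative assumptions, not the template's actual schema.
STORE_PARENT_QUERY = """
MERGE (p:Parent {id: $parent_id})
SET p.text = $text, p.source = $source
"""


def build_parent_params(path: str, source_url: str, text: str) -> dict:
    """Build the parameter map for one file, using its filename as the ID."""
    return {
        "parent_id": Path(path).name,
        "text": text,
        "source": source_url,
    }


# Example values are placeholders.
params = build_parent_params(
    "docs/dune.txt", "https://example.com/dune.txt", "Dune text"
)
```

With the official Neo4j Python driver, a pair like this could be passed to `session.run(STORE_PARENT_QUERY, params)`. Storing the source on a separate node linked to the parent, the other option mentioned above, would need a second `MERGE` and a relationship instead of the `p.source` property.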

priamai commented 8 months ago

I understand. I am running an experiment now and will let you know.

priamai commented 8 months ago

I also see a lot of deprecation warnings; we need to upgrade everything to the latest version of LangChain. The enumeration ID for the parents is also problematic: as I add more documents to the folder, the integer assigned to each file will vary with the OS, so I should replace it with the filename. I also don't know how the UI should present the source information with the response. How do we propagate the source info into the stringified response?

chain = (
    RunnableParallel(
        {
            "context": itemgetter("question") | retriever,
            "question": itemgetter("question"),
        }
    )
    | prompt
    | model
    | StrOutputParser()
)

I am going to create a separate repo so you can see my attempts.
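The parent-ID concern above can be sketched as follows. This is an assumption-laden illustration, not the template's code: the helper name `parent_ids` and the `*.txt` pattern are hypothetical. It keys each file by its filename and sorts the listing, so the result does not depend on the order in which the OS returns directory entries.

```python
import tempfile
from pathlib import Path


def parent_ids(folder: str, pattern: str = "*.txt") -> dict[str, Path]:
    """Map a stable ID (the filename) to each matching file, in sorted order."""
    return {path.name: path for path in sorted(Path(folder).glob(pattern))}


# Demonstration with a temporary folder; the files are created out of order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt"):
        (Path(tmp) / name).write_text("sample", encoding="utf-8")
    ids = list(parent_ids(tmp))
```

Filenames are only unique within one folder; if documents come from several folders, a relative path or the source URL may be a safer key.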

dosubot[bot] commented 7 months ago

🤖

To modify the ingest.py file to load multiple text files, include their source URL, handle deprecation warnings, upgrade to the latest version of LangChain, replace the enumeration ID for parents with the filename, and propagate the source info into the stringified response, you can follow the steps below:

1. Load multiple text files: You can use a list of URLs and iterate over them. For each URL, you can create an instance of PyPDFLoader and call the load method to load the document.
2. Include the source URL: You can add a new field to each document's metadata. For example, you can set doc.metadata['source'] = url after loading the document.
3. Handle deprecation warnings: You can use the warnings module in Python. You can add warnings.filterwarnings("ignore", category=DeprecationWarning) at the beginning of your script to ignore deprecation warnings.
4. Upgrade to the latest version of LangChain: You can run pip install --upgrade langchain in your terminal.
5. Replace the enumeration ID for parents with the filename: Instead of using an enumeration ID for the parent, you can use the filename. For example, you can set doc.metadata['parent'] = os.path.basename(url) after splitting the document.
6. Propagate the source info into the stringified response: You can modify the from_documents method. You can add a new field to the document metadata that includes the source info. Then, you can modify the from_documents method to include this new field in the stringified response.

Here is the modified code:

import os
import warnings

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import MongoDBAtlasVectorSearch
from pymongo import MongoClient

# Handle deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

MONGO_URI = os.environ["MONGO_URI"]

DB_NAME = "langchain-test-2"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "default"
EMBEDDING_FIELD_NAME = "embedding"
client = MongoClient(MONGO_URI)
db = client[DB_NAME]
MONGODB_COLLECTION = db[COLLECTION_NAME]

if __name__ == "__main__":
    # Load docs
    urls = ["https://arxiv.org/pdf/2303.08774.pdf", "https://arxiv.org/pdf/2303.08775.pdf"]
    for url in urls:
        loader = PyPDFLoader(url)
        data = loader.load()

        # Add the source URL to each page's metadata
        # (Document objects do not support item assignment)
        for doc in data:
            doc.metadata["source"] = url

        # Split docs; chunks inherit the metadata of the page they come from
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
        docs = text_splitter.split_documents(data)

        # Use the filename, not an enumeration index, as the parent ID
        for doc in docs:
            doc.metadata["parent"] = os.path.basename(url)

        # Insert the documents in MongoDB Atlas Vector Search
        _ = MongoDBAtlasVectorSearch.from_documents(
            documents=docs,
            embedding=OpenAIEmbeddings(disallowed_special=()),
            collection=MONGODB_COLLECTION,
            index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
        )

Please note that this code assumes that the split_documents and from_documents methods preserve the extra fields added to each document. If they do not, you will need to adjust these calls accordingly.

For more information, you can refer to the following sources:

templates/rag-mongo/ingest.py

Regarding the UI presentation of the source information with the response, it would depend on the specific UI framework you are using. Generally, you could include the source information as part of the response object and then display it in the UI as needed. For example, if you are using a JavaScript-based UI framework, you could include the source information in the JSON response object returned by your API, and then display it in the UI using JavaScript.

As for the propagation of the source info into the stringified response, you could modify the StrOutputParser to include the source info. You would need to modify the parse method of StrOutputParser to include the source info in the stringified response. For example, you could add a new field to the response dictionary that includes the source info, and then include this new field in the stringified response.
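One alternative to changing the output parser can be sketched as follows: format the retrieved chunks so each carries its source, and return the sources next to the answer instead of inside one string. The `Doc` class here is a stand-in for a retrieved document with `page_content` and `metadata` attributes, and the helper names `format_context` and `with_sources` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Doc]) -> str:
    """Prefix each chunk with its source so the model can cite it."""
    return "\n\n".join(
        f"[source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )


def with_sources(answer: str, docs: list[Doc]) -> dict:
    """Return the answer together with the de-duplicated, sorted sources."""
    sources = sorted({d.metadata["source"] for d in docs if "source" in d.metadata})
    return {"answer": answer, "sources": sources}


# Placeholder chunks; two of them share a source.
docs = [
    Doc("Chunk one", {"source": "a.txt"}),
    Doc("Chunk two", {"source": "b.txt"}),
    Doc("Chunk three", {"source": "a.txt"}),
]
context = format_context(docs)
result = with_sources("Some answer", docs)
```

A UI could then render `result["sources"]` as a list of links under the answer, which avoids parsing sources back out of the generated text.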

I hope this helps! If you have any further questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

priamai commented 7 months ago

What a bunch of garbage that was, welcome to the era of code assisted garbage!