Closed priamai closed 4 months ago
🤖
Hey @priamai, great to cross paths with you again! 🚀 Just diving into the issue you've raised. Hold tight, I'll be back with a full answer for you soon. 🕵️♂️🔍
ChatGPT can help you there. You need to pass the source name as a parameter in the query and store it as a node property or a separate node.
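The suggestion above can be sketched as a parameterized Cypher query built in Python. The query text, labels, and property names here are illustrative assumptions, not the template's actual ingest query:

```python
def build_chunk_query(text: str, source: str) -> tuple:
    """Return a parameterized Cypher query and its parameter map.

    The source name travels as the $source parameter and is stored both
    as a property on the chunk and as a separate Document node.
    """
    query = (
        "MERGE (d:Document {source: $source}) "
        "CREATE (c:Chunk {text: $text, source: $source}) "
        "MERGE (c)-[:PART_OF]->(d)"
    )
    params = {"text": text, "source": source}
    return query, params

query, params = build_chunk_query("chunk text", "paper1.pdf")
```

You would then run the query and parameter map through the Neo4j driver's session, keeping the source out of the query string itself.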
I understand. I am running an experiment now; will let you know.
I also see a lot of deprecation warnings; we need to upgrade everything to the latest version of LangChain. The enumeration ID for the parents is also problematic: as I add more documents to the folder, the integer depends on OS file ordering, so I should replace it with the filename. Also, how should we present the source information alongside the response in the UI? And how do we propagate the source info into the stringified response?
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel

chain = (
    RunnableParallel(
        {
            "context": itemgetter("question") | retriever,
            "question": itemgetter("question"),
        }
    )
    | prompt
    | model
    | StrOutputParser()
)
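One way to make sources visible in the response is to fold each document's source into the context string before it reaches the prompt. This is a sketch: format_docs, the stand-in Doc class, and the "source" metadata key are assumptions about how the retriever's documents are shaped.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs) -> str:
    """Prefix each document's text with its source so the model can cite it."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[source: {source}]\n{doc.page_content}")
    return "\n\n".join(parts)
```

In the chain above you could then pipe the retriever through it, i.e. replace the context entry with itemgetter("question") | retriever | format_docs.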
I am going to create a separate repo so you can see my attempts.
🤖
To modify the ingest.py file to load multiple text files, include their source URL, handle deprecation warnings, upgrade to the latest version of LangChain, replace the enumeration ID for parents with the filename, and propagate the source info into the stringified response, you can follow the steps below:
1. Load multiple text files: use a list of URLs and iterate over them. For each URL, create an instance of PyPDFLoader and call its load method to load the document.
2. Include the source URL: add the URL to each document's metadata, for example doc.metadata['source'] = url after loading the document.
3. Handle deprecation warnings: use Python's warnings module. Add warnings.filterwarnings("ignore", category=DeprecationWarning) at the beginning of your script to ignore deprecation warnings.
4. Upgrade to the latest version of LangChain: run pip install --upgrade langchain in your terminal.
5. Replace the enumeration ID for parents with the filename: instead of an enumeration ID, use the filename, for example doc.metadata['parent'] = os.path.basename(url) after splitting the document.
6. Propagate the source info into the stringified response: carry the source metadata through the stored documents, then include that field when the response string is built.
Here is the modified code:
import os
import warnings

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import MongoDBAtlasVectorSearch
from pymongo import MongoClient

# Handle deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

MONGO_URI = os.environ["MONGO_URI"]
DB_NAME = "langchain-test-2"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "default"
EMBEDDING_FIELD_NAME = "embedding"

client = MongoClient(MONGO_URI)
db = client[DB_NAME]
MONGODB_COLLECTION = db[COLLECTION_NAME]

if __name__ == "__main__":
    # Load docs
    urls = ["https://arxiv.org/pdf/2303.08774.pdf", "https://arxiv.org/pdf/2303.08775.pdf"]
    for url in urls:
        loader = PyPDFLoader(url)
        data = loader.load()

        # Add the source URL to each document's metadata
        for doc in data:
            doc.metadata["source"] = url

        # Split docs
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
        docs = text_splitter.split_documents(data)

        # Replace the enumeration ID for parents with the filename
        for doc in docs:
            doc.metadata["parent"] = os.path.basename(url)

        # Insert the documents in MongoDB Atlas Vector Search
        _ = MongoDBAtlasVectorSearch.from_documents(
            documents=docs,
            embedding=OpenAIEmbeddings(disallowed_special=()),
            collection=MONGODB_COLLECTION,
            index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
        )
Please note that split_documents and from_documents preserve each document's metadata, so fields added to it are carried through to the vector store. Also, to upgrade to the latest version of LangChain, you can run pip install --upgrade langchain in your terminal.
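If the source is a URL rather than a local path, note that os.path.basename keeps any query string attached to the name; a hedged alternative (parent_id_from_url is a hypothetical helper, not part of the template) parses the URL first:

```python
import os
from urllib.parse import urlparse

def parent_id_from_url(url: str) -> str:
    """Derive a stable parent ID from a document URL.

    os.path.basename alone would keep any ?query suffix, so take the
    basename of the parsed URL's path component instead.
    """
    return os.path.basename(urlparse(url).path)

parent_id_from_url("https://arxiv.org/pdf/2303.08774.pdf")  # -> "2303.08774.pdf"
```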
Regarding the UI presentation of the source information with the response, it would depend on the specific UI framework you are using. Generally, you could include the source information as part of the response object and then display it in the UI as needed. For example, if you are using a JavaScript-based UI framework, you could include the source information in the JSON response object returned by your API, and then display it in the UI using JavaScript.
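For instance, a minimal sketch of such a response object (the answer and sources field names are assumptions, not a fixed API; plain dicts stand in for document metadata here):

```python
import json

def build_response(answer: str, docs: list) -> str:
    """Bundle the answer with the distinct sources of the retrieved
    documents into a JSON payload the UI can render."""
    sources = sorted({d.get("source", "unknown") for d in docs})
    return json.dumps({"answer": answer, "sources": sources})

build_response("GPT-4 ...", [{"source": "a.pdf"}, {"source": "b.pdf"}, {"source": "a.pdf"}])
# -> '{"answer": "GPT-4 ...", "sources": ["a.pdf", "b.pdf"]}'
```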
As for propagating the source info into the stringified response, you could modify the StrOutputParser: override its parse method so that the source info is added to the response and included in the stringified output.
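Rather than patching StrOutputParser itself, a hedged alternative is a small post-processing step appended at the end of the chain (with_sources is a hypothetical helper, not a LangChain API):

```python
def with_sources(answer: str, sources: list) -> str:
    """Append a deduplicated 'Sources:' line to the stringified answer."""
    seen = list(dict.fromkeys(sources))  # preserve order, drop duplicates
    if not seen:
        return answer
    return f"{answer}\n\nSources: {', '.join(seen)}"

with_sources("The model scores 86%.", ["a.pdf", "b.pdf", "a.pdf"])
# -> "The model scores 86%.\n\nSources: a.pdf, b.pdf"
```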
I hope this helps! If you have any further questions, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
What a bunch of garbage that was; welcome to the era of code-assisted garbage!
Feature request
I would like to learn how to modify ingest.py to load multiple text files and include their source URL.
Motivation
Currently only one file is indexed.
Your contribution
I know Neo4j and LangChain well, but not LangServe. I can help once I get the concept. @efriis and @tomasonjo (original author)