Open · thedevstone opened 1 year ago
Hello,
Thank you for bringing this issue to our attention. The error you're encountering is due to the fact that the `LocalFileStore` is trying to write a `Document` object directly to the file system, which is not possible because the `Document` object is not a bytes-like object.
To resolve this issue, you need to serialize the `Document` objects before storing them in the `LocalFileStore`. You can use Python's built-in `pickle` module to do this. Here's how you can modify the `add_documents` method in the `ParentDocumentRetriever` class to serialize the documents before storing them:
```python
import pickle

# ...

def add_documents(
    self,
    documents: List[Document],
    ids: Optional[List[str]],
    add_to_docstore: bool = True,
) -> None:
    # ...
    if add_to_docstore:
        # Serialize the documents before storing them
        serialized_docs = [(id, pickle.dumps(doc)) for id, doc in full_docs]
        self.docstore.mset(serialized_docs)
```
In this code, `pickle.dumps(doc)` is used to serialize each `Document` object. The serialized documents are then stored in the `LocalFileStore` using the `mset` method.
Please note that you will also need to deserialize the documents when retrieving them from the `LocalFileStore`. You can use `pickle.loads()` for this purpose. When retrieving the documents using the `mget` method from the `LocalFileStore`, remember to deserialize the returned bytes back into `Document` objects.
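For completeness, the read path is symmetric. Here is a rough sketch of that round trip in plain Python, using a stand-in `Document` dataclass rather than LangChain's real class, to show that unpickling the bytes returned by `mget` recovers the original objects:

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class Document:
    # Stand-in for langchain's Document, just for illustration
    page_content: str
    metadata: dict = field(default_factory=dict)

def deserialize_docs(raw_values):
    """Turn the raw bytes returned by docstore.mget back into Documents."""
    return [pickle.loads(raw) if raw is not None else None for raw in raw_values]

# Round trip: what pickle.dumps stored, pickle.loads recovers unchanged
doc = Document(page_content="hello", metadata={"source": "a.pdf"})
restored = deserialize_docs([pickle.dumps(doc), None])
assert restored[0] == doc and restored[1] is None
```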
I hope this helps! If you have any other questions or run into any other issues, please don't hesitate to ask.
Best, Dosu
Thanks, but this seems like a bug. The workaround is fine for now, but I think it should be treated as a bug and fixed in the framework.
Same problem, fix it please
And with RedisStore: redis.exceptions.DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first.
Bump on this — both RedisStore and LocalFileStore are not working.
@nickhausman @k1ng @Giulianini I came to this thread because I had the same issue, but after looking at the commit @eyurtsev (thank you so much!) linked to this issue, I was able to resolve my issue and wanted to make sure that was documented here. Here's what I had before that didn't work:
```python
store = LocalFileStore("./store_location")

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = ...

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
Here are the changes I made that got it to work:
```python
from langchain.storage._lc_store import create_kv_docstore

# ...

fs = LocalFileStore("./store_location")
store = create_kv_docstore(fs)

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = ...

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
It looks like this was made available since v0.0.277.
I hope this helps!
@cam-barts This helped a lot! What is confusing to me is the relationship between the file store `fs` and the vectorstore. E.g. my code here, using Chroma as the vectorstore:
```python
def run_db_build():
    loader = DirectoryLoader(cfg.DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    #store = InMemoryStore()
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    big_chunks_retriever.add_documents(documents)
```
In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this? And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?
> In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this?
@weissenbacherpwc glad it helped. To the best of my knowledge, the `persist_directory` that you pass into the vector store initialization is where specifically only the vector store's data exists. In your example, Chroma doesn't know about `ParentDocumentRetriever` at all. What `ParentDocumentRetriever` does for your chain is create associations between document IDs which are stored in the vector store, which happens to be Chroma but could be anything. Those associations need to live somewhere, and that is what you declare with the `LocalFileStore` path. That path could exist inside of your Chroma directory if you wish it to, and indeed for me that's what I've done in the past, but it doesn't have to.
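To make that association concrete, here is a toy sketch (plain Python, no LangChain) of the two stores the retriever coordinates: child chunks in the vector store carry a parent `doc_id`, while the key-value store at the `LocalFileStore` path maps those IDs back to the full parents:

```python
# What the LocalFileStore holds: parent id -> full parent document
kv_store = {"doc-1": "full parent text ..."}

# What the vector store (e.g. Chroma) holds: small child chunks,
# each tagged with its parent's id
child_index = [
    {"text": "small chunk A", "doc_id": "doc-1"},
    {"text": "small chunk B", "doc_id": "doc-1"},
]

def retrieve(matching_children):
    """Similarity search returns children; we hand back their parents."""
    seen, parents = set(), []
    for child in matching_children:
        if child["doc_id"] not in seen:
            seen.add(child["doc_id"])
            parents.append(kv_store[child["doc_id"]])
    return parents

# Both children map to the same parent, so it is returned only once
assert retrieve(child_index) == ["full parent text ..."]
```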
> And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?
I think when you initialize most chains that use docs in this way, you pass in the retriever itself and not a path. So in this case, if you wanted to use the newly created `big_chunks_retriever`, which is your parent-child retriever object, you'd pass that in directly:
```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI  # Or whatever makes sense for you

my_llm = OpenAI()  # Or whatever makes sense for you
# ...
big_chunks_retriever.add_documents(documents)
retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=big_chunks_retriever)
```
Got it with the directories, thanks!
In the script, the easiest way would be to use the created `big_chunks_retriever` in the RetrievalQA chain. But if one wants to use the retriever in production, it will take too long to create the retriever and call `big_chunks_retriever.add_documents(documents)` every time at startup, right? That's why I wanted to store the retriever somewhere.
@weissenbacherpwc correct! I usually have different ingestion and use scripts, but the way that you build the retriever will be the same. So in one part of your process, you do document upload, and in another part you just use the retriever.
For example, your function that you shared would be really good for ingest. Depending on my requirements, I'd either return the retriever from that function to be used elsewhere, or I'd rebuild it where I need it. Here are two examples:
```python
def run_db_build():
    # ...
    big_chunks_retriever.add_documents(documents)
    return big_chunks_retriever

retriever = run_db_build()  # get retriever object as a global to be reused

def run_qa_chain(query):
    retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=retriever)  # use global retriever
    return retrievalQA.run(query)
```
Alternatively:
```python
# src/ingest.py
def run_db_build():
    # ...
```

```python
# src/build_retriever.py
def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    return big_chunks_retriever
```
That will return the same usable retriever, without needing to actually load any docs.
This PR aims at adding support for document storage in a SQL database: https://github.com/langchain-ai/langchain/pull/15909
I am having the same problem with `LocalFileStore`, but with `MultiVectorRetriever`. I get the type error when I try to store texts, and no type error when I store images.
When I use `create_kv_docstore` as a workaround, my problem is the other way around: a type error with images, but none with texts (the texts are a list of strings, not `Document`s, at that point).
With `InMemoryStore` everything is fine, but that's not an option for real use.
Was this ever fixed for Redis? I'm getting the same error when adding documents to the retriever:
DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first
> @weissenbacherpwc correct! I usually have different ingestion and use scripts, but the way that you build the retriever will be the same. […]
Thanks for the example. One question: in your `rebuild_retriever()` function, the two text splitters are in fact not used, correct? I think they are only used when `ParentDocumentRetriever.add_documents()` is called, so the parameters passed to them do not matter here, I guess?
Any fix for the Redis store? I have the same issue when using it as the docstore with `MultiVectorRetriever`:
DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first.
> Thanks for the example. One question: in your `rebuild_retriever()` function, the two text splitters are in fact not used, correct? I think they are only used when `ParentDocumentRetriever.add_documents()` is called, so the parameters for them do not matter, I guess?
@rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the `ParentDocumentRetriever` could be replaced by a `MultiVectorRetriever` instance, since the difference between the two is the `add_documents` method that the `ParentDocumentRetriever` has. The `rebuild_retriever` function can be implemented this way:
```python
# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/")
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever
```
> @rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the `ParentDocumentRetriever` could be replaced by a `MultiVectorRetriever` instance […]
Nice, this works!! Is there any way to use PGVector as the vector database?
> @rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the `ParentDocumentRetriever` could be replaced by a `MultiVectorRetriever` instance […]
Very informative, thank you. I did not realize the only difference between the two classes is the `add_documents` method.
```python
# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'},
                                       encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    #vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
    #                     persist_directory="chroma_db/")
    vectorstore = Chroma(client=chroma_client, collection_name="ap_collection_parent",
                         embedding_function=embeddings)
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever
```
In the above code, only the parent store is passed, which is a local directory. How is it linked to the child collection that is persisted in Chroma? When using a conversational chain as below:
```python
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=big_chunks_retriever,
    memory=memory, verbose=True, return_source_documents=True
)
```
how will the query first search the child documents and then return the corresponding parent documents? Thanks for the clarification in advance.
> That will return the same usable retriever, without needing to actually load any docs.
Without loading any docs, we won't get any context, right?
In a production environment, I'd have to add the docs so we can retrieve something, and adding the documents while starting up the system will take some time. Part of the reason I want to persist the vector store in a Chroma DB is faster retrieval, so I can skip the indexing part.
Is there a faster way to add docs, or a way to skip that step and get context from the database? Loading the files and then adding docs takes time. How do we skip that part and make the docs part of the Chroma DB while still using ParentDocument retrieval?
> @nickhausman @k1ng @Giulianini I came to this thread because I had the same issue […] It looks like this was made available since v0.0.277.
Great! I will try it, genius!!!
> @weissenbacherpwc glad it helped. To the best of my knowledge, the `persist_directory` that you pass into the vector store initialization is where specifically only the vector store's data exists […]
Hi @cam-barts, thanks for your solution:
```python
# Get elements
one_raw_pdf_elements = partition_pdf(
    filename=file_name,
    languages=["chinese"],
    # strategy='hi_res',
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    extract_image_block_output_dir=self._img_path,
    form_extraction_skip_tables=False,
)
raw_pdf_elements.extend(one_raw_pdf_elements)

# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
# Text
text_elements = [e for e in categorized_elements if e.type == "text"]

embeddings = HuggingFaceEmbeddings(model_name="/mnt/AI/models/embedding_model")
vectorstore = Chroma(
    persist_directory=self._persist_directory,
    embedding_function=embeddings,
)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
fs = LocalFileStore("/mnt/AI/data_base/vector_db/chroma_docs_raw_text_table_image")
store = create_kv_docstore(fs)
id_key = "doc_id"
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
If I want to use `retriever.add_documents(...)`, should I call `retriever.add_documents(one_raw_pdf_elements)`? Is that right?
> how will the query first search for child documents and then return the corresponding parent documents? Thanks for the clarification in advance
Good question! I have the same doubt about the process with child documents (stored in the vector store, e.g. Chroma) and parent documents (stored in the LocalFileStore): how are they handled when I send a query to the LLM?
Touched base with some LangChain folks on this issue. One way of handling this without overriding the default is to use the `EncoderBackedStore` class. We did this with Redis as our base store, but it should be extendable to other stores.
```python
from langchain.storage.encoder_backed import EncoderBackedStore
from langchain.storage import RedisStore
import pickle

def key_encoder(key: int | str) -> str:
    return str(key)

def value_serializer(value) -> bytes:
    # pickle.dumps returns bytes, which Redis accepts
    return pickle.dumps(value)

def value_deserializer(serialized_value: bytes):
    return pickle.loads(serialized_value)

# Create an instance of the base store
abstract_store = RedisStore(redis_url="redis://localhost:6379", namespace="parent_docs")

# Create an instance of the encoder-backed store
encoder_store = EncoderBackedStore(
    store=abstract_store,
    key_encoder=key_encoder,
    value_serializer=value_serializer,
    value_deserializer=value_deserializer,
)

from langchain.retrievers import ParentDocumentRetriever

parent_doc_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=encoder_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
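As a quick sanity check of the serializer pair above (pure Python, no Redis connection needed): values must survive a round trip through bytes, since bytes is what Redis will accept. A plain dict stands in for a `Document` here:

```python
import pickle

def key_encoder(key) -> str:
    return str(key)

def value_serializer(value) -> bytes:
    return pickle.dumps(value)

def value_deserializer(raw: bytes):
    return pickle.loads(raw)

# A dict stands in for a Document for this check; pickle handles either
doc = {"page_content": "hello", "metadata": {"page": 1}}
encoded = value_serializer(doc)
assert isinstance(encoded, bytes)           # Redis-friendly type
assert value_deserializer(encoded) == doc   # lossless round trip
assert key_encoder(42) == "42"              # keys become strings
```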
@rbs333 If I have a huge number of documents, how do I write scalable logic for this?
Bug
LocalFileStore tries to treat a Document as bytes
Here is the error from the broken method:
```
TypeError: memoryview: a bytes-like object is required, not 'Document'
```
Who can help?
No response
Information
Related Components
Reproduction
Create a LocalFileStore and use it as the docstore of a ParentDocumentRetriever.
Expected behavior
Serialize the documents as bytes
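The failure mode itself can be reproduced without LangChain at all. The traceback names `memoryview`, so as a sketch (with a stand-in `Document` class) this shows why handing a non-bytes object to a bytes-expecting write path raises exactly that `TypeError`:

```python
class Document:
    # Stand-in for langchain's Document, just to reproduce the error
    def __init__(self, page_content: str):
        self.page_content = page_content

# The reported traceback points at a memoryview over the stored value;
# wrapping a non-bytes object in a buffer fails the same way
try:
    memoryview(Document("hello"))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

assert error_message is not None
assert "bytes-like object is required" in error_message
```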