langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.03k stars 14.64k forks

Type error in ParentDocumentRetriever using LocalFileStore #9345

Open Giulianini opened 1 year ago

Giulianini commented 1 year ago

Bug

LocalFileStore tries to treat a Document as bytes

    store = LocalFileStore(get_project_relative_path("doc_store"))
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    retriever = ParentDocumentRetriever(vectorstore=vectorstore,
                                        docstore=store,
                                        parent_splitter=parent_splitter,
                                        child_splitter=child_splitter)
    if embed:
        docs = []
        data_folder = get_project_relative_path("documents")
        for i, file_path in enumerate(data_folder.iterdir()):
            document = TextLoader(str(file_path))
            docs.extend(document.load())
        retriever.add_documents(docs, None)

Here is the method that fails:

    def mset(self, key_value_pairs: Sequence[Tuple[str, bytes]]) -> None:
        """Set the values for the given keys.

        Args:
            key_value_pairs: A sequence of key-value pairs.

        Returns:
            None
        """
        for key, value in key_value_pairs:
            full_path = self._get_full_path(key)
            full_path.parent.mkdir(parents=True, exist_ok=True)
            full_path.write_bytes(value)

TypeError: memoryview: a bytes-like object is required, not 'Document'
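
The failure can be reproduced without LangChain. The following is a minimal sketch, assuming a stand-in `Document` class and a `try_write` helper (both hypothetical, not LangChain code): `Path.write_bytes` accepts only bytes-like objects, so handing it any other object raises a `TypeError`.

```python
import tempfile
from pathlib import Path


class Document:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


def try_write(path: Path, value) -> str:
    """Attempt the same write_bytes call that mset makes and report the outcome."""
    try:
        path.write_bytes(value)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "some_key"
    bytes_result = try_write(target, b"raw bytes are fine")
    doc_result = try_write(target, Document("hello"))
    print(bytes_result)
    print(doc_result)
```

Anything written through a bytes-only store therefore has to be serialized first, which is what the rest of this thread is about.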

Who can help?

No response

Reproduction

Create a LocalFileStore and use it as the docstore of a ParentDocumentRetriever.

Expected behavior

Serialize the documents as bytes

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for bringing this issue to our attention. The error you're encountering is due to the fact that the LocalFileStore is trying to write a Document object directly to the file system, which is not possible because the Document object is not a bytes-like object.

To resolve this issue, you need to serialize the Document objects before storing them in the LocalFileStore. You can use Python's built-in pickle module to do this. Here's how you can modify the add_documents method in the ParentDocumentRetriever class to serialize the documents before storing them:

import pickle

# ...

def add_documents(
    self,
    documents: List[Document],
    ids: Optional[List[str]],
    add_to_docstore: bool = True,
) -> None:
    # ...
    if add_to_docstore:
        # Serialize the documents before storing them
        serialized_docs = [(id, pickle.dumps(doc)) for id, doc in full_docs]
        self.docstore.mset(serialized_docs)

In this code, pickle.dumps(doc) is used to serialize each Document object. The serialized documents are then stored in the LocalFileStore using the mset method.

Please note that you will also need to deserialize the documents when retrieving them from the LocalFileStore. You can use pickle.loads() for this purpose. When retrieving the documents using the mget method from the LocalFileStore, remember to deserialize the returned bytes into Document objects.
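
As a rough illustration of that serialize-on-write, deserialize-on-read round trip, here is a minimal sketch using only the standard library. The `byte_store` dict, the `serialize` and `deserialize` helpers, and the plain dict standing in for a Document are all assumptions made for the example, not LangChain APIs.

```python
import pickle

# A plain dict stands in for a Document (an assumption that keeps this sketch
# independent of LangChain).
doc = {"page_content": "parent chunk text", "metadata": {"source": "a.txt"}}

byte_store: dict[str, bytes] = {}  # hypothetical stand-in for a bytes-only store


def serialize(value) -> bytes:
    """Turn a picklable object into bytes before storing it."""
    return pickle.dumps(value)


def deserialize(blob: bytes):
    """Restore the object from bytes.

    Only unpickle data you wrote yourself: pickle.loads can execute
    arbitrary code when given untrusted input.
    """
    return pickle.loads(blob)


byte_store["doc-1"] = serialize(doc)         # what mset would receive
restored = deserialize(byte_store["doc-1"])  # what mget would hand back
```

The store only ever sees bytes, and the caller gets back an object equal to the one it stored.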

I hope this helps! If you have any other questions or run into any other issues, please don't hesitate to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Giulianini commented 1 year ago

Thanks. This is fine as a workaround for now, but it seems to be a bug and I think it should be resolved in the framework itself.

k1ng commented 1 year ago

Same problem, fix it please

k1ng commented 1 year ago

And with RedisStore: redis.exceptions.DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first.

nickhausman commented 1 year ago

Bump on this: neither RedisStore nor LocalFileStore is working.

cam-barts commented 11 months ago

@nickhausman @k1ng @Giulianini I came to this thread because I had the same issue, but after looking at the commit @eyurtsev (thank you so much!) linked to this issue, I was able to resolve my issue and wanted to make sure that was documented here. Here's what I had before that didn't work:

    store = LocalFileStore("./store_location")
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    vectorstore = ...
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )

Here are the changes I made that got it to work:

from langchain.storage._lc_store import create_kv_docstore
# ...
    fs = LocalFileStore("./store_location")
    store = create_kv_docstore(fs)
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    vectorstore = ...
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )

It looks like this has been available since v0.0.277.

I hope this helps!
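
Conceptually, a wrapper like this sits between the retriever and the bytes-only store and converts in both directions. The sketch below shows that idea with the standard library only. `BytesFileStore`, `DocumentStoreWrapper` and `Document` are hypothetical stand-ins written for this example, and JSON is an arbitrary choice of encoding; this is not how LangChain implements `create_kv_docstore`.

```python
import json
import tempfile
from pathlib import Path


class Document:
    """Hypothetical stand-in for LangChain's Document."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


class BytesFileStore:
    """Toy bytes-only store: one file per key (not the real LocalFileStore)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def mset(self, pairs):
        for key, value in pairs:
            (self.root / key).write_bytes(value)

    def mget(self, keys):
        paths = [self.root / key for key in keys]
        return [p.read_bytes() if p.exists() else None for p in paths]


class DocumentStoreWrapper:
    """Serializes Documents to bytes on write and restores them on read."""

    def __init__(self, byte_store):
        self.byte_store = byte_store

    def mset(self, pairs):
        encoded = []
        for key, doc in pairs:
            payload = {"page_content": doc.page_content, "metadata": doc.metadata}
            encoded.append((key, json.dumps(payload).encode("utf-8")))
        self.byte_store.mset(encoded)

    def mget(self, keys):
        docs = []
        for blob in self.byte_store.mget(keys):
            if blob is None:
                docs.append(None)  # unknown key
            else:
                data = json.loads(blob.decode("utf-8"))
                docs.append(Document(data["page_content"], data["metadata"]))
        return docs


with tempfile.TemporaryDirectory() as tmp:
    store = DocumentStoreWrapper(BytesFileStore(tmp))
    store.mset([("doc-1", Document("parent text", {"source": "a.txt"}))])
    found, missing = store.mget(["doc-1", "nope"])
```

The retriever keeps talking in Documents while the underlying store keeps talking in bytes, so neither side needs to change.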

weissenbacherpwc commented 9 months ago

@cam-barts This helped a lot! What is confusing to me is the relationship between the file store "fs" and the vectorstore. For example, here is my code using Chroma as the vectorstore:

def run_db_build():
    loader = DirectoryLoader(cfg.DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()

    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})

    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    #store = InMemoryStore()
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)

    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/") 
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    big_chunks_retriever.add_documents(documents)

In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this? And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?

cam-barts commented 9 months ago

In several examples I saw that the LocalFileStore path and the persist_directory path are different. Why is this?

@weissenbacherpwc glad it helped. To the best of my knowledge, the persist_directory that you pass into the vector store initialization is where specifically only the vector store's data exists. In your example, Chroma doesn't know about ParentDocumentRetriever at all. What ParentDocumentRetriever does for your chain is create associations between document IDs which are stored in the vector store, which happens to be Chroma but could be anything. Those associations need to live somewhere, and that is what you declare with the LocalFileStore path. That path could exist inside of your Chroma directory if you wish it to, and indeed for me that's what I've done in the past, but it doesn't have to.

And if I want to use the saved ParentDocumentRetriever in a RetrievalQA chain, which folder do I have to load?

I think when you initialize most chains that use docs in this way, you pass in the retriever itself and not a path. So in this case, if you wanted to use the newly created big_chunks_retriever, which is your parent-child retriever object, you'd pass that in directly:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI # Or whatever makes sense for you

my_llm = OpenAI() # Or whatever makes sense for you

# ...

big_chunks_retriever.add_documents(documents)
retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=big_chunks_retriever)

weissenbacherpwc commented 9 months ago

Got it with the directories, thanks!

In the script, the easiest way would be to use the created big_chunks_retriever directly in the RetrievalQA chain. But if one wants to use the retriever in production, creating it and calling big_chunks_retriever.add_documents(documents) at every startup would take too long, right? That's why I wanted to store the retriever somewhere.

cam-barts commented 9 months ago

@weissenbacherpwc correct! I usually have different ingestion and use scripts, but the way that you build the retriever will be the same. So in one part of your process, you do document upload, and in another part you just use the retriever.

For example, your function that you shared would be really good for ingest. Depending on my requirements, I'd either return the retriever from that function to be used elsewhere, or I'd rebuild it where I need it. Here are two examples:

def run_db_build():
    # ....
    big_chunks_retriever.add_documents(documents)
    return big_chunks_retriever

retriever = run_db_build() # get retriever object as a global to be reused

def run_qa_chain(query):
    retrievalQA = RetrievalQA.from_llm(llm=my_llm, retriever=retriever) # use global retriever
    return retrievalQA.run(query)

Alternatively:

# src/ingest.py
def run_db_build():
    ...  # same ingestion code as above

# src/build_retriever.py
def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/") 
    big_chunks_retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    return big_chunks_retriever

That will return the same usable retriever, without needing to actually load any docs.
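
The reason this works is that the data lives on disk, not in the retriever object. Here is a small standard-library sketch of that split between a slow ingest step and a fast rebuild step. The `ingest` and `rebuild` functions and the JSON file layout are hypothetical, chosen only to illustrate the pattern; they do not mirror how LocalFileStore or Chroma persist data.

```python
import json
import tempfile
from pathlib import Path


def ingest(store_dir: Path, docs: dict) -> None:
    """Slow path, run once: write every parent document to disk."""
    store_dir.mkdir(parents=True, exist_ok=True)
    for doc_id, text in docs.items():
        path = store_dir / f"{doc_id}.json"
        path.write_text(json.dumps({"text": text}), encoding="utf-8")


def rebuild(store_dir: Path):
    """Fast path, run at startup: return a lookup over what is already on disk."""

    def lookup(doc_id: str):
        path = store_dir / f"{doc_id}.json"
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))["text"]

    return lookup


with tempfile.TemporaryDirectory() as tmp:
    store_dir = Path(tmp) / "docstore"
    # Would normally happen in an ingest script, once.
    ingest(store_dir, {"p1": "first parent", "p2": "second parent"})
    # Would normally happen in a separate serving process, at every start.
    lookup = rebuild(store_dir)
    first = lookup("p1")
    unknown = lookup("p3")
```

Rebuilding only reconnects to existing files, so its cost does not grow with the time it took to load, split and embed the documents.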

gcheron commented 7 months ago

This PR aims at adding support for document storage in a SQL database: https://github.com/langchain-ai/langchain/pull/15909

kallebl0mquist commented 7 months ago

I am having the same problem with LocalFileStore, but with the multi-vector retriever. I get the type error when I try to store texts, and no type error when I store images.

When I use create_kv_docstore as a workaround, my problem is the other way around: type error with images, no type error with texts. The texts are a list of strings and not Documents at that point.

With in-memory storage everything is fine, but that's not an option for real use.

tyatabe commented 7 months ago

Was this ever fixed for Redis? I'm getting the same error when adding documents to the retriever: DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first

rchen19 commented 6 months ago

@cam-barts Thanks for the example. One question: in your rebuild_retriever() function, the two text splitters are in fact not used, correct? I think they are only used when ParentDocumentRetriever.add_documents() is called, so their parameters should not matter here.

muazhari commented 5 months ago

Any fix to the Redis store? I have the same issue when using that with MultiVectorRetriever as docstore:

DataError: Invalid input of type: 'Document'. Convert to a bytes, string, int or float first.

thdesc commented 5 months ago

@rchen19 Yes, the two text splitters are not useful here and could be removed. In fact, the ParentDocumentRetriever could be replaced by a MultiVectorRetriever instance since the difference between the two is the add_documents method that the ParentDocumentRetriever has. The rebuild_retriever function can be implemented this way:

# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
                         persist_directory="chroma_db/") 
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever

weissenbacherpwc commented 4 months ago

@thdesc Nice, this works! Is there any way to use PGVector as the vector database?

rchen19 commented 4 months ago

Very informative, thank you. I did not realize the only difference between the two classes is the add_documents method.

parthamadhira commented 4 months ago

# src/build_retriever.py
from langchain.retrievers import MultiVectorRetriever

def rebuild_retriever():
    """Recreate Retriever Object to be reused."""
    # only do what's needed to recreate the retriever
    # no need to actually load or split docs
    embeddings = HuggingFaceEmbeddings(model_name=cfg.EMBEDDING_MODEL_NAME,
                                       model_kwargs={'device': 'mps'}, encode_kwargs={'device': 'mps', 'batch_size': 32})
    fs = LocalFileStore("./chroma_db_filestore")
    store = create_kv_docstore(fs)
    # vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings,
    #                      persist_directory="chroma_db/")
    vectorstore =  Chroma(client=chroma_client, collection_name="ap_collection_parent", 
                             embedding_function=embeddings)
    big_chunks_retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
    )
    return big_chunks_retriever

In the above code, only the parent collection, which is a local directory, is passed in. How is it linked to the child collection that is persisted in ChromaDB? When using a conversational chain as below:

   conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=big_chunks_retriever,
        memory=memory, verbose=True, return_source_documents=True
    )

how will the query first search for child documents and then return the corresponding parent documents? Thanks for the clarification in advance
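
For readers with the same question, the general flow can be sketched without LangChain. To my understanding the retriever searches the child chunks in the vector store, reads the parent ID from each hit's metadata, and then fetches those parents from the docstore. Everything below is a toy written for this example: `ToyVectorStore` ranks by word overlap instead of embeddings, the docstore is a plain dict, and `retrieve_parents` and the `doc_id` metadata key are assumptions, not LangChain's actual code.

```python
import re


def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


class ToyVectorStore:
    """Hypothetical vector store that ranks child chunks by word overlap."""

    def __init__(self):
        self.chunks = []  # list of (text, metadata) pairs

    def add(self, text, metadata):
        self.chunks.append((text, metadata))

    def similarity_search(self, query, k=4):
        query_words = tokenize(query)
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(query_words & tokenize(chunk[0])),
            reverse=True,
        )
        return ranked[:k]


def retrieve_parents(query, vectorstore, docstore, id_key="doc_id", k=4):
    """Search child chunks, then return each matching parent once, in rank order."""
    parent_ids = []
    for _text, metadata in vectorstore.similarity_search(query, k=k):
        parent_id = metadata.get(id_key)
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids if pid in docstore]


# Parents live in the docstore; children carry the parent ID in their metadata.
docstore = {
    "p1": "Cats purr. Cats sleep most of the day.",
    "p2": "Dogs bark. Dogs love long walks.",
}
vs = ToyVectorStore()
vs.add("Cats purr.", {"doc_id": "p1"})
vs.add("Cats sleep most of the day.", {"doc_id": "p1"})
vs.add("Dogs bark.", {"doc_id": "p2"})
vs.add("Dogs love long walks.", {"doc_id": "p2"})

results = retrieve_parents("why do cats purr", vs, docstore, k=2)
```

The link between the two stores is just the shared ID: the vector store never holds the parent text and the docstore never holds embeddings, so both must be populated by the same ingestion run for lookups to resolve.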

codemigs commented 2 months ago

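@cam-barts Regarding "That will return the same usable retriever, without needing to actually load any docs": without loading any docs, we won't get any context, right?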

In a production environment, I'd have to add the docs so we can retrieve something. The process of adding the documents while kicking off the system will take some time. Part of the reason why I want to load the vector store in a chroma db is for faster retrieval and so I can skip the indexing part.

Is there a faster way to add docs or skip that part and get context in a db environment? Loading the files, then adding docs will take some time. How do we skip that part and make the docs part of the chroma db whilst using ParentDocument retrieval?

huangpan2507 commented 2 months ago

@cam-barts Great! I will try it, genius!!!

huangpan2507 commented 2 months ago

Hi @cam-barts, thanks for your solution. Here is my code:

        # Get elements
        one_raw_pdf_elements = partition_pdf(
            filename=file_name,
            languages=["chinese"],
            # strategy='hi_res',
            # Using pdf format to find embedded image blocks
            extract_images_in_pdf=True,
            # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
            # Titles are any sub-section of the document
            infer_table_structure=True,
            # Post processing to aggregate text once we have the title
            chunking_strategy="by_title",
            extract_image_block_output_dir=self._img_path,
            form_extraction_skip_tables=False,
        )
        raw_pdf_elements.extend(one_raw_pdf_elements)

        # Categorize by type
        categorized_elements = []
        for element in raw_pdf_elements:
            if "unstructured.documents.elements.Table" in str(type(element)):
                categorized_elements.append(Element(type="table", text=str(element)))
            elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
                categorized_elements.append(Element(type="text", text=str(element)))

        # Tables
        table_elements = [e for e in categorized_elements if e.type == "table"]

        # Text
        text_elements = [e for e in categorized_elements if e.type == "text"]

        embeddings = HuggingFaceEmbeddings(model_name="/mnt/AI/models/embedding_model")

        vectorstore = Chroma(
            persist_directory=self._persist_directory,
            embedding_function=embeddings,
        )

        parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
        child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
        fs = LocalFileStore("/mnt/AI/data_base/vector_db/chroma_docs_raw_text_table_image")
        store = create_kv_docstore(fs)
        id_key = "doc_id"
        retriever = ParentDocumentRetriever(
            vectorstore=vectorstore,
            docstore=store,
            child_splitter=child_splitter,
            parent_splitter=parent_splitter,
        )

If I want to use retriever.add_documents(xx), should I call retriever.add_documents(one_raw_pdf_elements)? Is that right?

huangpan2507 commented 1 month ago

how will the query first search for child documents and then return the corresponding parent documents? Thanks for the clarification in advance

Good question! I have the same doubt about how the child documents (stored in the vector store, e.g. Chroma) and the parent documents (stored in LocalFileStore) are handled when I send a query to the LLM.

rbs333 commented 1 month ago

Touched base with some LangChain folks on this issue. One way of handling this without overriding the default is to use the EncoderBackedStore class. We did this with Redis as our base store, but the approach should extend to other stores.

from langchain.storage.encoder_backed import EncoderBackedStore
from langchain.storage import RedisStore
from typing import Any
import pickle

def key_encoder(key: int | str) -> str:
    return str(key)

def value_serializer(value: Any) -> bytes:
    # pickle.dumps returns bytes, not str
    return pickle.dumps(value)

def value_deserializer(serialized_value: bytes) -> Any:
    # only unpickle values that this application stored itself
    return pickle.loads(serialized_value)

# Create an instance of the abstract store
abstract_store = RedisStore(redis_url="redis://localhost:6379", namespace="parent_docs")

# Create an instance of the encoder-backed store
encoder_store = EncoderBackedStore(
    store=abstract_store,
    key_encoder=key_encoder,
    value_serializer=value_serializer,
    value_deserializer=value_deserializer
)

from langchain.retrievers import ParentDocumentRetriever

parent_doc_retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=encoder_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)