langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

DOC: ParentDocumentRetriever without InMemoryStore #14267

Open austinmw opened 7 months ago

austinmw commented 7 months ago

Issue with current documentation:

Hi, I'm not quite sure how to translate the ParentDocumentRetriever examples so that documents are ingested into OpenSearch in one phase, and a retriever is instantiated against it at a later point.

The examples use an InMemoryStore() for the parent documents. If I wanted to use OpenSearch, would I then need to create two different OpenSearch clusters, one for the parent docs and one for the child docs? Or is there a simpler way to do this?

dosubot[bot] commented 7 months ago

🤖

Hi @austinmw, great to see you again! I appreciate your continued interest in the LangChain project.

Based on your question, it seems like you're trying to use the ParentDocumentRetriever with OpenSearch to ingest documents in one phase and then reconnect to it at a later point. You don't need to create two different OpenSearch clusters for parent and child documents. Instead, you can use the same OpenSearch cluster and add documents to it in different phases.

Here's how you can do it:

  1. Initialize the ParentDocumentRetriever with the appropriate vectorstore, docstore, child_splitter, and parent_splitter. The vectorstore should be an instance of a class that interfaces with OpenSearch, and the docstore should be an instance of a class that interfaces with your document storage system.

  2. Use the add_documents method of the ParentDocumentRetriever to add your documents. If you are adding documents for the first time, you should set add_to_docstore to True and not provide any ids. This will generate random UUIDs for each document and add them to both the vectorstore and the docstore.

  3. If you are reconnecting to the same OpenSearch cluster later and want to add more documents without creating a new cluster, you can again use the add_documents method. This time, you should provide the ids of the documents you are adding and set add_to_docstore to False. This will add the new documents to the vectorstore without adding them to the docstore again.

Here is an example of how you can do this:

# Imports
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryStore()

# Initialize the retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents for the first time (ids are auto-generated
# and add_to_docstore defaults to True)
documents = [...]  # List of your documents
retriever.add_documents(documents)

# Reconnect to the same OpenSearch cluster later and add more documents
new_documents = [...]  # List of your new documents
new_ids = [...]  # List of ids for your new documents
retriever.add_documents(new_documents, ids=new_ids, add_to_docstore=False)

Please replace [...] with your actual data.

The ParentDocumentRetriever class in LangChain does not directly handle the connection with OpenSearch. Instead, it interacts with a vectorstore and a docstore which are passed as arguments during the initialization of the ParentDocumentRetriever instance. The vectorstore and docstore could potentially be connected to OpenSearch, but that would depend on their implementation.
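Since the retriever only ever talks to the vectorstore and docstore abstractions, the two-phase "ingest, then reconnect" flow can be illustrated without any real OpenSearch cluster. The sketch below is a minimal stand-in, not LangChain's API: the class and variable names are hypothetical, and the dict plays the role of any persistent backend (OpenSearch, Redis, SQL, ...).

```python
import uuid

# Hypothetical stand-in for a persistent key-value docstore.
# The retriever only needs mget/mset, so anything that persists
# between processes can back it.
class DictDocstore:
    def __init__(self, backing):
        self.backing = backing  # pretend this is a remote store

    def mset(self, key_value_pairs):
        for key, value in key_value_pairs:
            self.backing[key] = value

    def mget(self, keys):
        return [self.backing.get(key) for key in keys]

# Phase 1: ingest. The "remote" backend survives between phases.
remote = {}
store = DictDocstore(remote)
doc_id = str(uuid.uuid4())
store.mset([(doc_id, "full parent document text")])

# Phase 2: a later process "reconnects" by wrapping the same backend
# again; no second cluster is needed.
store_again = DictDocstore(remote)
print(store_again.mget([doc_id]))
```

The same shape applies to OpenSearch: one cluster, with the vectorstore and the docstore each writing to their own index.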

You can find more details about the ParentDocumentRetriever class in the LangChain repository.

I hope this helps! If you have any more questions, feel free to ask.


austinmw commented 7 months ago

Above example doesn't include OpenSearch at all.

austinmw commented 7 months ago

Updated title, curious if there's any way to do this generally without InMemoryStore, even outside of OpenSearch?

kovcic commented 7 months ago

Hi, I had the same problem. There's no implementation other than the in-memory one (actually, there's only GoogleCloudStorageDocstore in addition). I implemented an UpstashRedisDocstore that internally uses UpstashRedisStore; I'm still testing it. In short, there's no other Docstore, or at least none that I'm aware of.

austinmw commented 7 months ago

Thanks. Hopefully more Docstores are added and/or shared in the near future!

RERobbins commented 6 months ago

+1. I love the ParentDocumentRetriever concept, but without more docstore alternatives it's of limited utility. Any creative ideas?

gcheron commented 5 months ago

This PR aims to add support for document storage in a SQL database: https://github.com/langchain-ai/langchain/pull/15909

ugm2 commented 5 months ago

Am I the only one who thinks that if you are working with a specific store (OpenSearch, ElasticSearch...) you should be able to persist the parent/child data within that store?

austinmw commented 5 months ago

Seems strange to me too. Why not just add an ID to each chunk's metadata and retrieve the surrounding chunks by that metadata ID?
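To my reading, that metadata-ID scheme is essentially what ParentDocumentRetriever already does internally: each child chunk carries its parent's key in metadata, and retrieval maps child hits back to full parents. A runnable toy sketch of the idea, using plain dicts and substring matching in place of a real vectorstore (all names illustrative):

```python
import uuid

parents = {}      # stand-in docstore: parent_id -> full parent text
child_index = []  # stand-in vectorstore: list of (chunk, metadata)

def ingest(parent_text, chunk_size=20):
    parent_id = str(uuid.uuid4())
    parents[parent_id] = parent_text
    # every child chunk remembers its parent via metadata
    for i in range(0, len(parent_text), chunk_size):
        chunk = parent_text[i:i + chunk_size]
        child_index.append((chunk, {"parent_id": parent_id}))
    return parent_id

def retrieve(query):
    # toy "similarity search": substring match against child chunks
    hits = [meta for chunk, meta in child_index if query in chunk]
    # deduplicate parent ids (order-preserving), then fetch full parents
    seen = dict.fromkeys(meta["parent_id"] for meta in hits)
    return [parents[pid] for pid in seen]

ingest("alpha beta gamma delta epsilon zeta")
print(retrieve("gamma"))  # the whole parent comes back, not just the chunk
```

Swapping the dict for a persistent key-value store is exactly the gap this thread is about.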

RERobbins commented 5 months ago

I took the in-memory class and created a MongoDB variant without much difficulty. Objects from that class seem to work fine. I haven't put in the effort to suggest it as an addition to LangChain. I'm happy to share what I've done in case it's useful. It's just a very simple key-value store.
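Such a variant really is just a simple key-value store. Below is a hedged, runnable sketch of a file-backed equivalent that mirrors the BaseStore method names (mget/mset/mdelete/yield_keys); the class name and the one-JSON-file-per-key layout are illustrative choices, not LangChain's API:

```python
import json
from pathlib import Path

class FileDocstore:
    """Minimal file-backed key-value store mirroring the BaseStore
    method names. Each value is stored as one JSON file, so the data
    survives process restarts (unlike InMemoryStore)."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.root / f"{key}.json"

    def mset(self, key_value_pairs):
        for key, value in key_value_pairs:
            self._path(key).write_text(json.dumps(value))

    def mget(self, keys):
        results = []
        for key in keys:
            path = self._path(key)
            results.append(json.loads(path.read_text()) if path.exists() else None)
        return results

    def mdelete(self, keys):
        for key in keys:
            self._path(key).unlink(missing_ok=True)

    def yield_keys(self, prefix=None):
        for path in self.root.glob("*.json"):
            if prefix is None or path.stem.startswith(prefix):
                yield path.stem
```

To plug something like this in as a docstore, each parent Document would need to be serialized to a dict (page_content plus metadata) on mset and reconstructed on mget; check the BaseStore interface of the LangChain version you use before relying on this shape.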


clemenspeters commented 4 months ago

ParentDocumentRetriever for pgvector would be nice 😃

arezazadeh commented 3 months ago

I am trying to do the same thing. From what I understand, the reason it's an InMemoryStore is that it isn't meant for a persistent database; it's for users who want to upload a file, do some searching, and that's it, similar to openai.com, where you can upload a file but it goes away after some time.

Anyway, I love the concept, and I hope I can implement it with my pgvector store.

konradbjk commented 3 months ago

Reading the retriever API docs, I can see that the docstore is a BaseStore. Looking at the methods, it is a key-value interface, so Redis seems a natural fit. @kovcic how is your implementation of UpstashRedisDocstore going?

@clemenspeters ParentDocumentRetriever, to my understanding, does not embed the parent documents; it keeps them as plain text. It would be cost-inefficient to embed the parent documents if we never perform a semantic search on them (paying to embed twice for zero gain).
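The cost point can be made concrete with rough arithmetic (all numbers below are hypothetical):

```python
# Rough, illustrative arithmetic (all numbers hypothetical).
tokens_per_parent = 500
num_parents = 10_000

# The child chunks jointly cover all of the parent text (ignoring
# splitter overlap), so they account for roughly the same token
# volume as the parents themselves.
child_tokens = tokens_per_parent * num_parents
parent_tokens = tokens_per_parent * num_parents

# Embedding the parents as well would roughly double the embedded
# volume, while only the child vectors are ever searched.
assert child_tokens + parent_tokens == 2 * child_tokens
```

Hence storing parents as plain values in a key-value docstore, rather than as vectors, costs nothing in retrieval quality.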

kovcic commented 3 months ago

Reading the retriever API docs, I can see that the docstore is a BaseStore. Looking at the methods, it is a key-value interface, so Redis seems a natural fit. @kovcic how is your implementation of UpstashRedisDocstore going?

Here's what I used:

import { Docstore } from 'langchain/schema';
import { BaseStore } from 'langchain/schema/storage';
import { Document } from 'langchain/document';
import { UpstashRedisStore, UpstashRedisStoreInput } from 'langchain/storage/upstash_redis';
import { first, isEmpty, map } from 'lodash';

// eslint-disable-next-line @typescript-eslint/ban-ts-comment
// @ts-expect-error
export class UpstashRedisDocstore extends Docstore implements BaseStore<string, Document> {
  _store: UpstashRedisStore;

  constructor(fields: UpstashRedisStoreInput) {
    super();

    this._store = new UpstashRedisStore(fields);
  }

  /**
   * Searches for a document in the store based on its ID.
   * @param search The ID of the document to search for.
   * @returns The document with the given ID.
   */
  async search(search: string): Promise<Document> {
    const documents = await this.mget([search]);
    const document = first(documents);

    if (!document) {
      throw new Error(`ID ${search} not found.`);
    }

    return document;
  }

  /**
   * Adds new documents to the store.
   * @param texts An object where the keys are document IDs and the values are the documents themselves.
   * @returns Void
   */
  async add(texts: Record<string, Document>): Promise<void> {
    const keyValuePairs = map(texts, (value, key) => [key, value] as [string, Document]);

    await this.mset(keyValuePairs);
  }

  async mget(keys: string[]): Promise<Document[]> {
    if (isEmpty(keys)) {
      return [];
    }

    const retrievedMessages = await this._store.mget(keys);
    const documents = map(retrievedMessages, (v) => new TextDecoder().decode(v));

    return map(documents, (document) => JSON.parse(document));
  }

  async mset(keyValuePairs: [string, Document][]): Promise<void> {
    const encodedKeyValuePairs = map(
      keyValuePairs,
      ([key, document]) => [key, new TextEncoder().encode(JSON.stringify(document))] as [string, Uint8Array],
    );

    await this._store.mset(encodedKeyValuePairs);
  }

  async mdelete(keys: string[]): Promise<void> {
    await this._store.mdelete(keys);
  }

  async *yieldKeys(prefix?: string): AsyncGenerator<string> {
    // delegate to the underlying store's key iterator
    // (a plain `return` here would end the generator without yielding)
    yield* this._store.yieldKeys(prefix);
  }
}

arezazadeh commented 3 months ago

Guys, I think I did it, but please take a look and tell me if it's efficient.

Below, I get my content from PGVector and add it to a Chroma db that I treat as a temporary vectorstore/collection. If you call this function again (or send another query), the collection is deleted and rewritten with the new relevant content from my PGVector store. I have tested it in dev and it works fine; I still have to see how it behaves when hundreds of users are using it.


def parent_child_retrieval(collection_name, query, k=9):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain_community.vectorstores import Chroma, PGVector
    from langchain_openai import AzureOpenAIEmbeddings

    # Only a child splitter is needed here: the documents retrieved
    # from PGVector below are used as the parents as-is.
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)

    CONNECTION_STRING = get_connection_string()

    azure_embedding = AzureOpenAIEmbeddings(
        azure_deployment="mydeployment",
        openai_api_version="2023-05-15",
    )

    # My Persistent Vector Store
    pgvector_store = PGVector(
        collection_name=collection_name,
        connection_string=CONNECTION_STRING,
        embedding_function=azure_embedding,
    )

    # deleting previous temp collection
    Chroma(
        collection_name=collection_name, embedding_function=azure_embedding
    ).delete_collection()

    # creating new temp collection
    vectorstore = Chroma(
        collection_name=collection_name, embedding_function=azure_embedding
    )

    store = InMemoryStore()

    # Create a retriever based on the temp collection with Chroma 
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
    )

    # selecting similarity search with score from my PGVector Store
    pg_retriever = pgvector_store.as_retriever(
        search_type="similarity_score_threshold", 
        search_kwargs={"score_threshold": 0.67, "k": k}
        )

    # getting relevant documents from my PGVector Store
    docs = pg_retriever.get_relevant_documents(query)

    print("docs length: ", len(docs))

    # adding the relevant documents to the temp collection
    retriever.add_documents(docs)

    return retriever