austinmw opened 7 months ago
🤖
Hi @austinmw, great to see you again! I appreciate your continued interest in the LangChain project.
Based on your question, it seems like you're trying to use the `ParentDocumentRetriever` with OpenSearch to ingest documents in one phase and then reconnect to it at a later point. You don't need to create two different OpenSearch clusters for parent and child documents. Instead, you can use the same OpenSearch cluster and add documents to it in different phases.
Here's how you can do it:
1. Initialize the `ParentDocumentRetriever` with the appropriate `vectorstore`, `docstore`, `child_splitter`, and `parent_splitter`. The `vectorstore` should be an instance of a class that interfaces with OpenSearch, and the `docstore` should be an instance of a class that interfaces with your document storage system.
2. Use the `add_documents` method of the `ParentDocumentRetriever` to add your documents. If you are adding documents for the first time, set `add_to_docstore` to `True` and do not provide any `ids`. This will generate a random UUID for each document and add the documents to both the vectorstore and the docstore.
3. If you are reconnecting to the same OpenSearch cluster later and want to add more documents without creating a new cluster, use the `add_documents` method again. This time, provide the `ids` of the documents you are adding and set `add_to_docstore` to `False`. This will add the new documents to the vectorstore without adding them to the docstore again.
Here is an example of how you can do this:
```python
# Imports
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers import ParentDocumentRetriever

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()

# Initialize the retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents for the first time
documents = [...]  # List of your documents
retriever.add_documents(documents)

# Reconnect to the same OpenSearch cluster later and add more documents
new_documents = [...]  # List of your new documents
new_ids = [...]  # List of ids for your new documents
retriever.add_documents(new_documents, ids=new_ids, add_to_docstore=False)
```
Please replace `[...]` with your actual data.
The `ParentDocumentRetriever` class in LangChain does not directly handle the connection with OpenSearch. Instead, it interacts with a `vectorstore` and a `docstore` which are passed as arguments when the `ParentDocumentRetriever` instance is initialized. The `vectorstore` and `docstore` could potentially be connected to OpenSearch, but that would depend on their implementation.
You can find more details about the `ParentDocumentRetriever` class in the LangChain repository.
I hope this helps! If you have any more questions, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Above example doesn't include OpenSearch at all.
Updated title; curious if there's any way to do this generally without `InMemoryStore`, even outside of OpenSearch?
Hi, I had the same problem as you. There's no implementation other than in-memory (actually, there's only `GoogleCloudStorageDocstore` in addition). I implemented an `UpstashRedisDocstore` which internally uses `UpstashRedisStore`... still testing it. But in short, there's no other `Docstore`, or at least I'm not aware of any.
Thanks. Hopefully more `Docstore` implementations are added and/or shared in the near future!
+1. I love the ParentDocumentRetriever concept, but without more docstore alternatives it's of limited utility. Any creative ideas?
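For anyone exploring alternatives in the meantime, here is a minimal sketch of a persistent key-value docstore using only the Python standard library. It mirrors the `mget`/`mset`/`mdelete`/`yield_keys` shape of LangChain's `BaseStore` without importing LangChain; the class name and the JSON-serializable dicts standing in for `Document` objects are illustrative assumptions, not a real LangChain adapter.

```python
# Sketch of a persistent key-value docstore (hypothetical, not part of LangChain).
# It mirrors the BaseStore method shape (mget/mset/mdelete/yield_keys) on top of
# stdlib sqlite3, so parent documents survive process restarts.
import json
import sqlite3


class SQLiteDocstore:
    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS docs (key TEXT PRIMARY KEY, value TEXT)"
        )

    def mset(self, key_value_pairs):
        # Upsert each (key, document) pair; documents are stored as JSON text.
        self._conn.executemany(
            "INSERT OR REPLACE INTO docs (key, value) VALUES (?, ?)",
            [(k, json.dumps(v)) for k, v in key_value_pairs],
        )
        self._conn.commit()

    def mget(self, keys):
        # Return documents in the same order as `keys`, None for misses.
        if not keys:
            return []
        placeholders = ",".join("?" * len(keys))
        rows = dict(
            self._conn.execute(
                f"SELECT key, value FROM docs WHERE key IN ({placeholders})", keys
            )
        )
        return [json.loads(rows[k]) if k in rows else None for k in keys]

    def mdelete(self, keys):
        self._conn.executemany(
            "DELETE FROM docs WHERE key = ?", [(k,) for k in keys]
        )
        self._conn.commit()

    def yield_keys(self, prefix=""):
        for (k,) in self._conn.execute(
            "SELECT key FROM docs WHERE key LIKE ?", (prefix + "%",)
        ):
            yield k
```

A real adapter would serialize/deserialize actual `Document` objects instead of plain dicts, but the storage logic would be the same.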
This PR aims at adding support for document storage in a SQL database: https://github.com/langchain-ai/langchain/pull/15909
Am I the only one who thinks that if you are working with a specific store (OpenSearch, ElasticSearch...) you should be able to persist the parent/child data within that store?
Seems strange to me too. Why not just add ID in metadata for each chunk and retrieve the surrounding chunks by metadata ID?
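That idea can be sketched in plain Python (no LangChain involved; the names and the substring match standing in for a vector similarity search are all illustrative):

```python
# Illustrative sketch of the "parent id in metadata" idea: each child chunk
# carries its parent's id, and after a chunk-level search the full parent
# text is looked up by that id.

parents = {
    "p1": "Full text of parent document one...",
    "p2": "Full text of parent document two...",
}

chunks = [
    {"text": "parent document one", "metadata": {"parent_id": "p1"}},
    {"text": "document two", "metadata": {"parent_id": "p2"}},
]


def retrieve_parents(query, chunks, parents):
    """Naive substring 'search' standing in for vector similarity search."""
    hit_ids = {c["metadata"]["parent_id"] for c in chunks if query in c["text"]}
    return [parents[pid] for pid in sorted(hit_ids)]


print(retrieve_parents("document one", chunks, parents))
```

In a real setup the chunk search would be the vector store query and `parents` would live in whatever persistent store you already run, which is exactly the point being made above.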
I took the in-memory class and created a MongoDB variant without much difficulty. Objects from that class seem to work fine. I haven't put in the effort to suggest that as an addition to LangChain. I'm happy to share what I've done in case it's useful. It's just a very simple key-value store.
ParentDocumentRetriever for pgvector would be nice 😃
I am trying to do the same thing, but from what I understand, the reason it's an in-memory store is that it's not meant for a persistent database, but rather for users who want to upload a file, do some searching, and that's it; something similar to openai.com, where you can upload a file but the file goes away after some time.
Anyway, I love the concept, and I hope I can implement it with my pgvector store.
Reading the retriever API docs, I can see that the `docstore` is a subclass of `BaseStore`. Looking at the methods, it is a key-value store, hence Redis seems a native fit. @kovcic how is your implementation of `UpstashRedisDocstore` going?
@clemenspeters `ParentDocumentRetriever`, to my understanding, does not embed the parent documents; it keeps them as plain text. It would be cost-inefficient to embed parent documents if we do not perform a semantic search on them (paying to embed twice for zero gain).
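That cost point can be made concrete with back-of-envelope arithmetic (every figure below is an illustrative assumption, not real pricing):

```python
# Rough cost comparison: embedding child chunks only vs. children plus parents.
# All numbers are illustrative assumptions.
price_per_1k_tokens = 0.0001   # hypothetical embedding price per 1k tokens
child_tokens = 1_000_000       # total tokens across all child chunks
parent_tokens = 1_000_000      # parents cover roughly the same text again

children_only = child_tokens / 1000 * price_per_1k_tokens
with_parents = (child_tokens + parent_tokens) / 1000 * price_per_1k_tokens

print(children_only, with_parents)
```

Since the child chunks already tile the parent text, embedding the parents roughly doubles the embedding bill while the child-level similarity search is unchanged.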
> Reading the retriever API docs, I can see that the `docstore` is a class of `BaseStore`. Looking at the methods, it is Key-Value pair hence Redis seems native. @kovcic how is your implementation of `UpstashRedisDocstore` going?

Here's what I used:
```typescript
import { Docstore } from 'langchain/schema';
import { BaseStore } from 'langchain/schema/storage';
import { Document } from 'langchain/document';
import { UpstashRedisStore, UpstashRedisStoreInput } from 'langchain/storage/upstash_redis';
import { first, isEmpty, map } from 'lodash';

// eslint-disable-next-line @typescript-eslint/ban-ts-comment
// @ts-expect-error
export class UpstashRedisDocstore extends Docstore implements BaseStore<string, Document> {
  _store: UpstashRedisStore;

  constructor(fields: UpstashRedisStoreInput) {
    super();
    this._store = new UpstashRedisStore(fields);
  }

  /**
   * Searches for a document in the store based on its ID.
   * @param search The ID of the document to search for.
   * @returns The document with the given ID.
   */
  async search(search: string): Promise<Document> {
    const documents = await this.mget([search]);
    const document = first(documents);
    if (!document) {
      throw new Error(`ID ${search} not found.`);
    }
    return document;
  }

  /**
   * Adds new documents to the store.
   * @param texts An object where the keys are document IDs and the values are the documents themselves.
   * @returns Void
   */
  async add(texts: Record<string, Document>): Promise<void> {
    const keyValuePairs = map(texts, (value, key) => [key, value] as [string, Document]);
    await this.mset(keyValuePairs);
  }

  async mget(keys: string[]): Promise<Document[]> {
    if (isEmpty(keys)) {
      return [];
    }
    const retrievedMessages = await this._store.mget(keys);
    const documents = map(retrievedMessages, (v) => new TextDecoder().decode(v));
    return map(documents, (document) => JSON.parse(document));
  }

  async mset(keyValuePairs: [string, Document][]): Promise<void> {
    const encodedKeyValuePairs = map(
      keyValuePairs,
      ([key, document]) => [key, new TextEncoder().encode(JSON.stringify(document))] as [string, Uint8Array],
    );
    await this._store.mset(encodedKeyValuePairs);
  }

  async mdelete(keys: string[]): Promise<void> {
    await this._store.mdelete(keys);
  }

  async *yieldKeys(prefix?: string): AsyncGenerator<string> {
    // `yield*` delegates to the underlying store's generator; a bare `return`
    // would end this generator without yielding any keys.
    yield* this._store.yieldKeys(prefix);
  }
}
```
Guys, I think I did it, but please take a look and tell me if it's efficient:
Below I get my content from PGVector and add it to a Chroma db, which I treat as a temporary vectorstore/collection. If you call this function again (or send another query), the collection is deleted and rewritten with the new relevant content from my PG store. I have tested it in dev and it works fine; I still have to see how it behaves when hundreds of users are using it.
```python
def parent_child_retrieval(collection_name, query, k=9):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain_community.vectorstores import Chroma, PGVector
    from langchain_openai import AzureOpenAIEmbeddings

    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)
    # Note: parent_splitter is never passed to the retriever below, so the
    # documents pulled from PGVector are stored as parents unsplit.
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)

    CONNECTION_STRING = get_connection_string()  # user-defined helper
    azure_embedding = AzureOpenAIEmbeddings(
        azure_deployment="mydeployment",
        openai_api_version="2023-05-15",
    )

    # My persistent vector store
    pgvector_store = PGVector(
        collection_name=collection_name,
        connection_string=CONNECTION_STRING,
        embedding_function=azure_embedding,
    )

    # Delete the previous temp collection
    Chroma(
        collection_name=collection_name, embedding_function=azure_embedding
    ).delete_collection()

    # Create a new temp collection
    vectorstore = Chroma(
        collection_name=collection_name, embedding_function=azure_embedding
    )
    store = InMemoryStore()

    # Create a retriever based on the temp collection with Chroma
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
    )

    # Similarity search with score threshold on the PGVector store
    pg_retriever = pgvector_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"score_threshold": 0.67, "k": k},
    )

    # Get the relevant documents from the PGVector store
    docs = pg_retriever.get_relevant_documents(query)
    print("docs length: ", len(docs))

    # Add the relevant documents to the temp collection
    retriever.add_documents(docs)
    return retriever
```
Issue with current documentation:
Hi, I'm not quite sure how to translate the `ParentDocumentRetriever` examples to ingest documents to OpenSearch in one phase and then reconnect to it by instantiating a retriever at a later point. The examples use an `InMemoryStore()` for the parent documents. Is the idea then that, if I wanted to use OpenSearch, it would be necessary to create two different OpenSearch clusters, one for the parent docs and one for the child docs? Or is there a simpler way to do this?