theodo-group / LLPhant

LLPhant - A comprehensive PHP Generative AI Framework using OpenAI GPT 4. Inspired by Langchain

[Feature] Support small-to-big retrieval #179

Open synio-wesley opened 1 month ago

synio-wesley commented 1 month ago

What I want to achieve basically is something like parent document retrieval and/or sentence window retrieval.

Basically I want to create smaller chunks so that more targeted vectors are computed for them. But because we lose a lot of context that way, I also want to save bigger parent chunks. The smaller chunks should then point to the bigger chunks using metadata or something similar.

So basically we retrieve the smaller chunks using the vector distance algorithm like before, but then we grab the bigger parent chunks, which contain more context, and feed those to the LLM instead.

I'm not sure if this is supported yet or if we can add this functionality ourselves easily without modifying LLPhant?

MaximeThoonsen commented 1 month ago

Hey @synio-wesley ,

This makes a lot of sense, even more so since the context windows of a lot of LLMs have increased a lot. It is not native yet. This would require changing the splitting step and the storing step so we can find parents very easily. Did you have a technical solution in mind?

synio-wesley commented 1 month ago

@MaximeThoonsen yes, I have something running locally which seems to work OK for my purposes. But it might require some tweaking, and I'm not sure if it works as well with other vector DBs (I've been working/testing with Redis).

f-lombardo commented 1 month ago

This feature would be really interesting. I'm not sure I understand how the particular kind of vector store could influence it, though. @synio-wesley could you elaborate?

synio-wesley commented 1 month ago

@f-lombardo because we somehow have to retrieve the related pieces... but I might not be doing it in the best way yet.

Right now I have implemented it without modifying DocumentSplitter and without changing the way documents are stored.

I added a fetchDocumentsByChunkRange(string $sourceType, string $sourceName, int $chunkStart, int $chunkEnd) method to VectorStoreBase with a default implementation that returns an empty array.
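For reference, a minimal sketch of what that default could look like on the base class. The namespace and the Document class follow LLPhant's layout, but treat the details as assumptions for the example rather than the actual library code:

```php
<?php

namespace LLPhant\Embeddings\VectorStores;

use LLPhant\Embeddings\Document;

abstract class VectorStoreBase
{
    // ... existing methods (addDocument, similaritySearch, ...) ...

    /**
     * Fetch the stored chunks of one source document whose chunk number
     * lies in [$chunkStart, $chunkEnd]. The default returns an empty array,
     * so stores that don't support this lookup keep working unchanged.
     *
     * @return Document[]
     */
    public function fetchDocumentsByChunkRange(
        string $sourceType,
        string $sourceName,
        int $chunkStart,
        int $chunkEnd
    ): array {
        return [];
    }
}
```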

For RedisVectorStore I fetch chunks with matching $sourceType and $sourceName from chunk $chunkStart to $chunkEnd using Predis\Client::jsonmget().
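As a rough illustration of the Redis side: only the use of Predis\Client::jsonmget() comes from the description above; the key scheme, the document fields, and $this->client / $this->prefix are assumptions made for the sketch, not LLPhant's actual storage layout.

```php
<?php

// Inside RedisVectorStore (simplified). Assumes one JSON document per chunk,
// keyed by "<prefix>:<sourceType>:<sourceName>:<chunkNumber>" - a hypothetical
// scheme chosen only for this example.

/** @return Document[] */
public function fetchDocumentsByChunkRange(
    string $sourceType,
    string $sourceName,
    int $chunkStart,
    int $chunkEnd
): array {
    $keys = [];
    for ($chunk = $chunkStart; $chunk <= $chunkEnd; $chunk++) {
        $keys[] = sprintf('%s:%s:%s:%d', $this->prefix, $sourceType, $sourceName, $chunk);
    }

    // JSON.MGET returns the JSON value at the given path for every key,
    // or null for keys that don't exist (e.g. the range runs past the last chunk).
    $rawDocs = $this->client->jsonmget($keys, '$');

    $documents = [];
    foreach ($rawDocs as $raw) {
        if ($raw === null) {
            continue;
        }
        $data = json_decode($raw, true)[0];

        $document = new Document();
        $document->content = $data['content'];
        $document->sourceType = $data['sourceType'];
        $document->sourceName = $data['sourceName'];
        $document->chunkNumber = $data['chunkNumber'];

        $documents[] = $document;
    }

    return $documents;
}
```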

Then I have a SlidingWindowTransformer, which implements my custom RetrievedDocsTransformer interface. It has a transformDocs(VectorStoreBase $vectorStore, array $docs) method, and its constructor accepts a $windowSize argument: the number of extra chunks before and after each retrieved document that I want to fetch. Using that window size, I fetch a larger window of chunks around each retrieved document.

After fetching them, I filter out duplicate chunks. I also group the fetched documents by importance (the order in which they were originally retrieved) and chunk number, so the resulting order makes the most sense.
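A condensed sketch of that transformer. The interface and class names are the ones described above; the bodies are guesses, and it assumes Document exposes public sourceType, sourceName and chunkNumber properties:

```php
<?php

use LLPhant\Embeddings\Document;
use LLPhant\Embeddings\VectorStores\VectorStoreBase;

/** Custom interface: post-processes the documents returned by a similarity search. */
interface RetrievedDocsTransformer
{
    /**
     * @param  Document[] $docs retrieved documents, most relevant first
     * @return Document[]
     */
    public function transformDocs(VectorStoreBase $vectorStore, array $docs): array;
}

class SlidingWindowTransformer implements RetrievedDocsTransformer
{
    /** @param int $windowSize extra chunks to fetch before and after each hit */
    public function __construct(private readonly int $windowSize = 2)
    {
    }

    public function transformDocs(VectorStoreBase $vectorStore, array $docs): array
    {
        $result = [];
        $seen = [];

        // $docs are assumed to be ordered by relevance, so earlier hits keep priority.
        foreach ($docs as $doc) {
            $window = $vectorStore->fetchDocumentsByChunkRange(
                $doc->sourceType,
                $doc->sourceName,
                max(0, $doc->chunkNumber - $this->windowSize),
                $doc->chunkNumber + $this->windowSize
            );

            // Keep the window in chunk order and drop chunks already added
            // for a previous (more relevant) hit.
            usort($window, fn (Document $a, Document $b) => $a->chunkNumber <=> $b->chunkNumber);
            foreach ($window as $chunk) {
                $key = $chunk->sourceType . '|' . $chunk->sourceName . '|' . $chunk->chunkNumber;
                if (!isset($seen[$key])) {
                    $seen[$key] = true;
                    $result[] = $chunk;
                }
            }
        }

        return $result;
    }
}
```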

I also have a ChunkDeduplicationTransformer that checks overlaps between the chunks that are now in order (or, if used without SlidingWindowTransformer, it could first re-order the chunks as well) and cuts the overlaps, so the resulting context text is more logical when used with the overlapping functionality I created before.
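A possible shape for that second transformer, reusing the RetrievedDocsTransformer interface from the previous sketch. The overlap-trimming logic is just one way to do it:

```php
<?php

use LLPhant\Embeddings\Document;
use LLPhant\Embeddings\VectorStores\VectorStoreBase;

class ChunkDeduplicationTransformer implements RetrievedDocsTransformer
{
    public function transformDocs(VectorStoreBase $vectorStore, array $docs): array
    {
        $previous = null;
        foreach ($docs as $doc) {
            if ($previous !== null
                && $previous->sourceType === $doc->sourceType
                && $previous->sourceName === $doc->sourceName
                && $previous->chunkNumber === $doc->chunkNumber - 1
            ) {
                // Consecutive chunks of the same source: cut the part of this
                // chunk that repeats the end of the previous one (splitter overlap).
                $doc->content = self::removeOverlap($previous->content, $doc->content);
            }
            $previous = $doc;
        }

        return $docs;
    }

    /** Remove from $current the longest prefix that is also a suffix of $previous. */
    private static function removeOverlap(string $previous, string $current): string
    {
        $max = min(strlen($previous), strlen($current));
        for ($len = $max; $len > 0; $len--) {
            if (substr($previous, -$len) === substr($current, 0, $len)) {
                return substr($current, $len);
            }
        }

        return $current;
    }
}
```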

I basically run the SlidingWindowTransformer followed by the ChunkDeduplicationTransformer using a SequentialTransformer, whose constructor accepts multiple $transformers that are run in sequential order.
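And the glue, plus how the pieces could be wired together. Again a sketch under the same assumptions as above, not the actual implementation:

```php
<?php

use LLPhant\Embeddings\VectorStores\VectorStoreBase;

class SequentialTransformer implements RetrievedDocsTransformer
{
    /** @var RetrievedDocsTransformer[] */
    private readonly array $transformers;

    public function __construct(RetrievedDocsTransformer ...$transformers)
    {
        $this->transformers = $transformers;
    }

    public function transformDocs(VectorStoreBase $vectorStore, array $docs): array
    {
        foreach ($this->transformers as $transformer) {
            $docs = $transformer->transformDocs($vectorStore, $docs);
        }

        return $docs;
    }
}

// Usage: expand the top hits to a window of 2 chunks on each side, then cut
// the splitter overlaps between consecutive chunks. $vectorStore and
// $retrievedDocs are assumed to come from the existing retrieval step.
$transformer = new SequentialTransformer(
    new SlidingWindowTransformer(windowSize: 2),
    new ChunkDeduplicationTransformer(),
);
$docs = $transformer->transformDocs($vectorStore, $retrievedDocs);
```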

For my purposes, this works great. But I don't know if this approach is the best one for everyone, or whether it's equally easy to find other chunks of the same document with other vector stores. I've only been working with RAG for a very short while, so it's new territory for me.

I'm not 100% happy about the API I have created for myself. But for the project I'm using this for, my approach works well. There's a commercial competitor that I'm comparing against and my results are consistently way better after all these modifications.

f-lombardo commented 1 month ago

@synio-wesley thank you for the clarification. I'm still confused about how to obtain this result, but I think we should create a solution that is vector store agnostic. @MaximeThoonsen @synio-wesley what is your opinion?

synio-wesley commented 1 month ago

Of course a vector store agnostic solution is great. But in any case we will need to implement a new method for all vector stores so we can fetch related docs, right? Depending on how we store them, it could be simpler or harder with the different vector stores. For Redis I didn't need any adjustment to how the docs are saved, at least not for SlidingWindowTransformer. But other types of small-to-big retrieval might be different, and other stores maybe as well. Maybe we could discuss the feature a little bit somewhere? I'm on holiday for 2 weeks now, though, so I don't have a lot of time.

MaximeThoonsen commented 1 month ago

The SlidingWindowTransformer makes sense only if you have very big documents, right? Or am I missing something? @synio-wesley

f-lombardo commented 1 month ago

> in any case we will need to implement a new method for all vector stores so we can fetch related docs

Yes, of course this is one option. Another one is to have a DocumentStore that differs from the VectorStore, as in LangChain: https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/#retrieving-full-documents

I'm not sure which one could be the best solution.

f-lombardo commented 1 month ago

> The SlidingWindowTransformer makes sense only if you have very big documents, right? Or am I missing something? @synio-wesley

I think so, even if the concept of "big" may differ a lot depending on the use case.

synio-wesley commented 1 month ago

> The SlidingWindowTransformer makes sense only if you have very big documents, right? Or am I missing something? @synio-wesley

> I think so, even if the concept of "big" may differ a lot depending on the use case.

I am not a RAG expert, but as far as I understand, if you make smaller blocks/chunks of text (a few sentences), then the vector that gets calculated for a chunk makes more sense, because the chance of multiple different concepts being inside one chunk gets smaller.

But if we then only gave this small chunk to the LLM as context, it would be too small and not contain enough information. That's why some chunks before it get prepended and some chunks after it get appended.

You could also grab the whole parent document (all chunks), but that will be a lot of content, especially if you retrieved multiple candidates from different documents and want the parent documents for all of them. For some queries, chunks from multiple documents might be needed to answer correctly. Then you would end up with multiple full documents (in my case scraped webpages) in the context, which might become quite large and rather expensive as well (a concern in my application).

That's why for my application the SlidingWindowTransformer approach seems to work well. The calculated vectors are more aligned with the content of the smaller chunks. And then I make the chunk bigger by retrieving extra chunks around the 3 best chunk candidates. And that result is given to the LLM as context to work with.

synio-wesley commented 1 month ago

> in any case we will need to implement a new method for all vector stores so we can fetch related docs

> Yes, of course this is one option. Another one is to have a DocumentStore that differs from the VectorStore, as in LangChain: https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/#retrieving-full-documents
>
> I'm not sure which one could be the best solution.

I only glanced at the page you linked to, but it looks like LangChain allows 2 different stores: one for retrieving the child docs/chunks and one for grabbing the parent docs. This might make a lot of sense because different stores might be optimized for different things. A good vector retrieval store might be different from another type of DB that is good at fetching parent docs based on the ID of a child doc.

In my current implementation everything is a bit simplified and tailored to my own use case, but I like the idea of allowing the use of 2 different stores for these 2 different functionalities. And I guess you could still opt to use the same store for both; at least the underlying DB could be the same, as long as it supports both functionalities (like Redis would).
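To make the two-store idea concrete, here is a rough, vector store agnostic sketch along the lines of LangChain's ParentDocumentRetriever. DocumentStoreInterface, the metadata['parent_id'] link and the exact similaritySearch() signature are all assumptions made for the example, not existing LLPhant APIs:

```php
<?php

use LLPhant\Embeddings\Document;
use LLPhant\Embeddings\VectorStores\VectorStoreBase;

/** Hypothetical store for the large parent chunks, keyed by a parent id. */
interface DocumentStoreInterface
{
    public function addParentDocument(string $parentId, Document $document): void;

    public function getParentDocument(string $parentId): ?Document;
}

/**
 * Small-to-big retrieval: search on the small child chunks, then swap in the
 * larger parent chunks before building the LLM context.
 */
class ParentDocumentRetriever
{
    public function __construct(
        private readonly VectorStoreBase $childVectorStore,
        private readonly DocumentStoreInterface $parentDocumentStore,
    ) {
    }

    /**
     * @param  float[] $queryEmbedding
     * @return Document[]
     */
    public function retrieve(array $queryEmbedding, int $k = 4): array
    {
        // 1. Similarity search on the small, precisely embedded child chunks.
        $children = $this->childVectorStore->similaritySearch($queryEmbedding, $k);

        // 2. Replace each child by its (deduplicated) parent chunk.
        $parents = [];
        foreach ($children as $child) {
            // Assumes the splitter stored the parent id in the child's metadata.
            $parentId = $child->metadata['parent_id'] ?? null;
            if ($parentId === null || isset($parents[$parentId])) {
                continue;
            }
            $parent = $this->parentDocumentStore->getParentDocument($parentId);
            if ($parent !== null) {
                $parents[$parentId] = $parent;
            }
        }

        return array_values($parents);
    }
}
```

With something like this, the same Redis instance could back both stores, or the parent docs could live in a plain relational table; the retriever wouldn't care.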