run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.74k stars 5.27k forks

[Feature Request]: Support for having multiple Vector Indexes in a single Vector Store #16945

Open vaaale opened 1 day ago

vaaale commented 1 day ago

Feature Description

Currently, it is impossible to have more than one VectorIndex in a given VectorStore, at least with any VectorStore other than SimpleVectorStore. This is a huge drawback in my opinion. For reference, have a look at the following article: Complex Query Resolution through LlamaIndex Utilizing Recursive Retrieval, Document Agents, and Sub Question Query Decomposition.

The point of the article is what is depicted in its figure. The IndexNodes shown in the left part of the figure are placed in some index, a Vector Index. When the user submits a query, the retriever fetches a set of IndexNodes from the Vector Store and traverses those nodes, ending up with a set of agents that are invoked to satisfy the user's query. This intrinsic mechanism for automatically traversing a hierarchy is, in my opinion, one of the most attractive features of LlamaIndex. The problem is that it's not fully implemented.

Imagine you want to create an architecture similar to the one in that figure, except that you have an arbitrary number of levels before reaching the leaf node with the Agent, QueryEngine, Retriever, or just some other Index. It could look something like this: [figure: Tree Of Life]

This architecture is highly relevant in an enterprise environment where you would want to index large numbers of documents, categorized in an arbitrary hierarchy. (And obviously, persisting to JSON files on disk is not an option!) There are of course many other situations where you would want an arbitrary number of indices, such as security, segregating user data, and many more.

If you were to implement this in LlamaIndex, you have to create an arbitrary number of VectorStores, for example AzureAISearchVectorStore, an equal number of StorageContexts, and then keep track of all of these throughout the application.

As far as I can tell, there are a few options to solve this. You could choose to use a VectorStore that supports metadata filtering. This would probably work, but it would be super fragile, especially if used to implement security or user segregation.

Another solution would be to fix the actual problem, which is the underlying storage architecture in LlamaIndex. This would require quite a bit of work but would without a doubt be the best option. It would also fix several other issues with respect to consistency, and would probably do away with the need for several different storages (DocStore, IndexStore, VectorStore, GraphStore, etc.).

The third option, which is very easy to implement, is to propagate the index_id down to the VectorStore when creating, retrieving, updating, and deleting nodes. This fix would also require very little effort from the maintainers of the dozens of *Store implementations.

In the case of VectorIndex and VectorIndexStore, you would have to add the index_id to the VectorStoreQuery object. (You could then also get rid of the list of node_ids in the VectorStoreQuery, which is really not used for anything as far as I can tell.) Even better would be to pass it in through a separate parameter to the respective methods. The various VectorStoreRetrievers would also have to be changed, but the changes are minimal.
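
A minimal sketch of this third option, written in plain Python rather than against the real LlamaIndex classes. `ToyQuery`, `ToyVectorStore`, and all of their method names are hypothetical stand-ins invented for illustration; only the idea of carrying an `index_id` on the query comes from the proposal above.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyQuery:
    """Hypothetical stand-in for a vector store query that carries an index_id."""
    embedding: list[float]
    index_id: str
    top_k: int = 2


@dataclass
class ToyVectorStore:
    """Hypothetical in-memory store holding nodes for many indexes at once."""
    # Each row is (index_id, node_id, embedding).
    rows: list[tuple[str, str, list[float]]] = field(default_factory=list)

    def add(self, index_id: str, node_id: str, embedding: list[float]) -> None:
        self.rows.append((index_id, node_id, embedding))

    def delete(self, index_id: str, node_id: str) -> None:
        self.rows = [r for r in self.rows if (r[0], r[1]) != (index_id, node_id)]

    def query(self, q: ToyQuery) -> list[str]:
        # Only rows written under the same index_id are candidates.
        candidates = [r for r in self.rows if r[0] == q.index_id]
        ranked = sorted(candidates, key=lambda r: cosine(q.embedding, r[2]), reverse=True)
        return [node_id for _, node_id, _ in ranked[: q.top_k]]


store = ToyVectorStore()
store.add("patients", "p1", [1.0, 0.0])
store.add("patients", "p2", [0.8, 0.2])
store.add("procedures", "h1", [1.0, 0.0])

# Same embedding, different index_id: each query sees only its own index.
patient_hits = store.query(ToyQuery(embedding=[1.0, 0.0], index_id="patients"))
procedure_hits = store.query(ToyQuery(embedding=[1.0, 0.0], index_id="procedures"))
```

In this sketch the `index_id` travels with every create, query, and delete call, so two logical indexes can share one physical store without seeing each other's nodes.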

I have implemented this feature to test the feasibility of it, and it seems to work as expected.

I hope these changes can be implemented. Without them, LlamaIndex is really not an option for an enterprise environment, which would sadden me greatly, as LlamaIndex is otherwise a fantastic framework!

One final aspect of this: the functionality would also pave the way to proper hierarchical retriever traversal (using the object_map). I implemented a prototype for this as well, if someone is interested.

Some related issues: 14238 14943 14238 16441

Value of Feature

Very large for any project that is more than some toy project.

logan-markewich commented 1 day ago

@vaaale curious why you think metadata filtering is fragile? Seems a little unfounded

In reality, metadata filtering is the way that most vector providers suggest to add multi tenancy (and this issue is very similar to multi tenancy) -- https://qdrant.tech/articles/multitenancy/

It's similar to a normal SQL db -- you aren't going to create a table per user, you will put everything on one table and filter on some value.

logan-markewich commented 1 day ago

Propagating the index_id is the same as metadata filtering no?
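
To make the comparison concrete, here is a small sketch in plain Python of index scoping expressed as an exact-match metadata filter. The `matches` and `filtered_search` helpers and the `index_id` metadata key are assumptions made for illustration; they are not LlamaIndex or vector-provider APIs.

```python
from typing import Any

# Every node lives in one shared collection; the index it belongs to is
# recorded as ordinary metadata (the "index_id" key is an assumed convention).
nodes: list[dict[str, Any]] = [
    {"id": "p1", "text": "patient record A", "metadata": {"index_id": "patients"}},
    {"id": "p2", "text": "patient record B", "metadata": {"index_id": "patients"}},
    {"id": "h1", "text": "surgery procedure", "metadata": {"index_id": "procedures"}},
]


def matches(node: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True when every filter key equals the node's metadata value."""
    return all(node["metadata"].get(k) == v for k, v in filters.items())


def filtered_search(filters: dict[str, Any]) -> list[str]:
    """Hypothetical search step: apply the filter and return matching ids.

    A real store would also rank the surviving nodes by similarity; ranking
    is omitted here to keep the focus on the filter itself.
    """
    return [n["id"] for n in nodes if matches(n, filters)]


patients_only = filtered_search({"index_id": "patients"})
procedures_only = filtered_search({"index_id": "procedures"})
everything = filtered_search({})  # no filter: every node is visible
```

In this sketch, a query issued without the filter sees every node in the collection, so the scoping only holds if the filter is applied on every query path.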

vaaale commented 8 hours ago

@logan-markewich, I'm not saying metadata filtering in general is fragile. If multi tenancy is what you are looking for, then metadata filtering is an excellent solution. If you want security in an enterprise environment, it must be rock-solid. However, this is not what my feature request is about. What I'm looking for is the ability to have more than one index in the same index store. Not necessarily just vector indexes, but any kind of index type. The way this is currently implemented in LlamaIndex is counterintuitive. For example, say you see the following code (pseudocode):

patient_index = VectorStore.from_documents(patient_records)
hospital_procedures_index = VectorStore.from_documents(hospital_procedures_docs)

If a developer sees the code above, natural intuition says that searching patient_index will return only patient records, and likewise for hospital procedures. But that is not what you get! Furthermore, and what makes this even worse, if you do this using SimpleVectorStore, say in the development phase, and then move to AzureAISearchVectorStore in production, the behavior will change. SimpleVectorStore will give you the behavior you would expect, but AzureAISearchVectorStore will not!

Another example, take the DocumentSummaryIndex. This index is well suited for handling a single document. You could have a workflow that looks something like this:

  1. The user submits a search query: "Give me the patient record for patient X"
  2. Search results are displayed and the user selects one
  3. The user interacts with the document (Q&A or whatever)

A way to implement this could be (pseudocode):

patients_index = VectorStore.create()

for each patient in patient_records:
   patient_summary = DocumentSummaryIndex.from_document(patient)
   patients_index.add(create_index_node_for(patient_summary))

patient_search = patients_index.as_chat_engine()

And again, this is not possible, as the document summary index relies on the vector store, and you can only have one! In fact, if you implement this, it will seem to work at the time of ingestion. But upon loading the index from storage and submitting the query, it will crash with an exception. (Unless this has been addressed. It's been a while since I tested this particular case.)

The above example is similar to what I tried to describe originally, except that in the feature request I described a structure that in theory could be much more complex. Something like this (pseudocode):


def ingest_file(file):
    # Build indexes for the file
    document_summary_index = DocumentSummaryIndex.from_document(file)
    index_node = create_index_node_for(document_summary_index)
    .....
    return [index_node, ......]

def ingest_recursively(current_folder):
    """Traverse the folder recursively and build a tree of indexes."""
    folder_index = VectorIndex.create()

    # Index every file directly inside this folder
    for each file in current_folder:
        index_nodes = ingest_file(file)
        folder_index.add_all(index_nodes)

    # Recurse into every sub-folder and link its index as a child node
    for each sub_folder in current_folder:
        sub_folder_index_node = ingest_recursively(sub_folder)
        folder_index.add(sub_folder_index_node)

    return create_index_node_for(folder_index)

root_folder = <some location>
root_index = ingest_recursively(root_folder)

There are of course several permutations that one could think of for the example above, but I hope this serves the purpose of illustrating what I'm after.
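
For readers who want something executable, the recursive traversal can be sketched in plain Python. This is only an illustration of the tree-building shape: `ToyIndexNode`, `ingest_file`, `ingest_recursively`, and `depth` are hypothetical names, and no real LlamaIndex index or vector store is created.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyIndexNode:
    """Hypothetical node that points at a file or at a nested folder index."""
    name: str
    children: list["ToyIndexNode"] = field(default_factory=list)


def ingest_file(file: Path) -> ToyIndexNode:
    # Stand-in for building a per-document index and wrapping it in a node.
    return ToyIndexNode(name=file.name)


def ingest_recursively(folder: Path) -> ToyIndexNode:
    """Build one node per folder whose children are its files and sub-folders."""
    folder_node = ToyIndexNode(name=folder.name)
    for entry in sorted(folder.iterdir()):
        if entry.is_dir():
            folder_node.children.append(ingest_recursively(entry))
        elif entry.is_file():
            folder_node.children.append(ingest_file(entry))
    return folder_node


def depth(node: ToyIndexNode) -> int:
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Build a tiny folder tree in a temporary directory and ingest it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "root"
    (root / "cardiology").mkdir(parents=True)
    (root / "cardiology" / "patient_a.txt").write_text("record A")
    (root / "readme.txt").write_text("top-level note")
    tree = ingest_recursively(root)
```

Each folder becomes one node whose children are the nodes for its files and sub-folders, which is the shape that would need one index per folder in a shared store.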

I will submit a PR where I have implemented the functionality to enable multiple vector indexes for a single vector store. In it, I have made some small changes to a couple of the base classes like BaseRetriever and BasePydanticVectorStore. I have also updated ChromaVectorStore, PGVectorStore and AzureAISearchVectorStore, as a proof of concept.

The way I have implemented it should be backwards compatible in most cases. (I think)

I'll post the link to the PR here once I get it submitted. It would be great if someone could take a look at it and provide some feedback.

PR #16972