run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Feature Request]: Support Multiple Embeddings per Node #10486

Open philipchung opened 5 months ago

philipchung commented 5 months ago

Feature Description

Store multiple embeddings for the same BaseNode and all of its derivatives (TextNode, Document, ImageNode, etc.).

Currently BaseNode has a field embedding which is type-enforced as Optional[List[float]] by Pydantic. This restricts each node to having only a single embedding.

I would like to store multiple embeddings for the same node (e.g. a dense embedding, a sparse embedding, etc.). These embeddings may have different dimensionality or format (e.g. sparse embeddings are often represented as a key-value mapping instead of an array). This could be implemented by allowing the embedding field on BaseNode to accept a dictionary of embeddings in addition to the current single embedding (e.g. Optional[List[float] | Dict[str, Any]]).

The proposal is to accomplish something similar to the API provided by the Qdrant vector store, where multiple embedding vectors can be associated with a single payload (analogous to a llama_index node).

Reference from Qdrant:
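For illustration, Qdrant's named-vectors API associates several vectors, each under its own name, with a single point. A minimal sketch (collection name, vector names, and dimensions here are hypothetical):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")

# One collection, several named vector spaces per point.
client.create_collection(
    collection_name="multi_embedding_demo",
    vectors_config={
        "dense": VectorParams(size=4, distance=Distance.COSINE),
        "image": VectorParams(size=3, distance=Distance.DOT),
    },
)

# A single point (the analogue of a llama_index node) carries both vectors.
client.upsert(
    collection_name="multi_embedding_demo",
    points=[
        PointStruct(
            id=1,
            vector={"dense": [0.1, 0.2, 0.3, 0.4], "image": [0.5, 0.6, 0.7]},
            payload={"text": "example node content"},
        )
    ],
)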

The alternative to modifying the base node classes would be to make it easy to add custom node types to llama-index. This is possible today, but not straightforward: many llama-index components (node parsers, splitters, ingestion cache, ingestion pipeline, etc.) assume the default node types and validate them with Pydantic, so introducing a custom node type currently requires subclassing most of those components.

Reason

I am unsure if changing the field type for BaseNode.embedding will have unintended effects across the rest of the llama-index components. Many components may assume that the embedding field contains a single embedding instead of a dictionary of multiple embeddings.

Value of Feature

Avoid requiring users to heavily customize llama-index in order to support multi-embedding applications (e.g. hybrid search). Currently the Qdrant vector store enables hybrid mode through a special workaround that generates sparse embeddings inside the vector store implementation itself. This assumes two embedding models will be used (one dense, one sparse). With the new BAAI/BGE-M3 model, multiple embeddings can be generated from a single model call, which makes the current Qdrant hybrid implementation inefficient.
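For reference, a minimal sketch of a single BGE-M3 call returning both a dense and a sparse embedding, using the FlagEmbedding package (output keys as in its documentation):

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")
output = model.encode(
    ["What is hybrid search?"],
    return_dense=True,
    return_sparse=True,
)
dense_vec = output["dense_vecs"][0]        # dense vector (1024-dim for BGE-M3)
sparse_vec = output["lexical_weights"][0]  # token -> weight mapping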


logan-markewich commented 5 months ago

Hey @philipchung -- it's not a bad idea.

Changes to core node objects are always a huge pain though... making the field a union type means we have to type-check the attribute every time we want to access it.

Maybe a better idea is something like adding a new attribute? Although once it's added, it can never be removed (at least not without a ton of work).

dosubot[bot] commented 5 months ago

🤖

Thank you for your detailed feature request. I understand the need for supporting multiple embeddings per node in the LlamaIndex framework.

Modifying the embedding field on BaseNode to support a dictionary of multiple embeddings could indeed be a viable solution. However, as you've pointed out, this could have unintended effects across the rest of the LlamaIndex components. Many components may assume that the embedding field contains a single embedding rather than a dictionary of multiple embeddings.

Changing the embedding field from Optional[List[float]] to Optional[List[float] | Dict[str, Any]] could have several impacts on the LlamaIndex system:

  1. Type Checking and Validation: The type checking and validation would need to be updated to handle the new type. This could potentially introduce bugs if not handled correctly.

  2. Data Consistency: The data returned by the methods _get_query_embedding, _aget_query_embedding, _get_text_embedding, _aget_text_embedding, _get_text_embeddings, and _aget_text_embeddings would no longer be consistent. Some methods might return a list of floats, while others might return a dictionary. This could lead to confusion and potential errors when using these methods.

  3. Data Processing: Any code that processes the embedding field would need to be updated to handle both lists of floats and dictionaries. This could increase the complexity of the code and make it harder to maintain.

  4. API Compatibility: If the embedding field is exposed through an API, changing its type could break compatibility with existing clients that expect a list of floats.
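To make the type-checking concern (point 1 above) concrete, a hypothetical sketch of what every consumer of a union-typed field would have to do (the "default" key is an assumption, not an existing convention):

from typing import Any, Dict, List, Optional, Union

Embedding = Optional[Union[List[float], Dict[str, Any]]]

def as_dense_vector(embedding: Embedding) -> List[float]:
    # Every access site must branch on the runtime type before use.
    if embedding is None:
        raise ValueError("node has no embedding")
    if isinstance(embedding, dict):
        return embedding["default"]  # hypothetical key for the dense vector
    return embedding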

As for adding custom node types into LlamaIndex to support multi-embedding applications, the system already supports different types of nodes, such as TextNode and ImageNode. To add a new custom node type, you would likely need to define a new class that inherits from BaseNode or another appropriate superclass, and then adjust the methods in MultiModalVectorStoreIndex and other relevant classes to handle this new node type. However, without more information about the structure and requirements of these custom node types, it's not possible to provide a concrete code example.
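A bare-bones sketch of the subclassing pattern, though, could look like this (hypothetical class and field names; the import path assumes llama-index >= 0.10):

from typing import Any, Dict

from llama_index.core.schema import TextNode

class MultiEmbeddingNode(TextNode):
    # Hypothetical extra field holding named embeddings alongside the
    # inherited single `embedding` field.
    embeddings: Dict[str, Any] = {}

node = MultiEmbeddingNode(
    text="example content",
    embeddings={"dense": [0.1, 0.2], "sparse": {"7": 0.9}},
)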

In conclusion, while it might be possible to modify the embedding field on BaseNode to support a dictionary of multiple embeddings and to add custom node types into LlamaIndex to support multi-embedding applications, there are several considerations to keep in mind. I would recommend discussing this feature request further with the LlamaIndex team to determine the best approach.


Panoplos commented 3 months ago

I second this request. The reality is the current implementation doesn't support multimodal embeddings either. In our case, we have one text-based embedding and multiple image embeddings per OpenSearch document (node). There is currently no way I can implement this using your library as-is. I would have to create an entirely new OpenSearch implementation... yet OpenSearch supports this query structure natively.

logan-markewich commented 3 months ago

@Panoplos the current approach is to embed the text and embed the image in separate collections/namespaces/indexes

https://docs.llamaindex.ai/en/stable/examples/multi_modal/gpt4v_multi_modal_retrieval/?h=multimodal

OpenSearch should technically support that just fine.
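Condensed from the linked example, the pattern looks roughly like this (Qdrant shown as in that notebook; an OpenSearch store could be substituted, assuming llama-index >= 0.10 import paths):

import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(path="qdrant_mm_db")

# Separate collections: one for text embeddings, one for image embeddings.
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")

storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

documents = SimpleDirectoryReader("./mixed_media/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Retrieval embeds the query twice and searches each collection separately.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)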

Panoplos commented 3 months ago

Understood, but does LI's OpenSearch client support multiple image embeddings per document (in array format), or do I need to flatten the array across multiple documents?

Panoplos commented 3 months ago

Because, if it's the latter, it will make top_k handling cumbersome...

logan-markewich commented 3 months ago

@Panoplos no, it'd have to be separate tables/collections/etc.

The query vector used has to match the embeddings (i.e. CLIP for text/images, OpenAI for pure text).

So the retrieval step already inherently needs two embedding and retrieval calls (which is handled in the above example).

PRs to improve this are welcome. I'd say probably half of our supported vector dbs have strict requirements that only allow a single embedding field per table/collection/index.

philipchung commented 1 month ago

What if the embedding field on BaseNode were changed to a property whose getter/setter stores the dense embedding vector used in typical RAG applications under the key "default" in a new field embeddings of type dict[str, Any]?

embeddings: dict[str, Any] = {
  "default": [...],
  "custom_dense_embedding1": [...],
  "custom_sparse_embedding2": [...],
}

To allow selective retrieval of embeddings, the get_embedding method on BaseNode would need to be modified to accept an argument that lets the user specify the embeddings key, falling back to "default" if no key is supplied.
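In sketch form, the shim might look like this (hypothetical code; the real BaseNode is a Pydantic model, so the property wiring would have to go through Pydantic's field machinery):

from typing import Any, Dict, List, Optional

class BaseNode:  # simplified stand-in for the real Pydantic model
    def __init__(self) -> None:
        # All embeddings live in one dict; "default" holds the usual dense vector.
        self.embeddings: Dict[str, Any] = {}

    @property
    def embedding(self) -> Optional[List[float]]:
        # Backward-compatible accessor for the single dense embedding.
        return self.embeddings.get("default")

    @embedding.setter
    def embedding(self, value: Optional[List[float]]) -> None:
        self.embeddings["default"] = value

    def get_embedding(self, key: str = "default") -> Any:
        if key not in self.embeddings:
            raise ValueError(f"Node has no embedding under key {key!r}")
        return self.embeddings[key]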

Would this allow for compatibility with existing LlamaIndex components and vector stores?