run-llama / LlamaIndexTS

LlamaIndex in TypeScript
https://ts.llamaindex.ai
MIT License

Aligning LlamaIndex Metadata Structure with Underlying Database Capabilities to Support Arrays of Objects #659

Status: Open. TYRONEMICHAEL opened this issue 5 months ago

TYRONEMICHAEL commented 5 months ago

Description:

Issue Summary:

We use LlamaIndex as an interface to several vector database implementations, including ChromaDb. ChromaDb supports a flexible metadata structure that allows arrays of objects, which enables rich and complex metadata associations. LlamaIndex's metadata handling is more limited: the current Record<string, any> type definition for metadata restricts us to a single flat key-value structure. That does not fully leverage the underlying databases' capabilities, in particular ChromaDb's ability to handle arrays of objects within metadata.

ChromaDb's Metadata Capabilities:

ChromaDb allows for a diverse range of metadata structures, as demonstrated by the following usage pattern:

await collection.upsert({
  ids: ["id1", "id2", "id3"],
  embeddings: [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
  metadatas: [
    { "chapter": "3", "verse": "16" },
    { "chapter": "3", "verse": "5" },
    { "chapter": "29", "verse": "11" }
  ],
  documents: ["doc1", "doc2", "doc3"]
});

This flexibility in metadata structure allows users to associate multiple related attributes with a single document, enhancing the expressiveness and utility of the metadata.

Proposed Enhancement for LlamaIndex:

To bridge this gap and align LlamaIndex more closely with the capabilities of ChromaDb and potentially other databases, I propose we consider extending the metadata type definition in LlamaIndex to Record<string, any>[]. This adjustment would permit an array of metadata objects, each maintaining a flat structure, thereby respecting the underlying databases' constraints while offering enhanced flexibility and expressiveness in metadata definition.
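As a rough sketch of what the proposed shape could look like in TypeScript, the snippet below models an array of flat metadata objects and a runtime guard for it. The names FlatValue, FlatMetadata, MetadataList, isFlatMetadata and isMetadataList are hypothetical and are not part of LlamaIndexTS; restricting values to strings, numbers and booleans is an assumption about what "flat" should mean here.

```typescript
// Hypothetical names for illustration only; none of these are LlamaIndexTS exports.
// Assumption: "flat" means every value is a string, number, or boolean.
type FlatValue = string | number | boolean;
type FlatMetadata = Record<string, FlatValue>;
type MetadataList = FlatMetadata[];

// Returns true when `value` is a plain object whose values are all primitives.
function isFlatMetadata(value: unknown): value is FlatMetadata {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const record = value as Record<string, unknown>;
  return Object.keys(record).every((key) => {
    const v = record[key];
    return (
      typeof v === "string" || typeof v === "number" || typeof v === "boolean"
    );
  });
}

// Returns true when `value` is an array in which every entry is flat metadata.
function isMetadataList(value: unknown): value is MetadataList {
  return Array.isArray(value) && value.every(isFlatMetadata);
}

// Example values reusing the keys from the ChromaDb snippet above.
const proposed: unknown = [
  { chapter: "3", verse: "16" },
  { chapter: "3", verse: "5" },
];
const nested: unknown = [{ chapter: { number: 3 } }];

const proposedIsValid = isMetadataList(proposed);
const nestedIsValid = isMetadataList(nested);
```

A guard like this would let a vector store integration reject nested values before they reach a database that only accepts flat metadata.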

Potential Benefits:

Seeking Input and Suggestions:

I am keen to hear the community's thoughts on this proposal, any potential challenges it might pose, and how it might be implemented effectively. Suggestions for alternative approaches that could resolve the issue are also highly welcome.

marcusschiesser commented 5 months ago

Thanks for your suggestion @TYRONEMICHAEL.

I think for a change like that, we need to consider at least the following:

  1. Can the data generated by the TS version be used with the Python version of LlamaIndex?
  2. Can the data generated by the Python version be used by the TS version?
  3. Does it avoid breaking existing usage?
  4. Do other vector DBs benefit from the change?

About 1. and 2.: I just took a look at the Python code. It is also using only the first entry of the metadatas array (see https://github.com/run-llama/llama_index/blob/337936b013843fbc7aece81117140106803715ef/llama-index-integrations/vector_stores/llama-index-vector-stores-chroma/llama_index/vector_stores/chroma/base.py#L336), so we have to take that into account.

About 3.: Instead of Record<string, any>[], we could probably use Record<string, any>[] | Record<string, any>, so that code passing a single metadata object keeps working.
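A minimal sketch of how that union could be handled, assuming a small normalization step at the boundary. The names Metadata, MetadataInput, toMetadataArray and primaryMetadata are hypothetical and do not exist in LlamaIndexTS; primaryMetadata only illustrates a consumer that reads the first entry and is not taken from either code base.

```typescript
// Hypothetical names for illustration only; not LlamaIndexTS APIs.
type Metadata = Record<string, any>;
type MetadataInput = Metadata | Metadata[];

// Normalizes either accepted shape to an array, so downstream code
// only has to deal with one representation.
function toMetadataArray(input: MetadataInput): Metadata[] {
  return Array.isArray(input) ? input : [input];
}

// Illustrates a reader that only looks at the first entry.
// Assumption: an empty array is treated as "no metadata".
function primaryMetadata(input: MetadataInput): Metadata {
  return toMetadataArray(input)[0] ?? {};
}

// Existing call sites that pass a single object keep working.
const legacy = toMetadataArray({ chapter: "3", verse: "16" });

// New call sites can pass several flat objects.
const extended = toMetadataArray([
  { chapter: "3", verse: "16" },
  { chapter: "29", verse: "11" },
]);

const first = primaryMetadata(extended);
```

Normalizing once at the boundary keeps the union out of the rest of the code, at the cost of deciding explicitly what a reader that expects a single object should do with the additional entries.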

Regarding 4. it would be great to hear the thoughts of users of other Vector DBs.