run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Indexing a nested dictionary #15967

Open jjoaqu7 opened 1 month ago

jjoaqu7 commented 1 month ago

Question Validation

Question

Hi, is it possible using LlamaIndex to do the following:

@dosu

dosubot[bot] commented 1 month ago

Yes. LlamaIndex can build an organized index over a nested dictionary in which the top-level keys are articles and each nested dictionary maps parameters/techniques to a list of detail strings, enabling quick and efficient retrieval of the relevant parameters/techniques from user input.

The provided code includes functions and classes that can be adapted to achieve this. Specifically, the LocalIndexState class and the get_retrieval_tool_for_docset function can be used to manage and retrieve information from such an index.

Here is an example of how you might structure and use these components:

from typing import Dict, List, Optional
from dataclasses import dataclass
from llama_index.core import Document
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.tools import BaseTool, ToolMetadata, QueryEngineTool
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.packs.docugami_kg_rag.helpers.vector_store import get_vector_store
from llama_index.packs.docugami_kg_rag.helpers.fused_summary_retriever import FusedSummaryRetriever
from llama_index.packs.docugami_kg_rag.helpers.reports import ReportDetails  # used by LocalIndexState below

@dataclass
class LocalIndexState:
    full_doc_summaries_by_id: Dict[str, Document]
    chunks_by_id: Dict[str, Document]
    retrieval_tool_function_name: str
    retrieval_tool_description: str
    reports: List[ReportDetails]

def get_retrieval_tool_for_docset(
    docset_id: str, docset_state: LocalIndexState
) -> Optional[BaseTool]:
    chunk_vectorstore = get_vector_store(docset_id)

    if not chunk_vectorstore:
        return None

    retriever = FusedSummaryRetriever(
        vectorstore=chunk_vectorstore,
        parent_doc_store=docset_state.chunks_by_id,
        full_doc_summary_store=docset_state.full_doc_summaries_by_id,
        search_type=VectorStoreQueryMode.MMR,
    )

    query_engine = RetrieverQueryEngine(retriever=retriever)

    return QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name=docset_state.retrieval_tool_function_name,
            description=docset_state.retrieval_tool_description,
        ),
    )

# Example usage
nested_dict = {
    "Article1": {
        "Parameter1": ["Detail1", "Detail2"],
        "Parameter2": ["Detail3", "Detail4"]
    },
    "Article2": {
        "Technique1": ["Detail5", "Detail6"],
        "Technique2": ["Detail7", "Detail8"]
    }
}

# Convert nested dictionary to Documents and create LocalIndexState
documents = {key: Document(text=str(value)) for key, value in nested_dict.items()}
index_state = LocalIndexState(
    full_doc_summaries_by_id=documents,
    chunks_by_id=documents,
    retrieval_tool_function_name="search_articles",
    retrieval_tool_description="Searches for and returns relevant parameters/techniques from articles.",
    reports=[]
)

# Retrieve tool for a specific docset
retrieval_tool = get_retrieval_tool_for_docset("docset_id", index_state)

This code sets up a LocalIndexState with documents created from the nested dictionary and retrieves a tool for querying the index [1].
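As an aside, serializing each article's entire sub-dictionary with `str(value)` produces one coarse Document per article. A minimal, framework-free sketch of an alternative (all names here are illustrative, not part of the LlamaIndex API) is to flatten the dictionary into one record per parameter/technique, keeping the article as metadata, so retrieval can return the specific entry rather than the whole article:

```python
def flatten_nested_dict(nested):
    """Turn {article: {param: [details]}} into one flat record per parameter."""
    records = []
    for article, params in nested.items():
        for param, details in params.items():
            records.append({
                "text": f"{param}: {'; '.join(details)}",
                "metadata": {"article": article, "parameter": param},
            })
    return records

nested_dict = {
    "Article1": {"Parameter1": ["Detail1", "Detail2"]},
    "Article2": {"Technique1": ["Detail5", "Detail6"]},
}

records = flatten_nested_dict(nested_dict)
print(records[0]["text"])      # Parameter1: Detail1; Detail2
print(records[1]["metadata"])  # {'article': 'Article2', 'parameter': 'Technique1'}
```

Each record could then be wrapped as `Document(text=rec["text"], metadata=rec["metadata"])` before indexing.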

To continue talking to Dosu, mention @dosu.

jjoaqu7 commented 1 month ago

Do you recommend me using a vector store here or knowledge graph store?

@dosu

dosubot[bot] commented 1 month ago

I recommend a vector store for indexing a nested dictionary of articles and their parameters/techniques. Vector stores are efficient at storing and querying embeddings, which fits this use case: each parameter/technique entry can be embedded and then retrieved by semantic similarity to the user's question. Here is an example of how to use a vector store with LlamaIndex and Chroma:

import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

# Load some documents
documents = SimpleDirectoryReader("./data").load_data()

# Initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# Create collection
chroma_collection = db.get_or_create_collection("quickstart")

# Assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create your index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
# Create a query engine and query
query_engine = index.as_query_engine()
response = query_engine.query("What is the meaning of life?")
print(response)

This approach allows you to efficiently store and query embeddings, making it suitable for your use case [1][2].
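For intuition about what the vector store is doing under the hood, here is a dependency-free toy sketch: texts are mapped to vectors and ranked by cosine similarity to the query vector. The 3-d vectors below are fabricated purely for illustration; real ones come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two indexed texts.
store = {
    "Parameter1: Detail1; Detail2": [0.9, 0.1, 0.0],
    "Technique1: Detail5; Detail6": [0.1, 0.8, 0.3],
}

# Pretend embedding of the user's question.
query_vec = [0.85, 0.15, 0.05]

# Retrieval = pick the stored text whose vector is most similar to the query.
best = max(store, key=lambda text: cosine(store[text], query_vec))
print(best)  # Parameter1: Detail1; Detail2
```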

However, if you need to represent and query complex relationships between entities, a knowledge graph store might be more suitable. You can explore more about Knowledge Graph Indexing in the LlamaIndex documentation under the Knowledge Graph Index section [3][4].
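For intuition on when a knowledge graph store is the better fit, here is a small dependency-free sketch (relation names are made up) of the same nested dictionary expressed as triples, which is roughly the representation a knowledge graph index builds; relationship queries then become graph traversals rather than similarity search:

```python
nested_dict = {
    "Article1": {"Parameter1": ["Detail1", "Detail2"]},
    "Article2": {"Technique1": ["Detail5", "Detail6"]},
}

# Express the nesting as (subject, relation, object) triples.
triples = []
for article, params in nested_dict.items():
    for param, details in params.items():
        triples.append((article, "HAS_PARAMETER", param))
        for detail in details:
            triples.append((param, "HAS_DETAIL", detail))

# A relationship query is a simple traversal over the triples.
def details_of(param):
    return [o for s, r, o in triples if s == param and r == "HAS_DETAIL"]

print(details_of("Technique1"))  # ['Detail5', 'Detail6']
```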

To continue talking to Dosu, mention @dosu.