run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Handling Excel Data with KnowledgeGraphIndex from LlamaIndex #15780

Open LikhithRishi opened 1 week ago

LikhithRishi commented 1 week ago

Question

I'm currently using KnowledgeGraphIndex from the LlamaIndex library to build a knowledge graph. Previously, I worked with PDF documents and used the SimpleDirectoryReader to load data. Now, I'm transitioning to using an Excel file, which I read into a pandas DataFrame.

import pandas as pd

# Load data from Excel file
df = pd.read_excel('path_to_excel_file.xlsx')

The problem is that KnowledgeGraphIndex expects documents in a specific format, but I now need to provide it with data from a DataFrame. How can I convert this DataFrame into a format compatible with KnowledgeGraphIndex? Specifically, I need to know how to transform the DataFrame into a list of documents or a similar structure that KnowledgeGraphIndex can work with, so that it picks up the relationships expressed in the columns.

I am currently using the following code for creating the index with documents:

from llama_index import KnowledgeGraphIndex

# Assume 'documents' is a list of document texts or similar
index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=2,
    storage_context=storage_context,
)

Question:

How should I modify my DataFrame to fit this input format? Is there a recommended approach for converting DataFrame data into a format suitable for KnowledgeGraphIndex?

dosubot[bot] commented 1 week ago

To convert your pandas DataFrame into a format compatible with KnowledgeGraphIndex, transform each row into a Document object. KnowledgeGraphIndex.from_documents takes a list of Document objects and extracts triplets from their text, so each row's column values need to end up in the document text. Fields such as doc_id, embedding, extra_info, index, child_indices, ref_doc_id, and node_info appear when the library serializes its nodes; you do not need to set them yourself.

Here's a step-by-step approach to achieve this:

  1. Define a function to convert DataFrame rows to Document objects:
import pandas as pd
from llama_index import Document  # on llama-index >= 0.10: from llama_index.core import Document

def df_row_to_document(row):
    # Serialize the row as "column: value" pairs so the triplet extractor
    # can read the relationships between columns; skip empty cells
    text = ", ".join(
        f"{col}: {val}" for col, val in row.items() if pd.notna(val)
    )
    # Keep the original row label as metadata for traceability
    return Document(text=text, metadata={"row_index": str(row.name)})

# Load data from Excel file
df = pd.read_excel('path_to_excel_file.xlsx')

# Convert DataFrame rows to documents
documents = [df_row_to_document(row) for _, row in df.iterrows()]
  2. Create the KnowledgeGraphIndex using the converted documents:
# On llama-index >= 0.10, import these names from llama_index.core instead
from llama_index import KnowledgeGraphIndex, StorageContext

# Assuming you have a storage context configured
storage_context = StorageContext.from_defaults()

# Create the KnowledgeGraphIndex
index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=2,
    storage_context=storage_context,
)

This approach ensures that each row in your DataFrame is converted into a document format that KnowledgeGraphIndex can work with. You may need to adjust the df_row_to_document function to better fit your specific data and requirements.
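
If the relationships already sit in the columns, it may also help to derive explicit (subject, relation, object) triplets from the rows rather than relying only on extraction from text. The following is a minimal sketch, not the library's own method: it assumes one column names the subject and every other column name acts as the relation. The function name `records_to_triplets`, the `subject_col` parameter, and the sample data are hypothetical, and the input is a plain list of dicts shaped like the output of `df.to_dict(orient="records")`.

```python
import math
from typing import Any, Dict, List, Tuple

Triplet = Tuple[str, str, str]


def _is_missing(value: Any) -> bool:
    """True for None and float NaN, the usual markers for an empty Excel cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def records_to_triplets(
    records: List[Dict[str, Any]], subject_col: str
) -> List[Triplet]:
    """Build triplets where the column name is the relation and the cell is the object."""
    triplets: List[Triplet] = []
    for record in records:
        subject = record.get(subject_col)
        if _is_missing(subject):
            continue  # a row without a subject cannot anchor any relation
        for column, value in record.items():
            if column == subject_col or _is_missing(value):
                continue  # skip the subject column itself and empty cells
            triplets.append((str(subject), str(column), str(value)))
    return triplets


# Hypothetical rows, shaped like df.to_dict(orient="records")
records = [
    {"employee": "Alice", "works_in": "Finance", "reports_to": "Bob"},
    {"employee": "Bob", "works_in": "Finance", "reports_to": float("nan")},
]

triplets = records_to_triplets(records, subject_col="employee")
for triplet in triplets:
    print(triplet)
```

Each resulting tuple could then be added to an existing index, for example with `index.upsert_triplet(triplet)` if your installed version of KnowledgeGraphIndex provides that method. Other missing-value markers such as `pd.NA` or `NaT` are not handled by this sketch.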

For reference, the documentation includes the following example of a stored record:

{
  "doc_id": "06f6c175-2d00-4f5f-b95a-0c4b11e66daf",
  "embedding": null,
  "extra_info": null,
  "index": 11,
  "child_indices": [],
  "ref_doc_id": "42082e33-8e2a-412d-924e-5189ca79b445",
  "node_info": { "start": 21911, "end": 23907 }
}

[1]
