run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Graph Data Processing #14033

Open jjoaqu7 opened 4 weeks ago

jjoaqu7 commented 4 weeks ago

Question

I have gotten the following SCRIPT to work; however, I have a few questions about pre-processing and structuring my own data, which I would like to use in place of the Wikipedia page used in the script.

  1. Does structuring the data in a more organized manner before loading it into the graph improve the quality of returns? If so, what kind of structuring could be done on data like lecture content/notes for a particular engineering course?
  2. How would I structure this data efficiently and effectively, e.g., by extracting entities and relationships plus some categorization? In particular, for a graph store based on this lecture content/notes, what methods could I use, and does it have to be done manually?
  3. After I extract these things and categorize the data, how should I structure it: as a nested dictionary, JSON, XML, or something completely different? I already have a nested JSON structure of my lecture data.
  4. Does LlamaIndex offer anything that may help with any of the above or associated tasks? Do LlamaIndex + the NebulaGraph KG store require a particular format for loading structured data?

SCRIPT:

import os
import logging
import sys
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core import KnowledgeGraphIndex
from llama_index.core import StorageContext
from llama_index.graph_stores.nebula import NebulaGraphStore
from llama_index.core.prompts.base import PromptTemplate
from llama_index.core.response_synthesizers import TreeSummarize

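# Set the OpenAI API key (redacted in this post)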
os.environ['OPENAI_API_KEY'] = "sk-"

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
Settings.chunk_size = 512

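# Load the sample Wikipedia page used as source documents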
loader = WikipediaReader()
documents = loader.load_data(pages=["Guardians of the Galaxy Vol. 3"], auto_suggest=False)
print(documents[0] if documents else "No documents loaded")

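# NebulaGraph connection settings (values redacted)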
os.environ["NEBULA_USER"] = "#"
os.environ["NEBULA_PASSWORD"] = "#"
os.environ["NEBULA_ADDRESS"] = "#"

space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]

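# Connect to an existing NebulaGraph space with the given schema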
graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

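# Build the knowledge graph index, extracting up to 10 triplets per chunk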
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

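# Custom QA prompt used when synthesizing the final answer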
prompt_template = PromptTemplate("""
Given the context below, answer the question in a detailed manner, covering all key aspects mentioned in the question. Use the context to provide accurate and relevant information.

Context:
{context_str}

Question:
{query_str}

Please provide your answer in a detailed paragraph, including all relevant information from the context. For example, if asked about a character's background, include significant events, relationships, and traits that define them.

Answer:
""")

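# Graph RAG retriever: starts from entities found in the query and walks the graph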
graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    include_text=True,
    similarity_top_k=10,
    verbose=True,
    explore_global_knowledge=True,
)

tree_summarize = TreeSummarize(
    summary_template=prompt_template,
    llm=Settings.llm,
    verbose=True
)

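# Tie the retriever and response synthesizer together into a query engine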
query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever,
    response_synthesizer=tree_summarize
)

response1 = query_engine.query("""
    Who is Peter Quill, who plays him in the movie, and does Peter Quill love anyone or has he loved anyone?
""")
print("Response 1\n", response1)
dosubot[bot] commented 4 weeks ago

To address your questions about pre-processing and structuring your own data for use with Llama Index and Nebula Graph Store, here are the steps and recommendations:

  1. Structuring Data for Better Returns:

    • Yes, structuring your data in an organized manner can improve the quality of returns. For lecture content/notes, you can structure the data by extracting key entities (e.g., concepts, terms) and relationships (e.g., dependencies, hierarchies) and categorizing them appropriately.
  2. Efficient Data Structuring:

    • Entity and Relationship Extraction: You can use NLP techniques to extract entities and relationships; tools like spaCy or NLTK can help automate this process (see the sketch after this list).
    • Categorization: Categorize the extracted entities and relationships into relevant categories (e.g., topics, subtopics).
    • Manual vs. Automated: While manual extraction ensures accuracy, automated methods save time. A hybrid approach is often best: use automated tools for the initial extraction, then manually refine the results.
  3. Data Format for Structuring:

    • Nested JSON: Since you already have a nested JSON structure, you can continue using it. JSON is a flexible and widely supported format for structured data.
    • Example JSON Structure:
      {
        "course": "Engineering Course",
        "lectures": [
          {
            "title": "Lecture 1",
            "content": "Introduction to Thermodynamics",
            "entities": [
              {"name": "Thermodynamics", "type": "Concept"},
              {"name": "First Law", "type": "Law"}
            ],
            "relationships": [
              {"source": "Thermodynamics", "target": "First Law", "type": "Defines"}
            ]
          }
        ]
      }
  4. Llama Index and Nebula Graph Store Integration:

    • Data Loading Format: Llama Index and Nebula Graph Store do not require a specific format for loading structured data, but JSON is commonly used.
    • Llama Index Capabilities: Llama Index offers tools to create and manage indices, including KnowledgeGraphIndex, which can be used to structure and query your data efficiently.
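For point 2, here is a minimal sketch of automated entity/relationship extraction with spaCy (an illustration, not a LlamaIndex API; it assumes spaCy and the en_core_web_sm model are installed, and the subject/object heuristic is deliberately naive):

import spacy

# Load a small English pipeline
# (install first: pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "The first law of thermodynamics states that energy is conserved."
doc = nlp(text)

# Candidate graph nodes: named entities plus noun chunks (lecture concepts)
entities = [(ent.text, ent.label_) for ent in doc.ents]
concepts = [chunk.text for chunk in doc.noun_chunks]

# Naive candidate relationships: the root verb linking its subject to its objects
relationships = []
for token in doc:
    if token.dep_ == "ROOT":
        subjects = [w.text for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        objects = [w.text for w in token.rights if w.dep_ in ("dobj", "attr", "ccomp")]
        relationships += [
            {"source": s, "target": o, "type": token.lemma_}
            for s in subjects for o in objects
        ]

print(entities, concepts, relationships)

The output slots directly into the nested JSON structure from point 3; in practice you would review and refine these candidates manually.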

Here is an example script to replace Wikipedia data with your lecture content/notes and load it into Nebula Graph Store:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.graph_stores.nebula import NebulaPropertyGraphStore
from llama_index.core.vector_stores.simple import SimpleVectorStore
from llama_index.core.indices.property_graph import PropertyGraphIndex

# Configure LLM and embedding models
Settings.llm = OpenAI(model="gpt-4o", temperature=0.3)
Settings.embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large")

# Load your lecture content
documents = SimpleDirectoryReader("./data/your_lecture_notes/").load_data()

# Initialize the NebulaGraph property graph store
# (PropertyGraphIndex expects a property graph store such as NebulaPropertyGraphStore,
# not the older NebulaGraphStore; overwrite=True recreates the space schema)
graph_store = NebulaPropertyGraphStore(
    space="llamaindex_nebula_property_graph",
    overwrite=True,
)

# Initialize Vector Store
vec_store = SimpleVectorStore()

# Create Property Graph Index from documents
index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    vector_store=vec_store,
    show_progress=True,
)

# Persist the vector store
index.storage_context.vector_store.persist("./vec_store.json")

# Example query
query = "Explain the first law of thermodynamics."
retrieved = index.as_retriever().retrieve(query)
answer = index.as_query_engine().query(query)
print(retrieved, answer)

This script demonstrates how to load your structured lecture content into Nebula Graph Store and query it using Llama Index [1][2][3]. Adjust the paths and parameters as needed for your specific use case.
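If you later want to reconnect to the same graph without re-indexing, something along these lines should work (a sketch using PropertyGraphIndex.from_existing and SimpleVectorStore.from_persist_path, with the space name and persist path from the script above):

from llama_index.core.indices.property_graph import PropertyGraphIndex
from llama_index.core.vector_stores.simple import SimpleVectorStore
from llama_index.graph_stores.nebula import NebulaPropertyGraphStore

# Reattach to the existing NebulaGraph space and the persisted vector store
graph_store = NebulaPropertyGraphStore(space="llamaindex_nebula_property_graph")
vec_store = SimpleVectorStore.from_persist_path("./vec_store.json")

index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vec_store,
)

print(index.as_query_engine().query("Explain the first law of thermodynamics."))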

To continue talking to Dosu, mention @dosu.

jjoaqu7 commented 4 weeks ago

Can I load JSON files into the LlamaIndex documents loader? Can you modify the script you provided to load a series of JSON files from a directory called "JSONs"? @dosubot

dosubot[bot] commented 4 weeks ago

Certainly! Here is the modified script to load a series of JSON files from a directory called "JSONs" into the Llama Index documents loader:

import os
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.graph_stores.nebula import NebulaPropertyGraphStore
from llama_index.core.vector_stores.simple import SimpleVectorStore
from llama_index.core.indices.property_graph import PropertyGraphIndex
from llama_index.readers.json import JSONReader

# Configure LLM and embedding models
Settings.llm = OpenAI(model="gpt-4o", temperature=0.3)

# Directory containing JSON files
json_directory = "JSONs"

# Initialize the JSONReader; its levels_back parameter controls how nested
# keys are flattened into the document text (left at the default here)
reader = JSONReader()

# Load all documents from the JSON files in the directory
documents = []
for filename in os.listdir(json_directory):
    if filename.endswith(".json"):
        file_path = os.path.join(json_directory, filename)
        documents.extend(reader.load_data(file_path))

# Initialize the NebulaGraph property graph store
# (as above, PropertyGraphIndex expects a property graph store, so
# NebulaPropertyGraphStore replaces the older NebulaGraphStore here)
graph_store = NebulaPropertyGraphStore(
    space="llamaindex_nebula_property_graph",
    overwrite=True,
)

# Initialize Vector Store
vec_store = SimpleVectorStore()

# Create Property Graph Index from documents
index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    vector_store=vec_store,
    show_progress=True,
)

# Persist the vector store
index.storage_context.vector_store.persist("./vec_store.json")

# Example query
query = "Explain the first law of thermodynamics."
retrieved = index.as_retriever().retrieve(query)
answer = index.as_query_engine().query(query)
print(retrieved, answer)

This script will load all JSON files from the "JSONs" directory into the Llama Index documents loader and proceed with the rest of the indexing and querying process [1]. Adjust the paths and parameters as needed for your specific use case.

To continue talking to Dosu, mention @dosu.