run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.86k stars 5.09k forks

[Bug]: KnowledgeGraphIndex not returning triplets in networkx when using documentation example #13806

Closed csaiedu closed 3 months ago

csaiedu commented 4 months ago

Bug Description

When running the example at https://docs.llamaindex.ai/en/stable/examples/index_structs/knowledge_graph/knowledge_graph2/

The networkx graph that is returned is displayed blank. No triplets seem to be passed to the knowledge graph. The extract_triplets function seems incorrect: replacing it with the extraction function recommended by Hugging Face works, and debug logging shows that triplets are extracted:

Extracted triplets: [{'head': 'Attention Mechanisms', 'type': 'subclass of', 'tail': 'attention mechanisms'}]

However, the graph is still empty when calling g = index.get_networkx_graph(). This is the replacement function:

def extract_triplets(input_text):
    text = triplet_extractor.tokenizer.batch_decode(
        [
            triplet_extractor(
                input_text, return_tensors=True, return_text=False
            )[0]["generated_token_ids"]
        ]
    )[0]

    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

Version

0.10.40

Steps to Reproduce

Run the referenced code from the documentation.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 4 months ago

Hey there, @csaiedu! I'm here to help you with your bug report. I'm a bot that can assist with bug fixes, answer questions, and guide you on becoming a contributor. Let's take a look at the issue you've encountered.

The issue you're encountering with the KnowledgeGraphIndex not returning triplets in a networkx graph is likely due to the extract_triplets function not integrating properly with the KnowledgeGraphIndex class. The KnowledgeGraphIndex class expects triplets in a specific format and stores them in its own structure, but it does not inherently convert these triplets into a networkx graph.

To resolve this, you need to ensure that the extract_triplets function is correctly extracting triplets and that these triplets are being properly inserted into the KnowledgeGraphIndex. Additionally, you need to use the get_networkx_graph method provided by the KnowledgeGraphIndex class to generate the networkx graph.

Here is an example of how you can integrate the extract_triplets function and generate a networkx graph:

  1. Define the extract_triplets function:

    def extract_triplets(input_text):
        text = triplet_extractor.tokenizer.batch_decode(
            [
                triplet_extractor(
                    input_text, return_tensors=True, return_text=False
                )[0]["generated_token_ids"]
            ]
        )[0]
    
        triplets = []
        subject, relation, object_ = '', '', ''
        text = text.strip()
        current = 'x'
        for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
            if token == "<triplet>":
                current = 't'
                if relation != '':
                    triplets.append((subject.strip(), relation.strip(), object_.strip()))
                    relation = ''
                subject = ''
            elif token == "<subj>":
                current = 's'
                if relation != '':
                    triplets.append((subject.strip(), relation.strip(), object_.strip()))
                object_ = ''
            elif token == "<obj>":
                current = 'o'
                relation = ''
            else:
                if current == 't':
                    subject += ' ' + token
                elif current == 's':
                    object_ += ' ' + token
                elif current == 'o':
                    relation += ' ' + token
        if subject != '' and relation != '' and object_ != '':
            triplets.append((subject.strip(), relation.strip(), object_.strip()))
        return triplets
  2. Initialize the KnowledgeGraphIndex with the custom extract_triplets function:

    from llama_index.core.indices.knowledge_graph.base import KnowledgeGraphIndex
    
    # Initialize the KnowledgeGraphIndex with the custom triplet extraction function
    kg_index = KnowledgeGraphIndex(
        kg_triplet_extract_fn=extract_triplets,
        # other parameters as needed
    )
    
    # Add documents to the index (assuming `documents` is a list of Document objects)
    kg_index = KnowledgeGraphIndex.from_documents(documents)
    
    # Ensure networkx is installed
    try:
        import networkx as nx
    except ImportError:
        raise ImportError("Please install networkx to visualize the graph: `pip install networkx`")
    
    # Generate the networkx graph
    graph = kg_index.get_networkx_graph(limit=100)
  3. Visualize the networkx graph:

    import matplotlib.pyplot as plt
    
    nx.draw(graph, with_labels=True)
    plt.show()

By following these steps, you should be able to extract triplets using your custom extract_triplets function, insert them into the KnowledgeGraphIndex, and generate a networkx graph for visualization [1][2].

About Dosu: This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

csaiedu commented 4 months ago

This returns an error. It is raised within the initialisation of KnowledgeGraphIndex, which according to the suggestion is separate from the from_documents call: ValueError: One of nodes, objects, or index_struct must be provided.

dosubot[bot] commented 4 months ago

To resolve the error "One of nodes, objects, or index_struct must be provided" and properly initialize the KnowledgeGraphIndex, you need to ensure that at least one of the parameters nodes, objects, or index_struct is provided when initializing the KnowledgeGraphIndex.

Here is an example of how you can initialize the KnowledgeGraphIndex with a list of nodes and then generate a networkx graph:

from llama_index.core.schema import TextNode
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core.indices.knowledge_graph.base import KnowledgeGraphIndex

# Example nodes (BaseNode is abstract, so the concrete TextNode is used here)
nodes = [
    TextNode(text="Node 1 content"),
    TextNode(text="Node 2 content"),
    # Add more nodes as needed
]

# Initialize the KnowledgeGraphIndex; triplets are extracted from the nodes at build time
kg_index = KnowledgeGraphIndex(
    nodes=nodes,
    storage_context=StorageContext.from_defaults(),  # default in-memory stores
)

# Generate the networkx graph
graph = kg_index.get_networkx_graph(limit=100)

# Now you can work with the generated graph
print(graph.nodes)
print(graph.edges)

In this example, the nodes parameter is provided to the KnowledgeGraphIndex constructor, which satisfies the requirement that one of nodes, objects, or index_struct must be provided. After initializing the index, the get_networkx_graph method is called to generate the networkx graph [1][2].

csaiedu commented 4 months ago

This is not helping

logan-markewich commented 4 months ago

@csaiedu your extract_triplets function is incorrect

It should be returning a list of tuples (subject, relation, object), not a list of dicts
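
A minimal sketch of that change, assuming the dict keys 'head', 'type' and 'tail' shown in the debug output above. The helper names (to_tuples, as_tuple_extractor, fake_dict_extractor) are hypothetical and only illustrate the conversion; the stand-in extractor replaces the real Hugging Face pipeline so the snippet runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]


def to_tuples(triplets: List[Dict[str, str]]) -> List[Triplet]:
    """Convert REBEL-style dicts into (subject, relation, object) tuples."""
    return [(t["head"], t["type"], t["tail"]) for t in triplets]


def as_tuple_extractor(
    dict_extractor: Callable[[str], List[Dict[str, str]]]
) -> Callable[[str], List[Triplet]]:
    """Wrap a dict-returning extractor so that it returns tuples instead."""

    def extract(text: str) -> List[Triplet]:
        return to_tuples(dict_extractor(text))

    return extract


# Hypothetical stand-in for the dict-returning extract_triplets from the issue.
def fake_dict_extractor(text: str) -> List[Dict[str, str]]:
    return [
        {
            "head": "Attention Mechanisms",
            "type": "subclass of",
            "tail": "attention mechanisms",
        }
    ]


extract_triplets = as_tuple_extractor(fake_dict_extractor)
print(extract_triplets("any input text"))
# [('Attention Mechanisms', 'subclass of', 'attention mechanisms')]
```

The wrapped function could then be passed as kg_triplet_extract_fn when building the index with from_documents, as in the documentation example linked in the issue. That call is left out here because it needs the full index setup.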