run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.72k stars · 5.27k forks

[Question]: Knowledge Graph Vs RAG #7433

Closed vishnu9000 closed 1 year ago

vishnu9000 commented 1 year ago

Question Validation

Question

Hi guys, I am new to LlamaIndex and LLMs. I am working on a use case for document question answering and I am confused about which path to take. I have a lot of Excel files that contain test case details: component, test case ID, test case name, test case steps, and expected results. I want to create a QA system that can answer the following questions:

  1. Is this bug covered by any test case?
  2. Is there any similar test case related to this?
  3. List all test cases that need to be tested for a component.
  4. Give me the minimum set of test cases that need to be run when change x happens.

The first two questions can be handled with a simple vector store. But the last two cannot, due to the limitations of chunk-based vector-store RAG. Say I ask for the test cases of component y: the pipeline retrieves the 5 best-matched chunks, but I may have 20 such test cases. And for the last question the model might need to consult multiple chunks to produce an accurate result. In my case each test case (component, test case ID, test case name, test case steps, expected results) is one chunk, so each chunk is already fairly large.

That is when I thought of Knowledge Graphs and looked into the LlamaIndex framework. I am using a local quantized Llama 2 model for data security reasons. I have checked out the following approaches.

  1. General Knowledge Graph. Here I saw that the graph is not generated with all the information in the text: it skips parts of the text, and it is really confusing how the graph is generated. The following is an example taken from the LlamaIndex docs themselves:

I have tried to add triplets manually as in the docs:

node_0_tups = [
    ("author", "worked on", "writing"),
    ("author", "worked on", "programming"),
]
for tup in node_0_tups:
    index.upsert_triplet_and_node(tup, nodes[0])


But this is the graph it generates, and I am really confused about how the relationships came out like this.

  2. I also tried a local model for extracting triplets from text, like REBEL, but still not all the information is extracted properly.

I was hoping a KG could help with all the use cases I have listed above. Can someone help me with this? I am stuck. Am I doing anything wrong? Is this not the correct approach?

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for your interest in LlamaIndex and for your detailed question. It seems like you're on the right track with your approach to using a Knowledge Graph (KG) for your use case.

To address your concerns:

  1. Missing Information in the Graph: The graph generation process in LlamaIndex is based on extracting triplets from the text. This process might not capture all the information in the text, especially if the information is not in a format that can be easily converted into a triplet. This is a known limitation of the current version of LlamaIndex.

  2. Adding Triplets Manually: The example you provided is a manual way of adding triplets to the graph. The resulting graph might not make sense if the triplets are not related or if the relationships between the nodes are not clearly defined.

  3. Using Local Model for Extracting Triplets: The effectiveness of the local model in extracting triplets can vary depending on the quality and format of the input text. If the text is not well-structured or if it contains complex sentences, the model might not be able to extract all the relevant triplets.

For your use case, you might need to preprocess your data to ensure that it is in a format that can be easily parsed by the model. This might involve converting your Excel data into a structured format like JSON or CSV, and structuring your text in a way that makes it easier for the model to extract the relevant triplets.
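As a minimal sketch of that preprocessing step (assuming the rows have already been loaded into dicts, e.g. via pandas.read_excel(...).to_dict("records"); the field names here are hypothetical, not your actual schema):

```python
def row_to_text(row):
    """Render one test-case row as a structured sentence that is easy
    for an LLM (or a rule-based extractor) to turn into triplets."""
    return (
        f"Test case {row['test_case_id']} ({row['test_case_name']}) "
        f"belongs to component {row['component']}. "
        f"Steps: {row['steps']}. Expected result: {row['expected_result']}."
    )

# Hypothetical example row; in practice this comes from your Excel files.
rows = [
    {
        "test_case_id": "TC-001",
        "test_case_name": "Login works",
        "component": "Auth",
        "steps": "Open page; enter credentials; submit",
        "expected_result": "User is logged in",
    },
]
texts = [row_to_text(r) for r in rows]
```

Each resulting string can then be wrapped in a Document object before being passed to the index.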

Here's a sample code snippet showing how you can use LlamaIndex for your use case:

from llama_index import (
    KnowledgeGraphIndex,
    ServiceContext,
    SimpleDirectoryReader,
)
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore
from llama_index.llms import OpenAI
from llama_index.query_engine import KnowledgeGraphQueryEngine

# Load your documents, e.g. text exported from your Excel files
documents = SimpleDirectoryReader("./data").load_data()

llm = OpenAI(temperature=0, model="text-davinci-002")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size_limit=512)

# NebulaGraph schema details -- replace these with your own space and schema
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

response = query_engine.query("Your question here")

Please replace the placeholders with your actual data and parameters.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

Best, Dosu

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for your detailed question. It seems like you're on the right track with using LlamaIndex's Knowledge Graph (KG) for your use case.

For your first two questions, as you mentioned, a simple vector store or even a traditional information retrieval system might suffice. However, for the last two questions, you're correct that a more complex system like a KG would be beneficial due to the need to retrieve multiple related chunks of information.

In terms of the issues you're facing with the KG, it seems like there might be some confusion about how the graph is generated and how triplets are extracted from the text.

The graph is generated based on the triplets that are extracted from the text. Each triplet consists of a subject, predicate, and object, which form the nodes and edges of the graph. The upsert_triplet_and_node method is used to insert these triplets into the graph. If the graph is not being generated as expected, it might be due to the way the triplets are being extracted or inserted.

The extraction of triplets is handled by the _extract_triplets method in the KnowledgeGraphIndex class. This method uses a function (_kg_triplet_extract_fn) if provided, otherwise it uses the _llm_extract_triplets method. The _llm_extract_triplets method uses the LLM predictor to predict triplets from the text based on the kg_triple_extract_template. The predicted response is then parsed by the _parse_triplet_response method to extract the triplets.
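Because your test-case data is already structured, one option worth considering is bypassing LLM extraction entirely and supplying your own kg_triplet_extract_fn. A rule-based sketch (the text layout it parses is a hypothetical example, not your actual schema):

```python
import re

def extract_test_case_triplets(text):
    """Rule-based (subject, predicate, object) extraction from text shaped
    like 'Test case TC-001 (Login works) belongs to component Auth. ...'."""
    triplets = []
    m = re.search(r"Test case (\S+) \((.*?)\) belongs to component (\w+)", text)
    if m:
        tc_id, tc_name, component = m.groups()
        triplets.append((component, "has test case", tc_id))
        triplets.append((tc_id, "is named", tc_name))
    return triplets

# Passed in place of the LLM extractor, e.g.:
# kg_index = KnowledgeGraphIndex.from_documents(
#     documents, kg_triplet_extract_fn=extract_test_case_triplets, ...
# )
```

Deterministic extraction like this avoids both the max_triplets_per_chunk cap and the variability of LLM output, at the cost of writing parsing rules for your own text format.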

There are a few limitations to this process that might be causing the issues you're seeing:

  1. The maximum number of triplets that can be extracted per chunk is defined by the max_triplets_per_chunk parameter. If the text contains more triplets, they will not be extracted.

  2. The maximum length of the object in a triplet is defined by the max_object_length parameter. If the object in a triplet exceeds this length, the triplet will be skipped.

  3. The extraction process relies on the LLM predictor and the kg_triple_extract_template. If the predictor or the template is not accurate or comprehensive, the extraction process may not be able to extract all the triplets correctly.

  4. The extraction process only extracts triplets that are properly formatted (i.e., enclosed in parentheses and separated by commas). If a triplet is not properly formatted, it will be skipped.

You can find more details about this process in the KnowledgeGraphIndex class.
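To illustrate limitation 4 above: only lines shaped like a parenthesized, comma-separated triplet survive parsing. The following sketch mimics (but does not reproduce) that parsing behavior:

```python
import re

def parse_triplet_lines(response):
    """Keep only lines shaped like '(subject, predicate, object)';
    anything malformed is silently skipped."""
    triplets = []
    for line in response.splitlines():
        m = re.match(r"^\((.+?),\s*(.+?),\s*(.+?)\)$", line.strip())
        if m:
            triplets.append(tuple(part.strip() for part in m.groups()))
    return triplets

# Hypothetical LLM output: the middle line lacks parentheses and is dropped.
llm_output = (
    "(author, worked on, writing)\n"
    "author worked on programming\n"
    "(author, worked on, programming)"
)
```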

As for the image you've shared, I'm sorry but I'm unable to view images. Could you please describe the issue you're facing with the graph in more detail?

I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.


vishnu9000 commented 1 year ago

@dosu-bot So for my points 3 & 4, is this the best approach? I have tried to manually add triplets to the graph:

node_0_tups = [
    ("author", "worked on", "writing"),
    ("author", "worked on", "programming"),
]
for tup in node_0_tups:
    index.upsert_triplet_and_node(tup, nodes[0])

This is the same code from the llama-index docs that I am using. But the generated graph has a relationship like this: the two nodes are "author" and "worked on", and the arrow shows the relation as "author" too. What I have seen is that when inserting triplets, the first and second elements are taken as nodes and the first element is taken as the relation in all cases. Is that the correct method? In reality shouldn't the first and third elements be taken as nodes and the second element as the relation? Also, what is the use of passing nodes[0] here: index.upsert_triplet_and_node(tup, nodes[0])?

For creating KG:

  1. Either we have to manually enter triplets (in my case I was thinking of entering (test case ID, "has following steps", test case steps) and (component, "has following test case", test case name)). Is this the right approach?
  2. If I use an LLM for extracting triplets, either REBEL as in the LlamaIndex docs or another model such as Llama 2 or GPT, would that be better than manually entering the data in my case? One issue is that I am restricted to running local models only, for data security reasons, and I have a CPU-only system, which is why I was running a quantized Llama 2 model. Will that work here too?
  3. If I use an external graph database and integrate it with llama-index, how do I create the graph from my data in that database? Do I have to add it manually there too, or is there a method specific for that?

logan-markewich commented 1 year ago

@vishnu9000 hmm, I think it's just a bug with drawing the graph tbh, especially since we have unit tests for upserting triplets like that.

vishnu9000 commented 1 year ago

@logan-markewich ohh.. so is this a bug in llama index? Any way to rectify it? If so, will the same bug affect the other KG options in llama index too?

wey-gu commented 1 year ago

@vishnu9000 could you give it a try on NebulaGraph first?

that's the main graph store I kept changing/iterating/optimizing towards.

Will spend some time polishing the SimpleStore and the mutation of the kg_index later.

vishnu9000 commented 1 year ago

@wey-gu Thanks for the info, I will check it out. This might be a dumb question, but how do I create a graph from text for NebulaGraph? Should it be done manually, or by using an LLM to extract entities? Wouldn't that take too much time? I have 100 Excel files with test case ID, component, test case name, steps, and results; how do I create graph entities from this?