Closed vishnu9000 closed 1 year ago
🤖
Hello,
Thank you for your interest in LlamaIndex and for your detailed question. It seems like you're on the right track with your approach to using a Knowledge Graph (KG) for your use case.
To address your concerns:
1. **Missing information in the graph:** The graph generation process in LlamaIndex is based on extracting triplets from the text. This process might not capture all the information in the text, especially if the information is not in a format that can be easily converted into a triplet. This is a known limitation of the current version of LlamaIndex.
2. **Adding triplets manually:** The example you provided is a manual way of adding triplets to the graph. The resulting graph might not make sense if the triplets are not related or if the relationships between the nodes are not clearly defined.
3. **Using a local model for extracting triplets:** The effectiveness of a local model in extracting triplets varies with the quality and format of the input text. If the text is not well structured, or if it contains complex sentences, the model might not be able to extract all the relevant triplets.
For your use case, you might need to preprocess your data to ensure it is in a format the model can parse easily. This might involve converting your Excel data into a structured format like JSON or CSV, and phrasing your text in a way that makes it easier for the model to extract the relevant triplets.
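As a sketch of that preprocessing step, each spreadsheet row can be rendered as a short, regular sentence before indexing, so the triplet extractor sees explicit subject/predicate/object cues. The column names and sentence template below are illustrative assumptions, not part of LlamaIndex:

```python
from typing import Dict, List

def rows_to_text(rows: List[Dict[str, str]]) -> List[str]:
    """Render each test-case row as a short, regular sentence.
    The column names and wording are assumptions; adapt them to
    your actual spreadsheet export."""
    texts = []
    for row in rows:
        texts.append(
            f"Test case {row['test_case_id']} belongs to component "
            f"{row['component']}. Its steps are: {row['steps']}. "
            f"Its expected result is: {row['expected_results']}."
        )
    return texts

# Example: one row exported from Excel (e.g. via pandas or csv)
rows = [{
    "test_case_id": "TC-001",
    "component": "login",
    "steps": "enter valid credentials and submit",
    "expected_results": "user is logged in",
}]
documents_text = rows_to_text(rows)
```

Sentences produced this way can then be written to files (or wrapped in `Document` objects) and fed to the index-building code below.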
Here's a sample code snippet showing how you can use LlamaIndex for your use case:

```python
from llama_index import (
    KnowledgeGraphIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore
from llama_index.llms import OpenAI

# Example schema values; replace them to match your NebulaGraph space.
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]

# Load your (preprocessed) documents.
documents = SimpleDirectoryReader("./data").load_data()

llm = OpenAI(temperature=0, model="text-davinci-002")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size_limit=512)

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=10,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

from llama_index.query_engine import KnowledgeGraphQueryEngine

query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)
response = query_engine.query("Your question here")
```
Please replace the placeholders with your actual data and parameters.
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
Best, Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
🤖
Hello,
Thank you for your detailed question. It seems like you're on the right track with using LlamaIndex's Knowledge Graph (KG) for your use case.
For your first two questions, as you mentioned, a simple vector store or even a traditional information retrieval system might suffice. However, for the last two questions, you're correct that a more complex system like a KG would be beneficial due to the need to retrieve multiple related chunks of information.
In terms of the issues you're facing with the KG, it seems like there might be some confusion about how the graph is generated and how triplets are extracted from the text.
The graph is generated based on the triplets that are extracted from the text. Each triplet consists of a subject, predicate, and object, which form the nodes and edges of the graph. The `upsert_triplet_and_node` method is used to insert these triplets into the graph. If the graph is not being generated as expected, it might be due to the way the triplets are being extracted or inserted.

The extraction of triplets is handled by the `_extract_triplets` method in the `KnowledgeGraphIndex` class. This method uses a custom function (`_kg_triplet_extract_fn`) if one is provided; otherwise it falls back to the `_llm_extract_triplets` method, which uses the LLM predictor to predict triplets from the text based on the `kg_triple_extract_template`. The predicted response is then parsed by the `_parse_triplet_response` method to extract the triplets.

There are a few limitations to this process that might be causing the issues you're seeing:

- The maximum number of triplets that can be extracted per chunk is defined by the `max_triplets_per_chunk` parameter. If the text contains more triplets, the extras will not be extracted.
- The maximum length of the object in a triplet is defined by the `max_object_length` parameter. If the object in a triplet exceeds this length, the triplet is skipped.
- The extraction process relies on the LLM predictor and the `kg_triple_extract_template`. If the predictor or the template is not accurate or comprehensive, some triplets may be missed or extracted incorrectly.
- Only properly formatted triplets (i.e., enclosed in parentheses and separated by commas) are parsed; a malformed triplet is skipped.

You can find more details about this process in the `KnowledgeGraphIndex` class.
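If the default LLM-based extraction misses triplets in your data, one option is to supply your own extractor via the custom extraction function mentioned above. A minimal, regex-based sketch (the `(subject, predicate, object)` line format it assumes mirrors the parenthesized format described above, but the function itself is illustrative, not LlamaIndex code):

```python
import re
from typing import List, Tuple

def custom_extract_triplets(text: str) -> List[Tuple[str, str, str]]:
    """Extract (subject, predicate, object) triplets from lines like
    '(author, worked on, writing)'. Anything malformed is skipped."""
    triplets = []
    for match in re.finditer(r"\(([^()]+)\)", text):
        parts = [p.strip() for p in match.group(1).split(",")]
        if len(parts) == 3 and all(parts):
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets
```

Such a function could then be passed as the `kg_triplet_extract_fn` argument of `KnowledgeGraphIndex.from_documents` (check the signature in your installed version of LlamaIndex before relying on this).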
As for the image you've shared, I'm sorry but I'm unable to view images. Could you please describe the issue you're facing with the graph in more detail?
I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.
@dosu-bot So for my points 3 & 4, is this the best approach? I have tried to manually add triplets to the graph:

```python
node_0_tups = [
    ("author", "worked on", "writing"),
    ("author", "worked on", "programming"),
]
for tup in node_0_tups:
    index.upsert_triplet_and_node(tup, nodes[0])
```
This is the same code from the llama-index docs. But in the generated graph, the two nodes are "author" and "worked on", and the arrow between them is labeled "author" too. What I have seen is that when inserting triplets, the first and second elements are taken as the nodes and the first element is taken as the relation in all cases. Is that the correct behavior? In reality, shouldn't the first and third elements be the nodes and the second element the relation? Also, what is the use of passing `nodes[0]` here: `index.upsert_triplet_and_node(tup, nodes[0])`?
For creating KG:
@vishnu9000 hmm, I think it's just a bug with drawing the graph tbh, especially since we have unit tests for upserting triplets like that.
@logan-markewich Ohh, so is this a bug in LlamaIndex? Is there any way to rectify it? If that's the case, will the same bug be there for the other KG store options in LlamaIndex too?
@vishnu9000 could you give it a try on NebulaGraph first?
that's the main graph store I kept changing/iterating/optimizing towards.
Will spend some time polishing the SimpleStore and the mutation of the kg_index later.
@wey-gu Thanks for the info, I will check it out. This might be a dumb question, but how do I create a graph from text for NebulaGraph? Should it be done manually, or by using an LLM to extract entities? Wouldn't that take too much time? I have 100 Excel files with test case id, component, test case name, steps, and results; how do I create graph entities from these?
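For rows as regular as these, triplets can often be derived deterministically, with no LLM in the loop. A minimal sketch, where the column names and predicate strings are assumptions to adapt to the actual spreadsheets:

```python
from typing import Dict, List, Tuple

def row_to_triplets(row: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn one test-case row into (subject, predicate, object) triplets,
    anchored on the test case id. Column names are assumptions."""
    tc = row["test_case_id"]
    return [
        (tc, "belongs to component", row["component"]),
        (tc, "has name", row["test_case_name"]),
        (tc, "has steps", row["steps"]),
        (tc, "expects result", row["expected_results"]),
    ]

row = {
    "test_case_id": "TC-001",
    "component": "login",
    "test_case_name": "valid login",
    "steps": "enter credentials; submit",
    "expected_results": "user is logged in",
}
triplets = row_to_triplets(row)
```

Each resulting triplet could then be inserted with `index.upsert_triplet_and_node(...)` as in the earlier snippet, skipping LLM extraction entirely.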
Question Validation
Question
Hi guys, I am new to LlamaIndex and LLMs. I am working on a use case for document question answering and I am confused about which path to take. I have a lot of Excel files that contain test case details like component, test case id, test case name, test case steps, and expected results. I want to create a QA system that can answer the following questions:
Here the first two questions can be handled with a simple vector store. But the last two cannot, due to the limitations of vector-store RAG, where chunks are retrieved independently. Say I ask "give me the test cases for component y": the pipeline will retrieve the 5 best-matched chunks, but I may have 20 such test cases. And for the last question the model might need to visit multiple chunks to get an accurate result. In my case, each test case (component, test case id, test case name, test case steps, expected results) is one chunk, so each chunk already has a good amount of size.
That's when I thought of Knowledge Graphs and looked into the LlamaIndex framework. I am using a local quantized Llama 2 model for data security reasons. I have checked out the following sections.
I have tried to add triplets manually as in the docs:

```python
node_0_tups = [
    ("author", "worked on", "writing"),
    ("author", "worked on", "programming"),
]
for tup in node_0_tups:
    index.upsert_triplet_and_node(tup, nodes[0])
```
But this is the graph it's generating; I am really confused by how the relationships ended up like this.
I was hoping a KG could help with all the use cases I have listed above. Can someone help me? I am stuck. Am I doing anything wrong? Is this not the correct approach?