[Feature Request]: Define allowed entity and relation types for KnowledgeGraphIndex

valentinbuc commented 9 months ago

Feature Description

Currently, llama_index.KnowledgeGraphIndex.from_documents() can be used to extract triples from documents and construct a KG index. It would be great if allowed entity and relation types could be passed as input arguments, for instance edge_types=["born_in", "educated_at", "received", "advisor_to"] and entity_types=["person", "university", "degree", "award", "institution", "location"] (i.e. the node types).

I have searched the documentation but could not find anything. Please let me know if I missed something :)

Reason

To my knowledge, no existing LlamaIndex functionality allows this. I have not yet found an easy workaround.

This feature could probably be implemented with some clever prompt engineering and rule-based output verification. Ideally, a feedback loop could be integrated so that the LLM is told when it generates an undesired relation or entity type and can correct itself.
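To make the verification-plus-feedback idea concrete, here is a minimal sketch in plain Python. It is not LlamaIndex code: `extract` is a hypothetical callable standing in for whatever LLM call produces the triples, and the feedback wording is only an illustration.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def split_by_schema(
    triples: List[Triple], allowed_relations: Set[str]
) -> Tuple[List[Triple], List[Triple]]:
    """Rule-based check: separate triples whose relation is in the schema."""
    valid = [t for t in triples if t[1] in allowed_relations]
    invalid = [t for t in triples if t[1] not in allowed_relations]
    return valid, invalid


def extract_with_feedback(
    text: str,
    extract: Callable[[str, str], List[Triple]],  # hypothetical: (text, feedback) -> triples
    allowed_relations: Set[str],
    max_rounds: int = 3,
) -> List[Triple]:
    """Re-run extraction, telling the extractor which relations were rejected."""
    feedback = ""
    valid: List[Triple] = []
    for _ in range(max_rounds):
        triples = extract(text, feedback)
        valid, invalid = split_by_schema(triples, allowed_relations)
        if not invalid:
            break  # everything conforms, no further round needed
        rejected = sorted({t[1] for t in invalid})
        feedback = (
            f"These relation types are not allowed: {rejected}. "
            f"Use only: {sorted(allowed_relations)}."
        )
    # Only schema-conforming triples from the last round are returned.
    return valid
```

If the extractor never corrects itself, the loop stops after `max_rounds` and the non-conforming triples are simply dropped.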

Value of Feature

LlamaIndex currently extracts whatever the LLM considers relevant, and the resulting triples are rather noisy. Imagine I have a selection of news reports and want to extract a KG focusing only on which public figures have interacted with each other. To my knowledge there is no way of doing so using LlamaIndex or any other tool at the moment.
Given I already have a KG with a well-defined schema, I would like to enrich it with triples extracted from documents. While doing so, I would like to add only entity and relation types from the current schema. The relations would of course also need to use the same names; for example, we would want the LLM to generate the relation educated_at rather than received_education_from.
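As an illustration of what "conforming to the current schema" could mean in practice, here is a small sketch. The schema, the alias table, and the `TypedTriple` shape are all hypothetical; in particular it assumes the extractor also returns an entity type for subject and object, which the current triple format does not.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class TypedTriple(NamedTuple):
    subject: str
    subject_type: str
    relation: str
    obj: str
    obj_type: str


# Hypothetical schema: relation -> (required subject type, required object type).
SCHEMA: Dict[str, Tuple[str, str]] = {
    "born_in": ("person", "location"),
    "educated_at": ("person", "university"),
}

# Hypothetical hand-written aliases mapping LLM phrasings to schema names.
ALIASES: Dict[str, str] = {"received_education_from": "educated_at"}


def conform(
    triple: TypedTriple,
    schema: Dict[str, Tuple[str, str]] = SCHEMA,
    aliases: Dict[str, str] = ALIASES,
) -> Optional[TypedTriple]:
    """Return the triple with a canonical relation name, or None if it does not fit."""
    relation = aliases.get(triple.relation, triple.relation)
    if relation not in schema:
        return None  # relation type is not part of the schema
    if (triple.subject_type, triple.obj_type) != schema[relation]:
        return None  # entity types do not match what the relation expects
    return triple._replace(relation=relation)
```

An alias table only covers phrasings someone anticipated, so this complements rather than replaces getting the LLM to use the schema names directly.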

dosubot[bot] commented 9 months ago

🤖

It's great to see your interest in enhancing the functionality of LlamaIndex! Your proposed feature to define allowed entity and relation types for KnowledgeGraphIndex would indeed be a valuable addition. This could significantly improve the precision and relevance of the extracted knowledge graph, especially in scenarios where specific types of relations and entities are required.

Your insights into the potential value of this feature are well articulated, and it's clear that such functionality could address important use cases, such as extracting a KG focusing only on specific interactions between public figures.

I would encourage you to consider contributing to LlamaIndex by creating a pull request with your proposed solution. Your input and expertise in prompt engineering and rule-based output verification could be instrumental in implementing this feature.

Thank you for sharing your thoughts and ideas for improving LlamaIndex!

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


hatianzhang commented 9 months ago

@valentinbuc Thanks, that is great feedback. We are looking into KG improvements. Let me check whether we can add something quick for edge_types and entity_types. cc @logan-markewich

valentinbuc commented 9 months ago

Thank you! Note that there is also already a Discord thread on this, where we identified two possible workarounds:

1. Using a custom kg_triplet_extract_fn. However, this of course requires a function that can perform the necessary task.
2. Adjusting the DEFAULT_KG_TRIPLET_EXTRACT_PROMPT by instructing it to focus only on certain edge and entity types. However, the problem here is that GPT-4 does not seem capable of following these instructions, and does not even manage to stick to entities. Getting this to work may require a more complex setup with multi-agent LLMs.
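The two workarounds can be combined: constrain the prompt, then filter whatever comes back, since the prompt alone is not reliably followed. The sketch below is plain Python and makes several assumptions: `complete` is a hypothetical callable that sends a prompt to an LLM and returns its text, the prompt wording is my own and not the library's default, and the returned function is assumed to have the shape kg_triplet_extract_fn expects (text in, list of (subject, relation, object) tuples out).

```python
import re
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Matches "(a, b, c)" with exactly three comma-separated parts.
_TRIPLET = re.compile(r"\(([^,()]+),([^,()]+),([^,()]+)\)")


def build_prompt(
    text: str,
    relation_types: Sequence[str],
    entity_types: Sequence[str],
    max_triplets: int = 10,
) -> str:
    """Illustrative prompt listing the allowed types (not the library default)."""
    return (
        f"Extract up to {max_triplets} knowledge triplets from the text below, "
        "one per line, in the form (subject, relation, object).\n"
        f"Allowed relation types: {', '.join(relation_types)}\n"
        f"Allowed entity types: {', '.join(entity_types)}\n"
        "Skip any fact that does not fit these types.\n"
        "---------------------\n"
        f"Text: {text}\n"
        "Triplets:\n"
    )


def parse_triplets(response: str) -> List[Triple]:
    """Pull well-formed (subject, relation, object) tuples out of raw LLM text."""
    return [
        (m.group(1).strip(), m.group(2).strip(), m.group(3).strip())
        for m in _TRIPLET.finditer(response)
    ]


def make_extract_fn(
    complete: Callable[[str], str],  # hypothetical: prompt -> LLM response text
    relation_types: Sequence[str],
    entity_types: Sequence[str],
) -> Callable[[str], List[Triple]]:
    """Build a text -> triples function that drops relations outside the schema."""
    allowed = set(relation_types)

    def extract(text: str) -> List[Triple]:
        response = complete(build_prompt(text, relation_types, entity_types))
        return [t for t in parse_triplets(response) if t[1] in allowed]

    return extract
```

The post-filter can only enforce relation names here; checking entity types would need the model to emit a type for each entity, which this triple format does not carry.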

REBEL and DREEAM are other good options for triple extraction; however, they are limited to a fixed set of relations unless you want to annotate data and train a model yourself.

dosubot[bot] commented 6 months ago

Hi, @valentinbuc, I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale.

It looks like the issue you raised requested the ability to define allowed entity and relation types as input arguments for the llama_index.KnowledgeGraphIndex.from_documents() function. There has been some discussion around this feature, with support expressed for its potential to improve the precision and relevance of the extracted knowledge graph. It seems that the feature has been implemented, allowing for more control over the extracted triples and enabling the enrichment of a well-defined knowledge graph schema with relevant triples extracted from documents.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
