run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Extract_prompt with DynamicLLMPathExtractor #14975

Open ferdeleong opened 1 month ago

ferdeleong commented 1 month ago

Question Validation

Question

Do you have any examples of how to use extract_prompt with DynamicLLMPathExtractor?

I've only found an example of extract_prompt usage with SimpleLLMPathExtractor, here: https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/

When I try to use that exact same prompt with DynamicLLMPathExtractor, I get this result:

(screenshot)

If I comment out extract_prompt, my graph works again, but I do need to give the LLM more context to produce proper relationships.

I also looked at the knowledge-graph prompts in the library's default_prompts.py file and customized the prompt so it follows the allowed entities and relationships defined in kg_extractor, but still no success.

I also tried formatting the prompt so that extract_prompt receives a PromptTemplate instead of a str, but again no success.

dosubot[bot] commented 1 month ago

Hello @ferdeleong! I'm here to help you with your coding questions and bug fixes. Let's work together to solve the issue you're facing.

To use extract_prompt with DynamicLLMPathExtractor, you need to ensure that the prompt is correctly formatted and compatible with the extractor. Here is an example of how to properly set up and use DynamicLLMPathExtractor with extract_prompt:

from llama_index.core.indices.property_graph import DynamicLLMPathExtractor
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.openai import OpenAI

# Define your language model (any concrete LLM class works; OpenAI shown here)
llm = OpenAI(model="gpt-4o-mini", temperature=0)

# Define your extraction prompt. The extractor fills in
# {max_knowledge_triplets}, {allowed_entity_types},
# {allowed_relation_types} and {text} itself, so keep those placeholders.
extract_prompt = PromptTemplate(
    "Extract up to {max_knowledge_triplets} relationships between entities "
    "in the following text, restricting entity types to "
    "{allowed_entity_types} and relation types to {allowed_relation_types}: "
    "{text}"
)

# Initialize the DynamicLLMPathExtractor with seed entity and relation types
kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    extract_prompt=extract_prompt,
    allowed_entity_types=["PERSON", "PLACE", "THING"],
    allowed_relation_types=["PART_OF", "HAS", "IS_A"],
    num_workers=4,
    max_triplets_per_chunk=10,
)

# The extractor is a transform over nodes; run it on parsed nodes
# (assuming `nodes` was produced by a node parser / text splitter)
nodes = kg_extractor(nodes)

In this example, make sure that your extract_prompt provides sufficient context for the LLM to establish relationships between nodes, and that it instructs the LLM to emit output in the format the extractor's parse function expects. If you encounter issues, check that the prompt is correctly formatted and that the allowed entity and relation types are properly defined.
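One low-tech way to sanity-check a prompt before handing it to the extractor is to list the placeholders it actually contains. This is a stdlib-only sketch (the placeholder names shown are the ones used by the default dynamic prompt quoted later in this thread):

```python
from string import Formatter

def placeholder_names(template: str) -> list[str]:
    # str.format-style parsing: doubled braces ({{ }}) are literals
    # and yield no field name, so JSON examples inside the prompt are safe.
    return [name for _, name, _, _ in Formatter().parse(template) if name]

prompt = (
    "Extract up to {max_knowledge_triplets} triplets.\n"
    "Entity Types: {allowed_entity_types}\n"
    "Output: [{{'head': ''}}]\n"
    "Text: {text}\n"
)
print(placeholder_names(prompt))
# → ['max_knowledge_triplets', 'allowed_entity_types', 'text']
```

If a placeholder the extractor tries to fill is missing (or a stray single brace turns a JSON example into an unintended placeholder), formatting fails or the text never gets injected.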

For more details on the differences between SimpleLLMPathExtractor and DynamicLLMPathExtractor, you can refer to the documentation [1].


logan-markewich commented 1 month ago

@ferdeleong you can customize the prompt like so (this is the default)

extract_prompt = (
    "Extract up to {max_knowledge_triplets} knowledge triplets from the given text. "
    "Each triplet should be in the form of (head, relation, tail) with their respective types.\n"
    "---------------------\n"
    "INITIAL ONTOLOGY:\n"
    "Entity Types: {allowed_entity_types}\n"
    "Relation Types: {allowed_relation_types}\n"
    "\n"
    "Use these types as a starting point, but introduce new types if necessary based on the context.\n"
    "\n"
    "GUIDELINES:\n"
    "- Output in JSON format: [{{'head': '', 'head_type': '', 'relation': '', 'tail': '', 'tail_type': ''}}]\n"
    "- Use the most complete form for entities (e.g., 'United States of America' instead of 'USA')\n"
    "- Keep entities concise (3-5 words max)\n"
    "- Break down complex phrases into multiple triplets\n"
    "- Ensure the knowledge graph is coherent and easily understandable\n"
    "---------------------\n"
    "EXAMPLE:\n"
    "Text: Tim Cook, CEO of Apple Inc., announced the new Apple Watch that monitors heart health. "
    "UC Berkeley researchers studied the benefits of apples.\n"
    "Output:\n"
    "[{{'head': 'Tim Cook', 'head_type': 'PERSON', 'relation': 'CEO_OF', 'tail': 'Apple Inc.', 'tail_type': 'COMPANY'}},\n"
    " {{'head': 'Apple Inc.', 'head_type': 'COMPANY', 'relation': 'PRODUCES', 'tail': 'Apple Watch', 'tail_type': 'PRODUCT'}},\n"
    " {{'head': 'Apple Watch', 'head_type': 'PRODUCT', 'relation': 'MONITORS', 'tail': 'heart health', 'tail_type': 'HEALTH_METRIC'}},\n"
    " {{'head': 'UC Berkeley', 'head_type': 'UNIVERSITY', 'relation': 'STUDIES', 'tail': 'benefits of apples', 'tail_type': 'RESEARCH_TOPIC'}}]\n"
    "---------------------\n"
    "Text: {text}\n"
    "Output:\n"
)

extractor = DynamicLLMPathExtractor(..., extract_prompt=extract_prompt)
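One detail in the default prompt above that is easy to trip over: the JSON examples use doubled braces ({{ }}) so that str.format-style substitution treats them as literal braces rather than placeholders. A quick stdlib check of that behavior:

```python
# Doubled braces survive .format() as single literal braces,
# while {text} is substituted as usual.
tmpl = "Output: [{{'head': ''}}]\nText: {text}\n"
filled = tmpl.format(text="Tim Cook leads Apple.")
print(filled)
# → Output: [{'head': ''}]
#   Text: Tim Cook leads Apple.
```

If you write the JSON examples with single braces in a custom prompt, formatting will raise a KeyError (or IndexError) when the extractor tries to fill the template.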
ferdeleong commented 1 month ago

Thank you for answering, this is all that I've tried:

from llama_index.core import Document, PropertyGraphIndex
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor
from llama_index.llms.openai import OpenAI

def extract_entities_relationships(text):
    llm = OpenAI(temperature=0, model="gpt-4o-mini")
    document = Document(text=text, metadata={"title": "Amenities Content"})

    kg_extractor = DynamicLLMPathExtractor(
            llm=llm,
            max_triplets_per_chunk=4,
            num_workers=16,
            allowed_entity_types=None,
            allowed_relation_types=None,
            allowed_relation_props=[],
            allowed_entity_props=[], 
    )

    index = PropertyGraphIndex.from_documents(
            [document],
            llm=llm,
            embed_kg_nodes=False,
            kg_extractors=[kg_extractor],
            show_progress=True,
    )

And that works perfectly fine

(screenshot)

Then I introduce the following:

extract_prompt = (
    "Extract up to {max_knowledge_triplets} knowledge triplets from the given text. "
    "Each triplet should be in the form of (head, relation, tail) with their respective types.\n"
    "---------------------\n"
    "INITIAL ONTOLOGY:\n"
    "Entity Types: {allowed_entity_types}\n"
    "Relation Types: {allowed_relation_types}\n"
    "\n"
    "Use these types as a starting point, but introduce new types if necessary based on the context.\n"
    "\n"
    "GUIDELINES:\n"
    "- Output in JSON format: [{{'head': '', 'head_type': '', 'relation': '', 'tail': '', 'tail_type': ''}}]\n"
    "- Use the most complete form for entities (e.g., 'United States of America' instead of 'USA')\n"
    "- Keep entities concise (3-5 words max)\n"
    "- Break down complex phrases into multiple triplets\n"
    "- Ensure the knowledge graph is coherent and easily understandable\n"
    "---------------------\n"
    "EXAMPLE:\n"
    "Text: Tim Cook, CEO of Apple Inc., announced the new Apple Watch that monitors heart health. "
    "UC Berkeley researchers studied the benefits of apples.\n"
    "Output:\n"
    "[{{'head': 'Tim Cook', 'head_type': 'PERSON', 'relation': 'CEO_OF', 'tail': 'Apple Inc.', 'tail_type': 'COMPANY'}},\n"
    " {{'head': 'Apple Inc.', 'head_type': 'COMPANY', 'relation': 'PRODUCES', 'tail': 'Apple Watch', 'tail_type': 'PRODUCT'}},\n"
    " {{'head': 'Apple Watch', 'head_type': 'PRODUCT', 'relation': 'MONITORS', 'tail': 'heart health', 'tail_type': 'HEALTH_METRIC'}},\n"
    " {{'head': 'UC Berkeley', 'head_type': 'UNIVERSITY', 'relation': 'STUDIES', 'tail': 'benefits of apples', 'tail_type': 'RESEARCH_TOPIC'}}]\n"
    "---------------------\n"
    "Text: {text}\n"
    "Output:\n"
)
def extract_entities_relationships(text):
    llm = OpenAI(temperature=0, model="gpt-4o-mini")
    document = Document(text=text, metadata={"title": "Amenities Content"})

    kg_extractor = DynamicLLMPathExtractor(
            llm=llm,
            max_triplets_per_chunk=4,
            extract_prompt=extract_prompt, #  NEW LINE
            num_workers=16,
            allowed_entity_types=None,
            allowed_relation_types=None,
            allowed_relation_props=[],
            allowed_entity_props=[], 
    )

    index = PropertyGraphIndex.from_documents(
            [document],
            llm=llm,
            embed_kg_nodes=False,
            kg_extractors=[kg_extractor],
            show_progress=True,
    )

And then it stops working

(screenshot)

Then I tried the same prompt, but with an explicit entity type and relation type instead of None:

def extract_entities_relationships(text):
    llm = OpenAI(temperature=0, model="gpt-4o-mini")
    document = Document(text=text, metadata={"title": "Amenities Content"})

    kg_extractor = DynamicLLMPathExtractor(
            llm=llm,
            max_triplets_per_chunk=4,
            extract_prompt=extract_prompt,
            num_workers=16,
            allowed_entity_types=["RESTAURANT"],  #  NEW LINE
            allowed_relation_types=["SIMILAR_TO"],  #  NEW LINE
            allowed_relation_props=[],
            allowed_entity_props=[], 
    )

    index = PropertyGraphIndex.from_documents(
            [document],
            llm=llm,
            embed_kg_nodes=False,
            kg_extractors=[kg_extractor],
            show_progress=True,
    )

(screenshot)

Then I tried what the bot suggested:

extract_prompt = PromptTemplate( # NEW LINE
    "Extract up to {max_knowledge_triplets} knowledge triplets from the given text. "
    "Each triplet should be in the form of (head, relation, tail) with their respective types.\n"
    "---------------------\n"
    "INITIAL ONTOLOGY:\n"
    "Entity Types: {allowed_entity_types}\n"
    "Relation Types: {allowed_relation_types}\n"
    "\n"
    "Use these types as a starting point, but introduce new types if necessary based on the context.\n"
    "\n"
    "GUIDELINES:\n"
    "- Output in JSON format: [{{'head': '', 'head_type': '', 'relation': '', 'tail': '', 'tail_type': ''}}]\n"
    "- Use the most complete form for entities (e.g., 'United States of America' instead of 'USA')\n"
    "- Keep entities concise (3-5 words max)\n"
    "- Break down complex phrases into multiple triplets\n"
    "- Ensure the knowledge graph is coherent and easily understandable\n"
    "---------------------\n"
    "EXAMPLE:\n"
    "Text: Tim Cook, CEO of Apple Inc., announced the new Apple Watch that monitors heart health. "
    "UC Berkeley researchers studied the benefits of apples.\n"
    "Output:\n"
    "[{{'head': 'Tim Cook', 'head_type': 'PERSON', 'relation': 'CEO_OF', 'tail': 'Apple Inc.', 'tail_type': 'COMPANY'}},\n"
    " {{'head': 'Apple Inc.', 'head_type': 'COMPANY', 'relation': 'PRODUCES', 'tail': 'Apple Watch', 'tail_type': 'PRODUCT'}},\n"
    " {{'head': 'Apple Watch', 'head_type': 'PRODUCT', 'relation': 'MONITORS', 'tail': 'heart health', 'tail_type': 'HEALTH_METRIC'}},\n"
    " {{'head': 'UC Berkeley', 'head_type': 'UNIVERSITY', 'relation': 'STUDIES', 'tail': 'benefits of apples', 'tail_type': 'RESEARCH_TOPIC'}}]\n"
    "---------------------\n"
    "Text: {text}\n"
    "Output:\n"
)

Also no success.

Finally, with the same prompt cast to a PromptTemplate, I tried:

max_triplets_per_chunk = 20
formatted_prompt = extract_prompt.format(
    max_knowledge_triplets=max_triplets_per_chunk, text=text
)

kg_extractor = DynamicLLMPathExtractor(..., extract_prompt=formatted_prompt)

But again, no success.
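For what it's worth, one possible failure mode in that last attempt (an assumption on my part, not confirmed from the library source): pre-formatting the template consumes the {text} placeholder, so an extractor that later tries to inject each chunk's text has nowhere to put it. A stdlib illustration of what pre-formatting does to the template:

```python
template = "Extract up to {max_knowledge_triplets} triplets.\nText: {text}\n"

# Pre-formatting bakes one fixed text into the prompt...
pre_formatted = template.format(max_knowledge_triplets=20, text="chunk one")

# ...so no {text} placeholder is left for the extractor to fill per chunk.
assert "{text}" not in pre_formatted
print(pre_formatted)
# → Extract up to 20 triplets.
#   Text: chunk one
```

If that is what's happening, passing the unformatted template (with {text} intact) and letting the extractor do the formatting would be the safer path.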

Any ideas on what may be happening?