run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: SchemaLLMPathExtractor lacks structure for relationship directions #14324

Closed prrao87 closed 4 months ago

prrao87 commented 4 months ago


Hello,

I'm opening this issue as a follow-up to #14259. Because it's a question about generalizing PropertyGraphIndex to support multiple graph stores and to impose more structure on the allowed relationships, I figured this would be a better place for it than Discord.

I'm currently trying to integrate Kùzu, a structured property graph store, with the new PropertyGraphIndex. The PR #14259 that integrates TiDB seems to have some degree of similarity with my tool, in that both require structured tables as prerequisites.

The default arguments for SchemaLLMPathExtractor are not very useful in practice; any realistic scenario requires a custom user-defined schema in which the user specifies possible_entities and possible_relations, along with the following arguments:

from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

schema_path_extractor = SchemaLLMPathExtractor(
    llm=extraction_llm,
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=True,  # if false, will allow triples outside of the schema
)
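For context, the entities, relations, and validation_schema variables in the snippet above aren't shown. Based on the discussion in this thread, the current format keys the validation schema by node label, with no direction information. A minimal sketch of what those inputs might look like (the values here are illustrative assumptions, not llama_index's shipped defaults):

```python
from typing import Literal

# Illustrative domain schema -- a real one would be application-specific.
entities = Literal["PERSON", "ORGANIZATION", "LOCATION"]
relations = Literal["IS_CEO_OF", "IN_LOCATION", "MANAGES"]

# Current (dict-keyed) style: each node label maps to the relations it may
# participate in. Note there is no way to express relationship direction.
validation_schema = {
    "PERSON": ["IS_CEO_OF", "MANAGES"],
    "ORGANIZATION": ["IN_LOCATION"],
    "LOCATION": [],
}
```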

From my perspective, I'm stuck on the lack of structure in the current template schema: the user only specifies which relationships are possible, but what a structured graph store actually requires is that the direction of each relationship be known beforehand.

Rather than going too deep into designing my own Pydantic schema to deal with this, I thought I'd check in with the team here on whether we can come up with a more general solution for structured graph stores. The current approach seems tailored to Neo4j's schema definition style, which is understandably flexible, but many other graph stores (such as TiDB and Kùzu, and perhaps even Nebula) are more structured. It would be far more generalizable if we could provide something like the below during schema definition, so that I can specify the FROM and TO relationship directions in Kùzu beforehand.

from typing import Literal

entities = Literal["PERSON", "ORGANIZATION", "LOCATION"]
relations = Literal[
    "IS_CEO_OF",
    "IN_LOCATION",
    "MANAGES",
]

# Define relationship validation schema as a list of triples (provides direction information)
validation_schema = [
    ("PERSON", "IS_CEO_OF", "ORGANIZATION"),
    ("PERSON", "MANAGES", "PERSON"),
    ("ORGANIZATION", "IN_LOCATION", "LOCATION"),
]
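With direction encoded in the triples, post-extraction validation reduces to a membership check. A rough sketch of how a structured store could reject off-schema triples (prune_triples is a hypothetical helper, not part of llama_index):

```python
# Direction-aware schema: each entry is an allowed (source, relation, target).
validation_schema = [
    ("PERSON", "IS_CEO_OF", "ORGANIZATION"),
    ("PERSON", "MANAGES", "PERSON"),
    ("ORGANIZATION", "IN_LOCATION", "LOCATION"),
]

def prune_triples(triples, schema):
    """Keep only triples whose (source, relation, target) is in the schema."""
    allowed = set(schema)
    return [t for t in triples if t in allowed]

extracted = [
    ("PERSON", "IS_CEO_OF", "ORGANIZATION"),  # valid direction, kept
    ("ORGANIZATION", "IS_CEO_OF", "PERSON"),  # reversed direction, dropped
]
print(prune_triples(extracted, validation_schema))
# -> [('PERSON', 'IS_CEO_OF', 'ORGANIZATION')]
```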

Looking forward to learning more from the maintainers, or from others in the community on this. Thanks in advance!

EDIT: Changed the example to a list of triples per Logan's comment.

logan-markewich commented 4 months ago

Yeah, I agree, but it also seems like a pretty easy change to make.

I think your example is almost there. It probably needs to be a list of dicts.

prrao87 commented 4 months ago

Thanks! I updated the example. But I'm open to other ideas on how to constrain the relationships in the validation schema. Should the key be the relationship name, or should it remain the node label (as it is currently), with the added ability to specify the from and to directions?

logan-markewich commented 4 months ago

Personally I think having the key as the relationship type makes sense 🤔 It could also just be a straight list of triples, to be honest, which would be easier to define: [(LABEL1, REL1, LABEL2), ...]

prrao87 commented 4 months ago

I think I agree too: the relationships are easily interpretable as a list of triples. Even if multiple node label pairs share the same kind of relationship, it's still better to state the triples explicitly and avoid nested JSON, for readability.

logan-markewich commented 4 months ago

I'm on board with this change. Happy to merge if you open a PR. Just need to update any docs + this specific extractor.

I think, just for backwards compatibility's sake, we might need to detect the old format. But that's not too bad.
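Detecting the old format at the extractor boundary could be as simple as an isinstance check. A sketch of one possible approach (normalize_schema is hypothetical, not the actual implementation):

```python
def normalize_schema(kg_validation_schema):
    """Accept both the old dict format and the new list-of-triples format.

    Old style (assumed): {"PERSON": ["IS_CEO_OF"], ...} -- no direction info.
    New style: [("PERSON", "IS_CEO_OF", "ORGANIZATION"), ...] -- directed.
    """
    if isinstance(kg_validation_schema, dict):
        # Old format: expand each label's relations into triples. Direction
        # is unknown, so the target side stays unconstrained (None).
        return [
            (label, rel, None)
            for label, rels in kg_validation_schema.items()
            for rel in rels
        ]
    return list(kg_validation_schema)
```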

prrao87 commented 4 months ago

Will get on it and send two PRs: one for the extractor itself and another for the Kùzu integration. Will keep you posted. Thanks!

mphipps2 commented 4 months ago

But the validation schema itself is only used to prune invalid triples after generation, right? So I think the better, more general solution @prrao87 was getting at is to update this schema in the function call itself.

If I'm understanding correctly, just updating the validation schema will lead to pretty aggressive pruning and risks missing important relationships (depending on how large you set max_triplets_per_chunk).

prrao87 commented 4 months ago

just updating the validation schema will lead to pretty aggressive pruning and risks missing important relationships (depending how large you set max_triplets_per_chunk)

Could you give an example of how important relationships can be missed? I'm testing this on a variety of cases, and it seems to capture them all with max_triplets_per_chunk=10.

mphipps2 commented 4 months ago

Maybe some of this is me not having tuned the entities and relationships much, but at the moment I have about 15 different entities and 10 different relations. Without defining strict from/to conditions, I end up with triples like ("Person", "attended", "client") rather than ("Person", "attended", "meeting"). So this is something that would be pruned, when what I really want is for it to be included, but as ("Person", "attended", "meeting").

prrao87 commented 4 months ago

@mphipps2 could you take a look at my current progress in #14357? I'm seeing decent results in cases like yours where an invalid or missing relationship is correctly skipped. If there are any edge cases there, please feel free to chime in.

prrao87 commented 4 months ago

So this is something that would be pruned when what I really want is to have it included but as ("Person", "attended", "meeting")

I think this is a result of the LLM extraction rather than the schema constraints. In the example ("Person", "attended", "client"), that's an invalid triple extracted by the LLM, and schema validation can't help you much there.

mphipps2 commented 4 months ago

That was my point, though. Testing with your updates now, but ideally we'd be passing the from/to constraints directly to the LLM.
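One way to surface the from/to constraints to the LLM at extraction time, rather than only pruning afterwards, would be to render the triple schema into the extraction prompt. A hypothetical sketch (schema_to_prompt is not an existing llama_index API):

```python
validation_schema = [
    ("PERSON", "IS_CEO_OF", "ORGANIZATION"),
    ("PERSON", "MANAGES", "PERSON"),
    ("ORGANIZATION", "IN_LOCATION", "LOCATION"),
]

def schema_to_prompt(schema):
    # Render each allowed (source, relation, target) pattern as one line the
    # LLM can follow, so invalid directions are discouraged before extraction.
    lines = [f"({src})-[{rel}]->({dst})" for src, rel, dst in schema]
    return "Only extract triples matching these patterns:\n" + "\n".join(lines)

print(schema_to_prompt(validation_schema))
```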

prrao87 commented 4 months ago

I think that's a larger issue outside the scope of what I raised here. But it's worth revisiting for @logan-markewich and any others who may have ideas on this in the future.