Closed prrao87 closed 4 months ago
Yea i agree, but also seems like a pretty easy change to make.
I think your example is almost there. It probably needs to be a list of dict
Thanks! I updated the example. But I'm open to other ideas in terms of how to constrain the relationships in the validation schema. Should the key be the relationship name, or should it be the node label (as it is currently) - but with the added ability to specify the from
and to
directions?
Personally I think having the key as the relationship type make sense 🤔 It could also just be a straight list of triples too tbh, which would be easier to define: [(LABEL1, REL1, LABEL2), ...]
I think I agree too - the relationships are easily interpretable as a list of triples. Even if there are multiple node label pairs with the same kind of relationship, it's still better to explicitly state the triples and avoid nested JSON for readability reasons.
I'm on board with this change. Happy to merge if you open a PR. Just need to update any docs + this specific extractor.
I think just for backwards compatibility sake, we might need to detect the old format. But, not too bad.
Will get on it and send two PRs - one for the extractor itself and another one for the Kùzu integreation. Will keep you posted. Thanks!
But the validation schema itself is only to prune invalid triplets after generation, right? So I think the better more general solution @prrao87 was getting at is to update this schema in the function call itself.
If I'm understanding correctly, just updating the validation schema will lead to pretty aggressive pruning and risks missing important relationships (depending how large you set max_triplets_per_chunk)
just updating the validation schema will lead to pretty aggressive pruning and risks missing important relationships (depending how large you set max_triplets_per_chunk)
Could you give an example of how important relationships can be missed? I'm testing this on a variety of cases and it seems to be capturing them all with max_triplets_per_chunk=10
.
Maybe some of it is me not having tuned the entities and relationships much, but at the moment I have about 15 different entities with 10 different relations. Without defining strict from/to conditions I end up with things like ("Person", "attended", "client"). Rather than ("Person", "attended", "meeting"). So this is something that would be pruned when what I really want is to have it included but as ("Person", "attended", "meeting")
@mphipps2 could you take a look at my current progress in #14357? I'm seeing decent results in cases like yours where an invalid or missing relationship is correctly skipped. If there are any edge cases there, please feel free to chime in.
So this is something that would be pruned when what I really want is to have it included but as ("Person", "attended", "meeting")
I think this is a case of the LLM extraction and not to do with the schema constraints? In the example ("Person", "attended", "client")
, that's an invalid triple extracted by the LLM and schema validation can't help you much there.
That was my point though. Testing with your updates now but ideally we'd be passing the "from"/"to" constraints directly to the LLM
I think that's a larger issue outside the scope of what I raised here. But it's worth revisiting for @logan-markewich and any others who may have ideas on this in the future.
Question Validation
Question
Hello,
I'm opening this issue in relation to #14259, and because it's a question on larger generalization of
PropertyGraphIndex
to allow multiple graph stores and the imposition of more structure in the allowed relationships, I figured this would be a better place for it than discord.I'm currently trying to integrate Kùzu, a structured property graph store with the new
PropertyGraphIndex
, and the PR #14259 that integrates TiDB seems to have some degree of similarity with my tool in terms of having structured tables as prerequisites.The default arguments for
SchemaLLMPathExtractor
are not very useful and any realistic scenario would require a custom user-defined schema, where the user specifiespossible_entities
andpossible_relations
. Basically, with the following arguments.From my perspective, I'm stuck with the lack of structure in the current template schema, as the user only specifies which relationships are possible, but what a structured graph store actually requires is that the direction of the relationships be known beforehand.
Rather than going too deep into designing my own Pydantic schema to deal with this, I thought I'd chime in here and check in with the team here on whether we can come up with a more general solution for more structured graph stores. The current approach seems too tailored for Neo4j's schema definition style, which is understandably flexible, but a lot of other graph stores (like TiDB and Kùzu, maybe even Nebula) are more structured and it would be far more generalizable if we could provide something like the below during schema definition (so that I can specify the
FROM
andTO
relationship directions in Kùzu beforehand).Looking forward to learning more from the maintainers, or from others in the community on this. Thanks in advance!
EDIT: To make it a list of triples per Logan's comment.