monarch-initiative / koza

Data transformation framework for LinkML data models
https://koza.monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License
48 stars 4 forks source link

Raise an error if an entity property isn't defined in node/edge properties #87

Open kevinschaper opened 2 years ago

kevinschaper commented 2 years ago

It's really easy to have a mismatch between the edge or node properties defined in source config yaml and what properties are actually written in the transform code.

Having a property defined that isn't used isn't really a big deal, but using a property that wasn't defined results in a silent failure of that property just being omitted.

Instead of silent omission, the writer code should raise an error after converting to a dictionary if any of the keys isn't listed in the output properties.

kevinschaper commented 1 year ago

I actually ran into trouble from this recently, definitely something we should fix.

glass-ships commented 1 year ago

This turns out to be kind of a pain because of how biolink pydantic dataclasses work.

One option we considered was running a check in the writer at each row:

if not set(record.keys()).issubset(self.node_properties):
    missing = set(record.keys()).difference(self.node_properties)
    raise ValueError("Node properties missing from config: " + ", ".join(missing))

...
if not set(record.keys()).issubset(self.edge_properties):
    missing = set(record.keys()).difference(self.edge_properties)
    raise ValueError("Edge properties missing from config: " + ", ".join(missing))

but this may be very not performant. It may also just not work, because biolink dataclasses are instantiated with all attributes set, whether to None or a value

amc-corey-cox commented 3 weeks ago

I decided to abstract a bit of the functionality up the abstract KozaWriter class to add this functionality at a higher level. I kind of like what I've done but it may not be the best approach. Please take a look at what I've done at #154