Open kevinschaper opened 2 years ago
I actually ran into trouble from this recently, definitely something we should fix.
This turns out to be kind of a pain because of how biolink pydantic dataclasses work.
One option we considered was running a check in the writer at each row:

    if not set(record.keys()).issubset(self.node_properties):
        missing = set(record.keys()).difference(self.node_properties)
        raise ValueError("Node properties missing from config: " + ", ".join(missing))
    ...
    if not set(record.keys()).issubset(self.edge_properties):
        missing = set(record.keys()).difference(self.edge_properties)
        raise ValueError("Edge properties missing from config: " + ", ".join(missing))

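but this may perform poorly, since it runs on every row. It may also simply not work, because biolink dataclasses are instantiated with every attribute set, either to None or to a value, so the key set would presumably include attributes the transform never populated.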
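To make the concern about None-valued attributes concrete, here is a minimal sketch. It uses a stdlib dataclass as a stand-in for a biolink pydantic dataclass, so `Gene` and `unlisted_properties` are hypothetical names and the real classes may behave differently. It compares a check on all keys with a check on only the keys that carry a value.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Gene:
    """Hypothetical stand-in for a biolink model class (assumption)."""
    id: str
    name: Optional[str] = None
    symbol: Optional[str] = None


def unlisted_properties(record: dict, allowed: list) -> set:
    """Return keys that carry a value but are not in the configured list."""
    populated = {key for key, value in record.items() if value is not None}
    return populated - set(allowed)


node_properties = ["id", "name"]  # as it might appear in the source config
record = asdict(Gene(id="HGNC:5", name="A1BG"))

# Check on all keys: every declared attribute is a key, so the unused
# 'symbol' attribute is flagged even though the transform never set it.
naive = set(record.keys()) - set(node_properties)

# Check on populated keys only: nothing is flagged for this record.
filtered = unlisted_properties(record, node_properties)
```

If the real dataclasses do behave this way, ignoring None values before comparing keys would avoid false alarms while still catching a property that is set but not configured.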
I decided to move a bit of this functionality up into the abstract KozaWriter class so the check happens at a higher level. I kind of like the result, but it may not be the best approach. Please take a look at what I've done in #154.
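For readers who want a feel for what a base-class check could look like, here is a rough sketch. It is not the code in #154 and not the actual KozaWriter API: `CheckedWriter`, `ListWriter`, `_check`, and `_emit` are all hypothetical names. It also caches key sets that have already passed, which is one possible answer to the per-row performance worry.

```python
from abc import ABC, abstractmethod


class CheckedWriter(ABC):
    """Hypothetical base writer that validates property names before writing."""

    def __init__(self, node_properties, edge_properties=()):
        self.node_properties = frozenset(node_properties)
        self.edge_properties = frozenset(edge_properties)
        self._validated = set()  # (kind, key set) pairs that already passed

    def write_node(self, record: dict) -> None:
        self._check("Node", record, self.node_properties)
        self._emit("node", record)

    def write_edge(self, record: dict) -> None:
        self._check("Edge", record, self.edge_properties)
        self._emit("edge", record)

    def _check(self, kind: str, record: dict, allowed: frozenset) -> None:
        # Only keys that carry a value count as "used" by the transform.
        keys = frozenset(k for k, v in record.items() if v is not None)
        if (kind, keys) in self._validated:
            return  # this exact key set was already checked
        unlisted = keys - allowed
        if unlisted:
            raise ValueError(
                f"{kind} properties missing from config: " + ", ".join(sorted(unlisted))
            )
        self._validated.add((kind, keys))

    @abstractmethod
    def _emit(self, kind: str, record: dict) -> None:
        """Format-specific output, implemented by each concrete writer."""


class ListWriter(CheckedWriter):
    """Toy concrete writer that collects rows in memory."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.rows = []

    def _emit(self, kind: str, record: dict) -> None:
        self.rows.append((kind, record))
```

With this shape, each concrete writer only implements `_emit`, and the property check lives in one place.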
It's really easy to have a mismatch between the edge or node properties defined in source config yaml and what properties are actually written in the transform code.
Having a property defined that isn't used isn't really a big deal, but using a property that wasn't defined results in a silent failure of that property just being omitted.
Instead of silent omission, the writer code should raise an error after converting to a dictionary if any of the keys isn't listed in the output properties.
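The difference between silent omission and a raised error can be illustrated with the standard library's `csv.DictWriter`, whose `extrasaction` parameter switches between the two behaviors. This is only an analogy: whether the Koza writers use `DictWriter` internally is not stated here, and the record and property list below are made up for the example.

```python
import csv
import io

node_properties = ["id", "name"]  # properties listed in the (example) config
record = {"id": "HGNC:5", "name": "A1BG", "symbol": "A1BG"}  # 'symbol' is not listed

# Silent omission: the unlisted 'symbol' value is dropped without any warning.
quiet = io.StringIO()
csv.DictWriter(
    quiet, fieldnames=node_properties, delimiter="\t", extrasaction="ignore"
).writerow(record)

# Strict mode: the same record raises ValueError naming the unlisted key.
strict = io.StringIO()
error = None
try:
    csv.DictWriter(
        strict, fieldnames=node_properties, delimiter="\t", extrasaction="raise"
    ).writerow(record)
except ValueError as exc:
    error = str(exc)
```

The strict behavior is what this issue asks for: the mismatch surfaces at write time instead of showing up later as a missing column.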