monarch-initiative / koza

Data transformation framework for LinkML data models
https://koza.monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Add support for multiple writers #140

Open kevinschaper opened 3 months ago

hrshdhgd commented 3 months ago

OK, so here's my first stab. I split nodes by node_category and edges by subject_category + edge_category + object_category.

Here's a unit test to run to see what's going on: https://github.com/monarch-initiative/koza/blob/4e0ec24e8d27a1fff2fbc965e758efb49a07e550/tests/unit/test_tsvwriter_node_and_edge.py#L39-L107

So I added a flag parameter split: bool (default False) to the write() function.

When this flag is True, writing the entities will generate six files:

I deliberately didn't provide subject and object categories in some examples, just to show what the splits would look like. This would (hopefully) encourage KG builders to abide by a standard (Biolink) when categorizing their nodes. We could enforce usage of Biolink categories (maybe via pydantic?) but I'm not sure if we want to do that.
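
A minimal sketch of that grouping logic, for illustration only (this is not the actual TSVWriter implementation; KGX-style node/edge dicts and the file-name pattern are assumptions):

import csv
from collections import defaultdict

def write_split_tsvs(nodes, edges, outdir, source="example"):
    # Group nodes by node category, and edges by
    # subject_category + edge category + object_category.
    node_groups = defaultdict(list)
    for node in nodes:
        node_groups[node.get("category", "unknown")].append(node)

    edge_groups = defaultdict(list)
    for edge in edges:
        key = "_".join(
            edge.get(field, "unknown")
            for field in ("subject_category", "category", "object_category")
        )
        edge_groups[key].append(edge)

    # One file per group; the naming pattern here is illustrative.
    for label, rows in node_groups.items():
        write_tsv(f"{outdir}/{source}_{label}_nodes.tsv", rows)
    for label, rows in edge_groups.items():
        write_tsv(f"{outdir}/{source}_{label}_edges.tsv", rows)

def write_tsv(path, rows):
    fields = sorted({key for row in rows for key in row})
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)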

Also not quite sure how to implement this in the JSONWriter but we'll worry about it once we finalize this.

Thoughts?

cc: @kevinschaper @justaddcoffee @caufieldjh @DnlRKorn @sierra-moxon @amc-corey-cox

kevinschaper commented 3 months ago

I was thinking more generic for this feature, something like:

a = Association(...)
koza_app.write(a)
if row["score"] > 0.8:
    koza_app.write(a, name="filtered")

and then it would produce something like string_edges.tsv and string_filtered_edges.tsv, which would allow the most control for things like filtering or splitting on columns that aren't necessarily represented in the output columns.

Which would also allow for

a = Association(...)
koza_app.write(a)
koza_app.write(a, name=a.subject_taxon)
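
(A sketch of how this could work under the hood: one output file per subset name, created lazily on first write. None of this is existing koza code; the names are hypothetical, and it handles edges only for brevity.)

class MultiWriter:
    def __init__(self, ingest_name, outdir="output"):
        self.ingest_name = ingest_name
        self.outdir = outdir
        self._files = {}  # subset name (or None) -> open file handle

    def write(self, row, name=None):
        # row is a flat dict of output columns
        fh = self._files.get(name)
        if fh is None:
            stem = self.ingest_name if name is None else f"{self.ingest_name}_{name}"
            fh = open(f"{self.outdir}/{stem}_edges.tsv", "w")
            fh.write("\t".join(row) + "\n")  # header from the first row's keys
            self._files[name] = fh
        fh.write("\t".join(str(v) for v in row.values()) + "\n")

    def close(self):
        for fh in self._files.values():
            fh.close()
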
hrshdhgd commented 3 months ago

if row["score"] > 0.8: koza_app.write(a, name="filtered")

This is assuming there is a score. How/where is this score calculated?

Sorry in advance, I don't follow the string_* concept. Could you help me with an example? Maybe some test code explaining/demonstrating it?

caufieldjh commented 3 months ago

In this example with STRING the score is provided with the data, e.g.:

protein1 protein2 combined_score
493.BWD07_00005 493.BWD07_05105 227
493.BWD07_00005 493.BWD07_03880 221
493.BWD07_00005 493.BWD07_08685 317
493.BWD07_00005 493.BWD07_05905 232
493.BWD07_00005 493.BWD07_06110 174
493.BWD07_00005 493.BWD07_02170 451
493.BWD07_00005 493.BWD07_07175 150
493.BWD07_00005 493.BWD07_01790 161
493.BWD07_00005 493.BWD07_05145 168

where each row is a single protein-protein interaction pair. In current practice, we'd just include that filter and the only associations written to the Koza output would be the filtered ones. But if we wanted to retain both the filtered and unfiltered interactions, as per Kevin's example above, Koza would just treat those as different subsets with their own outputs.
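
Putting the two together, a transform sketch using the proposed API (the Association slots, input file name, and 800 cutoff are assumptions; koza_app.write(..., name=...) is the feature under discussion, not current koza):

import csv

with open("protein.links.txt") as fh:
    for row in csv.DictReader(fh, delimiter=" "):
        a = Association(
            subject=row["protein1"],
            object=row["protein2"],
        )
        koza_app.write(a)  # every interaction goes to the main output
        if int(row["combined_score"]) > 800:  # raw STRING scores run 0-1000
            koza_app.write(a, name="filtered")  # high-confidence subset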

kevinschaper commented 3 months ago

Right, in the transform Python for an ingest, you'd be able to call koza_app.write(a, name="any_string"), and that subset of output entities would be written to a file with "_any_string" inserted between the ingest name and _edges.tsv or _nodes.tsv.
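
That naming rule as a tiny sketch (the function name is made up):

def subset_filename(ingest_name, kind, name=None):
    # kind is "nodes" or "edges"; name is the optional subset label
    middle = f"_{name}" if name else ""
    return f"{ingest_name}{middle}_{kind}.tsv"

subset_filename("string", "edges")              # -> "string_edges.tsv"
subset_filename("string", "edges", "filtered")  # -> "string_filtered_edges.tsv"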

hrshdhgd commented 3 months ago

Sorry ... still confused, and pardon my ignorance ... I have a few questions. I may have misunderstood this whole concept.

If this is the case:

  • how does this make an ingest modular?
  • Do we predict these if conditions to break out the KGs into components?
  • Expect the user to make that call and introduce the 'if' condition as per their needs and rerun the transform?

The bigger question:

  • Are we providing the component KGs or expecting the user to generate them as per their requirement?

caufieldjh commented 3 months ago

No worries, I think we're still figuring out many of the details of how this whole system can/should work.

  • how does this make an ingest modular?

Many of our ingests are essentially one to many: one data file, many different types of entities and relationships. Our KGs rarely need to include all of these potential components, and that's partially because we already do the work of modeling everything as nodes vs. edges, so that eliminates a whole bunch of other ways we could be modeling the data (e.g., I could try to model everything in a pure RDF approach and make everything a triple - not an invalid approach, but not what we're doing). For a KG like Monarch there's also the assumption that node data will have a single source of truth, but not every KG works that way; some may merge node properties from multiple sources. So if we have a way to separate ingested data based on its component parts, we have a way to produce reusable data modules.

  • Do we predict these if conditions to break out the KGs into components?

We can't predict them all, but we can make it as easy as possible to modify existing transforms. For Koza's purposes, this just means supporting very generic splits; then it's just a matter of having the "core" transform be the broadest possible interpretation of the data. (I think Koza already assumes this, because if I fail to include one of the column names in an input file within the transform config, it raises an error.)

  • Expect the user to make that call and introduce the 'if' condition as per their needs and rerun the transform?

Yes indeed

  • Are we providing the component KGs or expecting the user to generate them as per their requirement?

Current plans are to provide the component nodes and edges along with their transform module, so if the parts already work for a given use case, they can just be used as-is, no changes needed. The user would still be expected to do the final merge.

realmarcin commented 3 months ago

Catching up here ... an ingest case that requires numeric data interpretation, like STRING, is probably not the best example to start with, because it will involve either 1) ingesting all the data, which in the end is not usable without further work and decisions, or 2) an arbitrary slice of the data that will be difficult to agree on.

But what I wanted to ask first was whether the nascent strategy above introduces an extra transform step. That is, say a CHEBI transform exists and we just want the 'antiviral' subset of CHEBI: one would grab the bulk CHEBI transform from the right repo and then need extra steps to filter/subset that source for a specific KG project. Am I interpreting this correctly? I know CHEBI is also not a great example because reference ontology transforms already exist in KG-OBO. We have a similar case with subsetting the NCBITaxonomy.

A better example to talk about would be BacDive -- a rich, complex, mostly standardized source. We have been working on ingesting various aspects of this dataset over the last year or so and are about 70% done. But due to the breadth and complexity of the data, there are also other analyses, interpretations, and augmentations of BacDive that we've ingested. In the end, getting 100% of this data ingested is a huge lift and even out of scope. So I wanted to throw this example into the mix, perhaps as a bit of an edge case. How could a partial ingest of a valuable data source live in this new modular universe? It seems the wrong direction to prevent ingestion of a source because 100% of it is not available...

One solution to the out of scope ingest could be to somehow represent the data selection and modeling decisions in a machine readable way. I think this is going to be an important piece of the modularity ... the transparency side of it to help make ingest/selection decisions.

kevinschaper commented 3 months ago

I think a benefit that we still get from splitting things into a single repo per source (or per file from a source) is that even if we subset for practicality, all of the machinery is in place to produce alternate subsets or to expand to parts of a file/source that were initially passed over.

An example I have is the Alliance disease association ingest, which includes non-human gene to human disease associations that are inferred via orthology. I don't want those edges in monarch-kg badly enough to figure out how to model them in Biolink, so I'm excluding them, but if somebody needs them and wants to sort out the modeling, it's just a small PR to an existing repo.

I would love for koza to have an all-declarative mode using linkml-map syntax, so that transforms that don't actually need custom Python logic can just be expressed in YAML, which would make them naturally machine-readable. Maybe KGX-stats-like metadata about each file would be a good way to document things descriptively, though. Our minimal start on that in our cookiecutter was a little report TSV for nodes and edges, giving counts by category, taxon, etc.
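
For reference, a minimal sketch of that kind of descriptive report (column names are assumptions):

import csv
from collections import Counter

def category_report(nodes_tsv, report_tsv):
    # Count nodes per category and write a small report table.
    counts = Counter()
    with open(nodes_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[row.get("category", "unknown")] += 1
    with open(report_tsv, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["category", "count"])
        writer.writerows(counts.most_common())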

cmungall commented 2 months ago

I like the

koza_app.write(a, name="filtered")

idea.

Probably want it to be a list, since any edge (or node) could be part of multiple modules/subsets:

koza_app.write(a, name=["filtered", ...])
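
A sketch of how the writer side might normalize that (the signature and _writer_for are hypothetical):

def write(self, entity, name=None):
    # Accept a single label or a list; one entity can belong to several subsets.
    names = [name] if (name is None or isinstance(name, str)) else list(name)
    for subset in names:
        self._writer_for(subset).write(entity)  # _writer_for: hypothetical lookup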