monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Refactor Koza writer to remove KGX assumptions #818

Open kevinschaper opened 1 month ago

kevinschaper commented 1 month ago

As an opportunity to move past the implicit Biolink + KGX format assumptions of the current Koza writers, and as a way to support writing multiple output files from a single ingest, I think we should define a new writer configuration based on LinkML models. Supplying a schema and a list of classes to a writer, along with an explicit output filename, would let the output columns be determined dynamically and in a model-agnostic way (a longer-term Koza goal). It would also be less brittle than the current listing of node and edge properties, where a property that is set in the Python code but left out of the node/edge properties is never written to the file.

One challenge is how to specify the schema. The two initial use cases I'm imagining are writing Biolink node or association classes and writing SSSOM associations. In both cases it probably makes the most sense to load the model YAML from an installed package through importlib, so my initial specification is {package}:{model.yaml}, which for our standard Koza STRINGDB example looks like:

writers:
  "nodes":
    filename: 'protein_links_nodes.tsv'
    linkml_schema: 'biolink_model.schema:biolink_model.yaml'
    classes:
      - 'Gene'
  "edges":
    filename: 'protein_links_edges.tsv'
    linkml_schema: 'biolink_model.schema:biolink_model.yaml'
    classes:
      - 'PairwiseGeneToGeneInteraction'
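
For illustration, here is a minimal sketch of how a {package}:{file} value could be resolved with the standard library. The function names `parse_schema_spec` and `read_schema_text` are hypothetical and not part of Koza; the sketch assumes Python 3.9+ and that the schema file ships as package data inside an importable package.

```python
from importlib import resources


def parse_schema_spec(spec: str) -> tuple[str, str]:
    """Split a '{package}:{file}' string into (package, file)."""
    package, sep, resource = spec.partition(":")
    if not sep or not package or not resource:
        raise ValueError(f"expected '{{package}}:{{file}}', got {spec!r}")
    return package, resource


def read_schema_text(spec: str) -> str:
    """Return the text of a schema file bundled with an installed package."""
    package, resource = parse_schema_spec(spec)
    # resources.files() imports the package and returns a Traversable
    # that can be joined with the name of a data file inside it.
    return resources.files(package).joinpath(resource).read_text(encoding="utf-8")
```

The returned text could then be handed to whichever LinkML loader the writer ends up using; that step is left out here because it depends on the final design.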

The Python part of the Koza transform would then change from:

koza.write(gene_a, gene_b, association)

to the slightly more verbose, but more specific:

koza.write(gene_a, gene_b, writer="nodes")
koza.write(association, writer="edges")
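
To make the routing idea concrete, here is a rough sketch of a registry of named writers, each with its own output and its own columns. Everything in it is an assumption for illustration: `NamedWriters` is a hypothetical class, and the two dataclasses are simplified stand-ins for the model classes, with columns taken from their fields rather than from a LinkML schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class Gene:  # simplified stand-in, not the real Biolink class
    id: str
    name: Optional[str] = None


@dataclass
class PairwiseGeneToGeneInteraction:  # simplified stand-in
    id: str
    subject: str
    predicate: str
    object: str


class NamedWriters:
    """Route entities to one of several TSV outputs by writer name."""

    def __init__(self):
        self._writers = {}
        self._classes = {}

    def add(self, name, classes, out):
        # Columns are the union of the classes' fields, in declaration order.
        columns = []
        for cls in classes:
            for f in fields(cls):
                if f.name not in columns:
                    columns.append(f.name)
        w = csv.DictWriter(out, fieldnames=columns, delimiter="\t", lineterminator="\n")
        w.writeheader()
        self._writers[name] = w
        self._classes[name] = tuple(classes)

    def write(self, *entities, writer):
        if writer not in self._writers:
            raise KeyError(f"no writer named {writer!r}")
        for entity in entities:
            # Reject entities whose class was not declared for this writer.
            if not isinstance(entity, self._classes[writer]):
                raise TypeError(f"{type(entity).__name__} is not accepted by {writer!r}")
            self._writers[writer].writerow(asdict(entity))


nodes_out, edges_out = io.StringIO(), io.StringIO()
writers = NamedWriters()
writers.add("nodes", [Gene], nodes_out)
writers.add("edges", [PairwiseGeneToGeneInteraction], edges_out)

gene_a = Gene(id="EX:gene_a", name="gene A")
gene_b = Gene(id="EX:gene_b")
association = PairwiseGeneToGeneInteraction(
    id="EX:assoc_1", subject=gene_a.id, predicate="biolink:interacts_with", object=gene_b.id
)
writers.write(gene_a, gene_b, writer="nodes")
writers.write(association, writer="edges")
```

Because each writer knows which classes it accepts, sending an association to "nodes" fails loudly instead of silently producing a row with missing columns.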

Note: I may back away from basing this entirely on LinkML, even though that's the big win. There have been times when we wanted to export an additional file, maybe for debugging or QC purposes, and in those cases it would be nice to have the option of just specifying a list of columns.
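
One way the two options could coexist, sketched under assumptions: a writer entry carries either a `columns` list or a `classes` list, and the columns are resolved accordingly. The `columns` key, the "exactly one of" rule, and the `slots_for_class` lookup are all hypothetical; in a real implementation the lookup would be backed by the LinkML schema.

```python
from typing import Callable, Iterable, Mapping


def resolve_columns(
    writer_config: Mapping,
    slots_for_class: Callable[[str], Iterable[str]],
) -> list[str]:
    """Return output columns from an explicit list or from model classes."""
    has_columns = "columns" in writer_config
    has_classes = "classes" in writer_config
    if has_columns == has_classes:
        # Assumed rule for this sketch: the two options are mutually exclusive.
        raise ValueError("a writer needs exactly one of 'columns' or 'classes'")
    if has_columns:
        return list(writer_config["columns"])
    columns = []
    for class_name in writer_config["classes"]:
        for slot in slots_for_class(class_name):
            if slot not in columns:  # keep first occurrence, preserve order
                columns.append(slot)
    return columns
```

A QC writer would then be configured with only a filename and a `columns` list, without touching the schema at all.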

sagehrke commented 1 week ago

related to: https://github.com/monarch-initiative/koza/issues/149