monarch-initiative / koza

Data transformation framework for LinkML data models
https://koza.monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Add in flags which enable creation of nodes grouped by source #138

Open · DnlRKorn opened this issue 3 months ago

DnlRKorn commented 3 months ago

Posting as a result of the KG Construction Crew discussion on July 8, 2024.

Koza is currently configured to generate one large TSV file containing all nodes parsed from a single data source. For debugging and certain other use cases, it would be useful to be able to produce separate node output files for each individual data source.

In addition to this behavior, a flag that disables creation of the large combined node TSV file may also be helpful.

Summary of request:

- Add an option to write node output files grouped by individual data source, in addition to the combined output.
- Add a flag to disable creation of the combined node TSV file.

Please reach out to @hrshdhgd for more details on the advantages of this approach!

kevinschaper commented 3 months ago

What does data source mean in this context? Should we be thinking in terms of just supplying a field or list of fields that would be used to split the output into separate files?

I don't know if partition is the right term here, but something like:

node_partition:
  fields: 
    - primary_knowledge_source
    - taxon
  write_combined: false
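
For illustration, here is a minimal sketch of how such a partition key could be derived from a node record, assuming the fields listed under node_partition are plain column names in the record (the node_partition block above is a proposal rather than an existing Koza option, and the function and record below are hypothetical):

# Hypothetical sketch: compute the partition key for one node record from
# the configured partition fields. Assumes each record is a dict mapping
# column name -> value; "node_partition" is a proposed option, not a
# current Koza feature.
from typing import Iterable, Mapping

def partition_key(record: Mapping[str, str], fields: Iterable[str]) -> tuple:
    """Return the tuple of field values used to pick an output file."""
    return tuple(record.get(field, "") for field in fields)

# Example: this record would land in the output keyed by
# ("infores:example-source", "NCBITaxon:562").
example = {
    "id": "NCBITaxon:562",
    "primary_knowledge_source": "infores:example-source",
    "taxon": "NCBITaxon:562",
}
print(partition_key(example, ["primary_knowledge_source", "taxon"]))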

hrshdhgd commented 3 months ago

A data source example would be this one.

We would like to have something like

output/
├── all-traits-nodes.tsv
├── all-traits-edges.tsv
├── OrganismTaxonPathways/
│   ├── nodes.tsv
│   └── edges.tsv
├── OrganismTaxonCarbonSubstrate/
│   ├── nodes.tsv
│   └── edges.tsv
└── OrganismTaxonMotility/
    ├── nodes.tsv
    └── edges.tsv

where tax_id => OrganismTaxon (from Biolink). This is a very crude example based on column names (the filenames could be more informative than just nodes and edges), but the idea is to generate all the KGs: an all-inclusive one plus its components. I have shown just three columns here, but there would be an OrganismXXX directory with XXX being every other column name.

The idea is to generate KGs of everything possible with respect to what's available in the data source. This will allow downstream projects to pick and choose either the individual KGs of interest (sort of like building a bouquet of flowers) or the whole thing, depending on their requirements. Hope this makes sense.
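
As a rough illustration of the per-column layout sketched above, here is a small Python sketch that derives one output directory per source column, assuming an OrganismTaxon prefix and the column names implied by the tree (all of these names are illustrative, not an existing Koza convention):

# Hypothetical sketch: derive one output directory per data column,
# mirroring the tree above. Column names and the OrganismTaxon prefix
# are illustrative assumptions.
from pathlib import Path

source_columns = ["pathways", "carbon_substrate", "motility"]  # assumed column names

def component_dirs(output_root: str, columns: list[str]) -> list[Path]:
    """One subdirectory per column, alongside the combined all-traits files."""
    dirs = []
    for column in columns:
        # e.g. "carbon_substrate" -> "OrganismTaxonCarbonSubstrate"
        camel = "".join(part.capitalize() for part in column.split("_"))
        dirs.append(Path(output_root) / f"OrganismTaxon{camel}")
    return dirs

for d in component_dirs("output", source_columns):
    print(d / "nodes.tsv", d / "edges.tsv")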

kevinschaper commented 3 months ago

I spent a little time looking at refactoring to go from a single writer to a dict of writers, but it wasn't the kind of refactor that just falls easily into place. I might start with a CLI command to split the files after the ingest, because that's much more straightforward to implement and much less likely to break the existing behavior.
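
For reference, a minimal sketch of what such a post-ingest split could look like, assuming a combined nodes TSV and a single column to split on (the function, file names, and split column are hypothetical, not an existing Koza CLI command):

# Hypothetical sketch of a post-ingest split: read a combined nodes TSV and
# write one TSV per distinct value of a chosen column. Not an existing Koza
# command; the file names and split column are assumptions.
import csv
from pathlib import Path

def split_nodes_by_column(nodes_tsv: str, column: str, output_dir: str) -> None:
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    writers: dict[str, csv.DictWriter] = {}
    handles = []
    with open(nodes_tsv, newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        for row in reader:
            key = row.get(column, "") or "unknown"
            if key not in writers:
                # One output file per distinct value, e.g. "infores_bacdive_nodes.tsv"
                handle = open(out_root / f"{key.replace(':', '_')}_nodes.tsv", "w", newline="")
                handles.append(handle)
                writer = csv.DictWriter(handle, fieldnames=reader.fieldnames, delimiter="\t")
                writer.writeheader()
                writers[key] = writer
            writers[key].writerow(row)
    for handle in handles:
        handle.close()

# e.g. split_nodes_by_column("output/all-traits-nodes.tsv", "category", "output/split")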