Open DnlRKorn opened 3 months ago
What does data source mean in this context? Should we be thinking in terms of just supplying a field or list of fields that would be used to split the output into separate files?
I don't know if partition is the right term here, but something like:
node_partition:
  fields:
    - primary_knowledge_source
    - taxon
  write_combined: false
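To make the proposal concrete, here is a minimal Python sketch of how such a config might drive the split. Nothing in it is existing Koza API: `partition_rows`, the "combined" and "unknown" bucket names, and the sample rows are all hypothetical, and the real writer works on files rather than in-memory lists.

```python
from collections import defaultdict


def partition_rows(rows, fields, write_combined=True):
    """Group row dicts by the values of `fields` (hypothetical helper).

    Returns a dict mapping an output name to a list of rows. When
    write_combined is true, every row is also kept under "combined".
    """
    outputs = defaultdict(list)
    for row in rows:
        # Rows with a missing value go to an "unknown" bucket instead of being dropped.
        key = "_".join(str(row.get(f) or "unknown") for f in fields)
        outputs[key].append(row)
    if write_combined:
        outputs["combined"] = list(rows)
    return dict(outputs)


# Made-up rows, only to show the grouping.
nodes = [
    {"id": "A:1", "primary_knowledge_source": "infores:x", "taxon": "NCBITaxon:9606"},
    {"id": "A:2", "primary_knowledge_source": "infores:x", "taxon": "NCBITaxon:10090"},
    {"id": "A:3", "primary_knowledge_source": "infores:y", "taxon": "NCBITaxon:9606"},
]
parts = partition_rows(nodes, ["primary_knowledge_source", "taxon"], write_combined=False)
for name, group in parts.items():
    print(name, [r["id"] for r in group])
```

With `write_combined: false` only the per-partition groups are produced, which matches the request for a flag that disables the single large file.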
A data source example would be this one.
We would like to have something like
output/
├── all-traits-nodes.tsv
├── all-traits-edges.tsv
├── OrganismTaxonPathways/
│   ├── nodes.tsv
│   └── edges.tsv
├── OrganismTaxonCarbonSubstrate/
│   ├── nodes.tsv
│   └── edges.tsv
└── OrganismTaxonMotility/
    ├── nodes.tsv
    └── edges.tsv
where tax_id => OrganismTaxon (from biolink).
This is a very crude example based on column names (filenames could be more informative than just nodes and edges). But the idea is to generate all KGs (an all-inclusive one and its components). I have shown just 3 columns, but we would have OrganismXXX, with XXX being every other column name.
The idea is to generate KGs of everything possible w.r.t. what's available in the data source. This will allow downstream projects to pick and choose either individual KGs of interest (sort of like building a bouquet of flowers) or the whole thing, based on their requirements. Hope this makes sense.
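As a rough illustration of the naming scheme in the tree above, the sketch below maps each non-ID column to the files of its component KG. The column names (`pathways`, `carbon_substrate`, `motility`) and both helper functions are assumptions made for this example; only tax_id => OrganismTaxon and the directory layout come from the example above.

```python
def component_name(column, subject="OrganismTaxon"):
    """Build a directory name such as OrganismTaxonCarbonSubstrate."""
    words = column.replace("-", "_").split("_")
    # Capitalize only the first letter so existing inner capitals survive.
    return subject + "".join(w[:1].upper() + w[1:] for w in words if w)


def plan_outputs(columns, id_column="tax_id", root="output"):
    """Map each non-ID column to the nodes/edges paths of its component KG."""
    plan = {}
    for column in columns:
        if column == id_column:
            continue  # the ID column is the subject, not a component
        name = component_name(column)
        plan[column] = (f"{root}/{name}/nodes.tsv", f"{root}/{name}/edges.tsv")
    return plan


# Hypothetical column names for a traits table.
for column, paths in plan_outputs(["tax_id", "pathways", "carbon_substrate", "motility"]).items():
    print(column, paths)
```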
I spent a little time looking at refactoring to go from a single writer to a dict of writers, but it wasn’t the kind of refactor that just easily falls into place. I might start with a CLI command to split the files after the ingest, because that’s much more straightforward to implement, and much less likely to break the existing behavior.
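A post-ingest split could look roughly like the sketch below, which uses only the Python standard library. It is not an existing Koza command: `split_tsv`, its arguments, and the output naming are assumptions for illustration.

```python
import csv
import re
from pathlib import Path


def split_tsv(path, field, out_dir):
    """Split the TSV at `path` into one file per distinct value of `field`.

    Returns a dict mapping each value to the path that was written.
    Assumes every row has the same columns as the header.
    """
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, writers, written = {}, {}, {}
    try:
        with path.open(newline="") as src:
            reader = csv.DictReader(src, delimiter="\t")
            for row in reader:
                value = row.get(field) or "unknown"
                if value not in writers:
                    # Make the value safe for a filename. Two values that
                    # sanitize to the same name would collide; a real
                    # implementation would need to handle that.
                    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", value)
                    target = out_dir / f"{safe}_{path.name}"
                    handles[value] = target.open("w", newline="")
                    writers[value] = csv.DictWriter(
                        handles[value], fieldnames=reader.fieldnames, delimiter="\t"
                    )
                    writers[value].writeheader()
                    written[value] = target
                writers[value].writerow(row)
    finally:
        # One handle stays open per distinct value until the end of the run.
        for handle in handles.values():
            handle.close()
    return written
```

Calling `split_tsv("output/nodes.tsv", "category", "output/split")` would leave the original file untouched and write one TSV per distinct value next to it, so the existing single-writer behavior is not affected.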
Posting as a result of KG Construction Crew discussion on July 8, 2024.
The current configuration of Koza generates one large TSV file for all nodes parsed from a single data source. To help with debugging and certain use cases, having the ability to output node files for each individual data source could be useful.
In addition to this behavior, adding a flag which could be used to disable the creation of the large node TSV file may also be helpful.
Summary of request:
Please reach out to @hrshdhgd for more details of the advantages of this approach!