Open Bankso opened 4 months ago
In 24-3, at least the first two bullets are readily doable. The third bullet might be more difficult, so we need to scope it further and see how far we can get.
Currently reviewing components, attributes, and valid values
For hierarchy/structure, I did some preliminary analysis with GPT and ontology scoping, documented here: https://docs.google.com/document/d/1Vs-X4laTfih2YpoouF0njCCSQmcC4AIpsmgmCdl5b9c/edit?usp=sharing
Summary: it seems doable, but it will be a lot of work. To help minimize effort required, I'll source from existing ontologies for structure and devise mappings when needed.
In terms of implementation, I think defining pair-wise relationships will be sufficient, since the information will be carried forward in each mapping. A generic example would be:
Take five terms: RNA-seq, scRNA-seq, ATAC-seq, scATAC-seq, WGS
Highest level group: Genomic technique
Possible second level groups: bulk, single-cell, transcriptomics, epigenomics, RNA, DNA (lots of options, is the point)
Terms would be organized in a CSV with the columns: Technique (should replace assay), Parent, [all other info captured]
Then relationships are easy to define and the structure is easily inferred. For example, using Genomic --> Bulk, Single-cell --> RNA-seq, scRNA-seq, ATAC-seq, scATAC-seq, WGS:
Technique, Parent
Genomic, None
Bulk, Genomic
Single-cell, Genomic
RNA-seq, Bulk
ATAC-seq, Bulk
WGS, Bulk
scRNA-seq, Single-cell
scATAC-seq, Single-cell
. . .
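To illustrate why pair-wise relationships should be enough, here is a minimal Python sketch that reads the Technique/Parent rows from the example and infers the full structure from them. The function names (`load_parents`, `lineage`, `children_of`, `descendants`) are hypothetical and not part of the data model tooling. Treating the literal string "None" as "no parent" is also an assumption, taken from the example rows.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example in this thread; "None" marks the top-level group.
CSV_TEXT = """Technique, Parent
Genomic, None
Bulk, Genomic
Single-cell, Genomic
RNA-seq, Bulk
ATAC-seq, Bulk
WGS, Bulk
scRNA-seq, Single-cell
scATAC-seq, Single-cell
"""


def load_parents(text):
    """Return {technique: parent or None} from Technique/Parent CSV text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    parents = {}
    for row in reader:
        parent = row["Parent"].strip()
        # Assumption: an empty cell or the string "None" means "no parent".
        parents[row["Technique"].strip()] = None if parent in ("", "None") else parent
    return parents


def lineage(term, parents):
    """Walk parent links from a term up to its top-level group."""
    path = [term]
    while (parent := parents.get(path[-1])) is not None:
        if parent in path:
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
    return path


def children_of(parents):
    """Invert the child -> parent mapping into parent -> [children]."""
    tree = defaultdict(list)
    for term, parent in parents.items():
        if parent is not None:
            tree[parent].append(term)
    return dict(tree)


def descendants(term, tree):
    """Collect every term below a group, depth-first."""
    found = []
    for child in tree.get(term, []):
        found.append(child)
        found.extend(descendants(child, tree))
    return found


parents = load_parents(CSV_TEXT)
tree = children_of(parents)
print(lineage("scRNA-seq", parents))  # ['scRNA-seq', 'Single-cell', 'Genomic']
print(descendants("Bulk", tree))      # ['RNA-seq', 'ATAC-seq', 'WGS']
```

Each row stores only one parent, so a term that belongs under two second-level groups (e.g. both single-cell and transcriptomics) would need either multiple rows or a list-valued Parent column. This sketch does not handle that case.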
Suggest chatting with ANV to see how this was designed and implemented in NF
Related to https://github.com/mc2-center/data-models/issues/49 and https://github.com/mc2-center/data-models/pull/66
Draft of the MC2 data model dictionary, using GitHub pages deployment, is here: https://mc2-center.github.io/data-models/
Potential actions that could improve documentation quality (should determine necessity/priority for the following):
- Component and Attribute entries: add descriptions where missing/incomplete
- Valid Values: add descriptions, ontology references
- Valid Values