Data model content updates to support GH docs

Bankso commented 9 months ago

Related to https://github.com/mc2-center/data-models/issues/49 and https://github.com/mc2-center/data-models/pull/66

Draft of the MC2 data model dictionary, using GitHub pages deployment, is here: https://mc2-center.github.io/data-models/

Potential actions that could improve documentation quality (should determine necessity/priority for the following):

- review all Component and Attribute entries and add descriptions where missing/incomplete
- review Valid Values and add descriptions, ontology references
- implement hierarchy for Valid Values

aclayton555 commented 8 months ago

In 24-3, at least the first two bullets are readily doable. The third bullet might be more difficult, so we need to scope it further and see how far we can get.

Bankso commented 7 months ago

Currently reviewing components, attributes, and valid values

For hierarchy/structure, I did some preliminary analysis with GPT and ontology scoping, documented here: https://docs.google.com/document/d/1Vs-X4laTfih2YpoouF0njCCSQmcC4AIpsmgmCdl5b9c/edit?usp=sharing

Summary: it seems doable, but it will be a lot of work. To help minimize effort required, I'll source from existing ontologies for structure and devise mappings when needed.

In terms of implementation, I think defining pair-wise relationships will be sufficient, since the information will be carried forward in each mapping. A generic example would be:

Take five terms: RNA-seq, scRNA-seq, ATAC-seq, scATAC-seq, WGS
Highest level group: Genomic technique
Possible second level groups: bulk, single-cell, transcriptomics, epigenomics, RNA, DNA (lots of options, is the point)

Organizing terms would occur in a CSV, using the column names: Technique (should replace assay), Parent, [all other info captured]

Then relationships are easy to define and structure is easily inferred, using Genomic --> bulk, single-cell --> RNA-seq, scRNA-seq, ATAC-seq, scATAC-seq, WGS

Technique, Parent
Genomic, None
Bulk, Genomic
Single-cell, Genomic
RNA-seq, Bulk
ATAC-seq, Bulk
WGS, Bulk
scRNA-seq, Single-cell
scATAC-seq, Single-cell
. . .
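
A minimal sketch of how the pair-wise rows above could be read back into a hierarchy, assuming the CSV keeps only the Technique and Parent columns and uses "None" for the top-level term. The function names (load_parents, lineage, children_of) are hypothetical and not part of the data model tooling; this is only meant to illustrate that parent links alone are enough to recover the full structure.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example in this comment; "None" marks the root term.
TERMS_CSV = """Technique, Parent
Genomic, None
Bulk, Genomic
Single-cell, Genomic
RNA-seq, Bulk
ATAC-seq, Bulk
WGS, Bulk
scRNA-seq, Single-cell
scATAC-seq, Single-cell
"""


def load_parents(text):
    """Map each technique to its parent (None for the root term)."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    parents = {}
    for row in reader:
        term = row["Technique"].strip()
        parent = row["Parent"].strip()
        parents[term] = None if parent in ("", "None") else parent
    return parents


def lineage(term, parents):
    """Follow pair-wise links from a term up to the root, returned root-first."""
    path = [term]
    seen = {term}
    while True:
        if path[-1] not in parents:
            raise KeyError(f"no row defines {path[-1]!r}")
        parent = parents[path[-1]]
        if parent is None:
            break
        if parent in seen:
            raise ValueError(f"cycle detected at {parent!r}")
        seen.add(parent)
        path.append(parent)
    return path[::-1]


def children_of(parents):
    """Invert the parent links into a parent -> [children] mapping."""
    tree = defaultdict(list)
    for term, parent in parents.items():
        if parent is not None:
            tree[parent].append(term)
    return dict(tree)


parents = load_parents(TERMS_CSV)
print(" --> ".join(lineage("scRNA-seq", parents)))
# prints: Genomic --> Single-cell --> scRNA-seq
```

Storing only the parent per row keeps the CSV flat and easy to edit, while the checks for missing rows and cycles catch the most likely curation mistakes.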

aclayton555 commented 6 months ago

Suggest to chat with ANV to see how this was designed and implemented in NF

aclayton555 commented 5 months ago

Continue working through building this out
For search purposes, FTS may not necessitate this hierarchy, so that use case can be deprioritized until we know if FTS is favourable.

aclayton555 commented 4 months ago

24-6: No updates this sprint. Carry into next sprint

Bankso commented 4 months ago

I will continue to collate valid value definitions for assays, tissues, and tumor types here: https://docs.google.com/spreadsheets/d/1YL8kDB_tdvGDYqDy4x8zlBauLPDxc24W4tLEArJh0kQ/edit?usp=sharing

In addition, there are many valid value sets that are missing descriptions/definitions, like file formats, licenses, input/output formats, etc. The next step is to identify all value types that would benefit from this exercise and note them here.

aclayton555 commented 2 months ago

24-7/8 close out: have new models to add (per #115)

aclayton555 commented 2 months ago

24-9: Secondary to site visit priorities.

Will require some work to add new components. There might be some room for automation to help pull this information more easily as the data model updates.

aclayton555 commented 1 week ago

Combine this work with: https://github.com/mc2-center/data-models/issues/142

aclayton555 commented 1 week ago

24-11/12 Blocked until tech writer contract in place.

Data model content updates to support GH docs #67