Closed: Bankso closed this issue 8 months ago
@aditigopalan Thinking about our conversations on minimum annotations for the CCKP, I think we can decrease the number of required columns in our DatasetView manifest. I imagine this will make it easier to upload preliminary annotations and better reflect the concept of "minimum viable annotations" that we intend to provide as preliminary metadata for contributors to review and supplement.
Here's what I think would be reasonable to mark as required: Component, DatasetView_id, Dataset Grant Number, Dataset Name, Dataset Alias, Dataset Assay, Dataset Species, Dataset URL
And optional: Dataset Pubmed Id, Dataset Description, Dataset Design, Dataset Tumor Type, Dataset Tissue, Dataset File Format
Do the required fields seem feasible as preliminary annotations that get recorded prior to uploading for contributor review? I started adding some changes in this data-models branch: https://github.com/mc2-center/data-models/tree/curation-terms-1-24
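To make the proposed split concrete, here is a minimal sketch of a completeness check for a single manifest row. The column names come from the lists above; the helper name `missing_required` and the example values are hypothetical, and this is not how schematic or the data model enforces required fields.

```python
# Column split as proposed in this comment (not yet the deployed data model).
REQUIRED_COLUMNS = [
    "Component", "DatasetView_id", "Dataset Grant Number", "Dataset Name",
    "Dataset Alias", "Dataset Assay", "Dataset Species", "Dataset URL",
]
OPTIONAL_COLUMNS = [
    "Dataset Pubmed Id", "Dataset Description", "Dataset Design",
    "Dataset Tumor Type", "Dataset Tissue", "Dataset File Format",
]


def missing_required(row):
    """Return the required columns that are absent or blank in a manifest row."""
    return [col for col in REQUIRED_COLUMNS if not str(row.get(col) or "").strip()]


# Hypothetical preliminary annotation: every value here is a placeholder.
example_row = {
    "Component": "DatasetView",
    "DatasetView_id": "example_id",
    "Dataset Grant Number": "example_grant",
    "Dataset Name": "Example dataset",
    "Dataset Alias": "example_alias",
    "Dataset Assay": "example assay",
    "Dataset Species": "example species",
    "Dataset URL": "",  # left blank on purpose
}

print(missing_required(example_row))
```

A row passes when the returned list is empty; optional columns are never reported, so they can be filled in later during contributor review.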
@Bankso would it make sense to just make Component, DatasetView_id, Dataset Grant Number, Dataset Name, Dataset Alias required fields for upload? I think this would make the (initial) annotation and upload process significantly quicker for me! If we decide to keep the other fields, I'm not sure it makes a huge difference wrt annotation time and changes wrt "minimum viable annotations" may not save us too much time. Let me know what you think!
Initial version of script for validating and cleaning the UNION tables is here: https://github.com/mc2-center/mc2-center-dcc/blob/32-tidy-deployment/utils/union_qc.py
So far, the script is written as follows:
Input flags:
-l takes a space-separated list of table Synapse IDs
-c takes the path to a schematic config.yml
-m is a boolean flag that will run the row merge function on each table
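For reference, the three flags described above could be declared with `argparse` roughly as follows. This is a sketch of the interface only, not the contents of union_qc.py; the `dest` names, help text, and the Synapse IDs in the usage line are placeholders.

```python
import argparse


def build_parser():
    """Build a parser mirroring the -l, -c and -m flags described above."""
    parser = argparse.ArgumentParser(description="Sketch of the union_qc flags")
    # -l: one or more table Synapse IDs, separated by spaces
    parser.add_argument("-l", dest="table_ids", nargs="+",
                        help="space-separated list of table Synapse IDs")
    # -c: path to the schematic config.yml
    parser.add_argument("-c", dest="config",
                        help="path to a schematic config.yml")
    # -m: boolean switch, off unless given
    parser.add_argument("-m", dest="merge", action="store_true",
                        help="run the row merge function on each table")
    return parser


# Placeholder IDs, parsed from an explicit list instead of sys.argv.
args = build_parser().parse_args(
    ["-l", "syn0000001", "syn0000002", "-c", "config.yml", "-m"]
)
print(args.table_ids, args.config, args.merge)
```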
Processing steps:
[...] -l [...]
[...] -m was provided at runtime, [...] using schematic and the MC2 data model, as indicated in the config.yml provided via the -c flag
One other thing I wanted to expand on a bit: the CSV outputs from the union_qc script will retain the entityId column. Instead of retaining a single entityId for merged rows, all entityIds from a merged group will be joined for the resulting entry.
The goal here was to provide the information necessary to directly access the source manifest tables in a grant-based Synapse project, which is something we've discussed in relation to CCKP entries.
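The entityId handling described above can be sketched in plain Python as follows. This is an illustration, not the code in union_qc.py: which columns define a merged group (`key_columns`), the separator, and the example IDs are all assumptions.

```python
def merge_rows(rows, key_columns, id_column="entityId", sep=", "):
    """Merge rows that share the same key, joining all of their entityIds.

    Non-key columns keep the values from the first row seen in each group.
    """
    groups = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in groups:
            merged = dict(row)
            merged[id_column] = [row[id_column]]  # collect IDs as a list first
            groups[key] = merged
        elif row[id_column] not in groups[key][id_column]:
            groups[key][id_column].append(row[id_column])
    # Join the collected IDs into one string per merged entry.
    return [{**m, id_column: sep.join(m[id_column])} for m in groups.values()]


# Hypothetical rows: the first two describe the same dataset in different manifests.
rows = [
    {"DatasetView_id": "ds_a", "Dataset Name": "A", "entityId": "syn0000001"},
    {"DatasetView_id": "ds_a", "Dataset Name": "A", "entityId": "syn0000002"},
    {"DatasetView_id": "ds_b", "Dataset Name": "B", "entityId": "syn0000003"},
]
merged = merge_rows(rows, key_columns=["DatasetView_id"])
for entry in merged:
    print(entry)
```

Keeping every source entityId on the merged entry is what allows a direct lookup back to each source manifest table.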
Takeaways from discussion on 24-2 sprint kick-off, on 2024.02.02
Did I miss anything?
This looks great, thank you for writing this up, @aclayton555!
Changes to data model noted in this issue are included in PR https://github.com/mc2-center/data-models/pull/70
Metadata validation/database management scripts are addressed by PR #39
Detailed layout of steps is being worked on as part of Issue #36
Quality control/validation should be integrated into the CCKP database update/release process. To help design this infrastructure, here is an outline of the curation and release/portal sync process, focusing on validation points/data quality/completion status management:
Next steps: