mc2-center / mc2-center-dcc

Data coordination resources for CCKP (and MC2 in general)

Outline and integrate metadata validation processes for CCKP database #37

Closed Bankso closed 8 months ago

Bankso commented 8 months ago

Quality control/validation should be integrated into the CCKP database update/release process. To help design this infrastructure, here is an outline of the curation and release/portal sync process, focusing on validation points/data quality/completion status management:

Next steps:

Bankso commented 8 months ago

@aditigopalan Thinking about our conversations on minimum annotations for the CCKP, I think we can decrease the number of required columns in our DatasetView manifest. I imagine this will make it easier to upload preliminary annotations and better reflect the concept of "minimum viable annotations" that we intend to provide as preliminary metadata for contributors to review and supplement.

Here's what I think would be reasonable to mark as required: Component, DatasetView_id, Dataset Grant Number, Dataset Name, Dataset Alias, Dataset Assay, Dataset Species, Dataset URL

And optional: Dataset Pubmed Id, Dataset Description, Dataset Design, Dataset Tumor Type, Dataset Tissue, Dataset File Format

Do the required fields seem feasible as preliminary annotations that get recorded prior to uploading for contributor review? I started adding some changes in this data-models branch: https://github.com/mc2-center/data-models/tree/curation-terms-1-24
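For illustration, the required/optional split proposed above could be checked against a manifest with something like the following. This is only a sketch using the standard library, not how schematic enforces requirements; the function name `missing_required`, the sample rows, and the grant numbers in them are made up for the example.

```python
import csv
import io

# Required columns as proposed in this comment (not necessarily the final data model).
REQUIRED = [
    "Component", "DatasetView_id", "Dataset Grant Number", "Dataset Name",
    "Dataset Alias", "Dataset Assay", "Dataset Species", "Dataset URL",
]


def missing_required(rows, required=REQUIRED):
    """Return {row_number: [column names]} for rows with blank required cells.

    Row numbers count data rows starting at 1 (the header is not counted).
    """
    problems = {}
    for i, row in enumerate(rows, start=1):
        missing = [c for c in required if not (row.get(c) or "").strip()]
        if missing:
            problems[i] = missing
    return problems


# Hypothetical two-row manifest, for illustration only.
sample = io.StringIO(
    "Component,DatasetView_id,Dataset Grant Number,Dataset Name,Dataset Alias,"
    "Dataset Assay,Dataset Species,Dataset URL,Dataset Tissue\n"
    "DatasetView,d1,CA000001,Example A,ex-a,RNA-seq,Human,https://example.org/a,\n"
    "DatasetView,d2,CA000002,Example B,ex-b,,Human,,Lung\n"
)
rows = list(csv.DictReader(sample))
print(missing_required(rows))  # {2: ['Dataset Assay', 'Dataset URL']}
```

The first row passes even though the optional Dataset Tissue cell is blank; the second row is flagged only for the two required cells it leaves empty.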

aditigopalan commented 8 months ago

@Bankso would it make sense to just make Component, DatasetView_id, Dataset Grant Number, Dataset Name, Dataset Alias required fields for upload? I think this would make the (initial) annotation and upload process significantly quicker for me! If we decide to keep the other fields, I'm not sure it makes a huge difference wrt annotation time and changes wrt "minimum viable annotations" may not save us too much time. Let me know what you think!

Bankso commented 8 months ago

Initial version of script for validating and cleaning the UNION tables is here: https://github.com/mc2-center/mc2-center-dcc/blob/32-tidy-deployment/utils/union_qc.py

So far, the script is written as follows:

Input flags: -l takes a space-separated list of Synapse table IDs; -c takes the path to a schematic config.yml; -m is a boolean flag that runs the row merge function on each table.
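A minimal sketch of how these three flags might be declared with argparse. This is not copied from union_qc.py: whether -l and -c are actually required, the help strings, and the placeholder Synapse IDs are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags described in this comment."""
    parser = argparse.ArgumentParser(
        description="Sketch of the union_qc command-line flags"
    )
    # -l: one or more Synapse table IDs, separated by spaces
    parser.add_argument("-l", nargs="+", required=True,
                        help="space-separated list of Synapse table IDs")
    # -c: path to the schematic config.yml
    parser.add_argument("-c", required=True,
                        help="path to a schematic config.yml")
    # -m: boolean switch, off unless given
    parser.add_argument("-m", action="store_true",
                        help="merge rows in each table before validation")
    return parser


# Placeholder IDs and path, for illustration only.
args = build_parser().parse_args(["-l", "syn111", "syn222", "-c", "config.yml", "-m"])
print(args.l, args.c, args.m)  # ['syn111', 'syn222'] config.yml True
```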

Processing steps:

  1. run a query to get all rows and columns from the tables given via -l
  2. save the downloaded table(s) as CSV
  3. [if -m is provided at run time] combine rows in a downloaded table if the entries within a specific column match one another
     a. for grant number and entityId, the resulting cell in the combined row will be the comma-separated list of contents contained in all cells belonging to the group
     b. all other entries are assumed to be redundant, so the script takes the first entry in each group and uses that for the new row
     c. add the suffix "_merged" to the output file name (e.g., "DatasetView_merged.csv")
  4. run validation on the downloaded or merged manifest(s), depending on whether -m was provided at run time, using schematic and the MC2 data model, as indicated in the config.yml provided via the -c flag
  5. store the error and output from validation as txt files
  6. convert the validation output txt file to a CSV, where each error message is a separate row
  7. extract row identifiers from the validation report CSVs, building a list of manifest rows that caused schematic to report an error
  8. trim invalid rows from the manifest(s), adding suffix "_trimmed" (e.g., "DatasetView_trimmed.csv")
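
Steps 7 and 8 could look roughly like the sketch below. The layout of the validation report CSV is an assumption: the column name "row", the sample messages, and the convention that row numbers start at 2 (the first line after the header) are hypothetical and may not match what schematic or union_qc.py actually produce.

```python
import csv
import io


def invalid_row_numbers(report_rows, row_field="row"):
    """Collect the manifest row numbers named in a validation report.

    `row_field` is a hypothetical column name for the row identifier.
    """
    return {
        int(r[row_field])
        for r in report_rows
        if (r.get(row_field) or "").strip().isdigit()
    }


def trim_manifest(manifest_rows, bad_rows, first_row_number=2):
    """Drop manifest rows whose number appears in bad_rows.

    Numbering is assumed to start at 2, i.e. the first line after the header.
    """
    return [
        row
        for n, row in enumerate(manifest_rows, start=first_row_number)
        if n not in bad_rows
    ]


# Hypothetical report and manifest, for illustration only.
report = list(csv.DictReader(io.StringIO(
    "row,column,message\n"
    "3,Dataset Assay,invalid value\n"
    "3,Dataset URL,not a URL\n"
)))
manifest = list(csv.DictReader(io.StringIO(
    "DatasetView_id,Dataset Name\nd1,A\nd2,B\nd3,C\n"
)))

bad = invalid_row_numbers(report)
trimmed = trim_manifest(manifest, bad)
print([r["DatasetView_id"] for r in trimmed])  # ['d1', 'd3']
```

Collecting the row numbers into a set means a row reported for several errors is still removed only once.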

Bankso commented 8 months ago

One other thing I wanted to expand on a bit: the CSV outputs from the union_qc script will retain the entityId column. Instead of retaining a single entityId for merged rows, all entityIds from a merged group will be joined for the resulting entry.

The goal here was to provide the information necessary to directly access the source manifest tables in a grant-based Synapse project, which is something we've discussed in relation to CCKP entries.
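
To make the merge behavior concrete, here is a rough standard-library sketch of the row merge, not the code in union_qc.py. The choice of DatasetView_id as the matching column, the ", " separator, and dropping repeated values from the joined cells are assumptions; the sample rows and IDs are invented.

```python
def merge_rows(rows, key, join_columns=("Dataset Grant Number", "entityId")):
    """Merge rows that share the same value in `key`.

    Columns in `join_columns` become a comma-separated list of the values
    found in the group; every other column takes the group's first entry.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    merged = []
    for group in groups.values():
        new_row = dict(group[0])  # first entry wins for the remaining columns
        for col in join_columns:
            if col not in new_row:
                continue
            values = []
            for row in group:
                value = row.get(col, "")
                # Assumption: skip blanks and values already collected.
                if value and value not in values:
                    values.append(value)
            new_row[col] = ", ".join(values)
        merged.append(new_row)
    return merged


# Invented rows: two entries for d1 from different grant projects.
rows = [
    {"DatasetView_id": "d1", "Dataset Name": "A",
     "Dataset Grant Number": "CA000001", "entityId": "syn1"},
    {"DatasetView_id": "d1", "Dataset Name": "A (dup)",
     "Dataset Grant Number": "CA000002", "entityId": "syn2"},
    {"DatasetView_id": "d2", "Dataset Name": "B",
     "Dataset Grant Number": "CA000001", "entityId": "syn3"},
]
merged = merge_rows(rows, key="DatasetView_id")
print(merged[0]["entityId"])  # syn1, syn2
```

Each entityId in the joined cell still points at one source manifest row, so the merged row can be traced back to the tables it came from.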

aclayton555 commented 8 months ago

Takeaways from discussion on 24-2 sprint kick-off, on 2024.02.02

Did I miss anything?

Bankso commented 8 months ago

This looks great, thank you for writing this up, @aclayton555!

Bankso commented 8 months ago

Changes to data model noted in this issue are included in PR https://github.com/mc2-center/data-models/pull/70

Metadata validation/database management scripts are addressed by PR #39

Detailed layout of steps is being worked on as part of Issue #36