Outline and integrate metadata validation processes for CCKP database

Bankso commented 8 months ago

Quality control/validation should be integrated into the CCKP database update/release process. To help design this infrastructure, here is an outline of the curation and release/portal sync process, focusing on validation points/data quality/completion status management:

1. The PubMed crawler is run to identify new publications, associated datasets, and associated tools
2. Publications are fully annotated; datasets and tools are annotated to the minimum required standard
   - validation is performed by schematic prior to Synapse upload, based on the MC2 data model
3. Contributors access newly uploaded manifests through the DCA
   - validation is performed by schematic via DCA, based on the MC2 data model
   - Note that this step will stratify metadata into two states: contributor reviewed and not yet reviewed
   - DCA submission will automatically check for required entries and validate provided info, if contributor has reviewed
   - this could be a point where we implement something slightly flashy - a badge or status marker on the CCKP card to indicate that a contributor has reviewed the metadata
   - display a completion percentage associated with their grant/project?
   - mark unreviewed entries as unreviewed
4. All project tables will show all tables stored in MC2 Synapse projects
5. UNION tables will show all metadata stored in tables of a single metadata component
6. After a designated cutoff (e.g., 2 weeks post ingress of preliminary annotations), pull UNION tables as CSVs; validation can be done with schematic
   - ensure minimum metadata has been provided, flag errors/invalid terms, verify release date as applicable
   - report validation status to a new sheet that contains the row identifier (_id) and a go/no go flag
   - store the validation status table in the CCKP admin project
7. The syncing script should pull both tables, create a join on _id, and sync only those rows with TRUE in the go/no go flag attribute
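
A minimal sketch of the final sync step, assuming both tables have already been exported as CSV. The `go` column name, the toy rows, and the `rows_to_sync` helper are illustrative assumptions, not part of the existing syncing scripts:

```python
import csv
import io


def rows_to_sync(union_rows, status_rows, flag_column="go"):
    """Join UNION rows to the validation status table on _id.

    Only rows whose status entry holds TRUE in the go/no go column are kept.
    Rows with no status entry are dropped (inner-join behaviour).
    """
    approved = {
        row["_id"]
        for row in status_rows
        if row[flag_column].strip().upper() == "TRUE"
    }
    return [row for row in union_rows if row["_id"] in approved]


# Toy data standing in for the UNION table and the validation status table.
union_csv = "_id,Dataset Name\nd1,Alpha\nd2,Beta\nd3,Gamma\n"
status_csv = "_id,go\nd1,TRUE\nd2,FALSE\n"

union_rows = list(csv.DictReader(io.StringIO(union_csv)))
status_rows = list(csv.DictReader(io.StringIO(status_csv)))

synced = rows_to_sync(union_rows, status_rows)
print([row["_id"] for row in synced])
```

Treating a missing status entry as "no go" is a design choice in this sketch: a row that never went through validation should not reach the portal by default.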

Next steps:

Review and identify points in process that require additional scripts
Determine what a minimal metadata model looks like for each component and if a minimal metadata model will be required for uploading preliminary annotations
Revisit required attributes and adjust requirements if they do not match the minimal model definition

Bankso commented 8 months ago

@aditigopalan Thinking about our conversations on minimum annotations for the CCKP, I think we can decrease the number of required columns in our DatasetView manifest. I imagine this will make it easier to upload preliminary annotations and better reflect the concept of "minimum viable annotations" that we intend to provide as preliminary metadata for contributors to review and supplement.

Here's what I think would be reasonable to mark as required: Component, DatasetView_id, Dataset Grant Number, Dataset Name, Dataset Alias, Dataset Assay, Dataset Species, Dataset URL

And optional: Dataset Pubmed Id, Dataset Description, Dataset Design, Dataset Tumor Type, Dataset Tissue, Dataset File Format
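
To make the proposal concrete, here is a rough sketch of a pre-upload check for the required columns listed above. It is a hand-rolled illustration, not how schematic performs validation; the `missing_required` helper and the sample row are assumptions:

```python
# Required attributes as proposed in this comment.
REQUIRED = [
    "Component",
    "DatasetView_id",
    "Dataset Grant Number",
    "Dataset Name",
    "Dataset Alias",
    "Dataset Assay",
    "Dataset Species",
    "Dataset URL",
]


def missing_required(row, required=REQUIRED):
    """Return the required columns that are absent or blank in a manifest row."""
    return [col for col in required if not (row.get(col) or "").strip()]


# Toy manifest row with placeholder values; Dataset URL is left blank on purpose.
sample_row = {
    "Component": "DatasetView",
    "DatasetView_id": "d1",
    "Dataset Grant Number": "CA000001",
    "Dataset Name": "Example dataset",
    "Dataset Alias": "EX-1",
    "Dataset Assay": "RNA-seq",
    "Dataset Species": "Human",
    "Dataset URL": "",
}

print(missing_required(sample_row))
```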

Do the required fields seem feasible as preliminary annotations that get recorded prior to uploading for contributor review? I started adding some changes in this data-models branch: https://github.com/mc2-center/data-models/tree/curation-terms-1-24

aditigopalan commented 8 months ago

@Bankso would it make sense to just make Component, DatasetView_id, Dataset Grant Number, Dataset Name, Dataset Alias required fields for upload? I think this would make the (initial) annotation and upload process significantly quicker for me! If we decide to keep the other fields, I'm not sure it makes a huge difference wrt annotation time and changes wrt "minimum viable annotations" may not save us too much time. Let me know what you think!

Bankso commented 8 months ago

Initial version of script for validating and cleaning the UNION tables is here: https://github.com/mc2-center/mc2-center-dcc/blob/32-tidy-deployment/utils/union_qc.py

So far, the script is written as follows:

input flags: -l takes a space-separated list of table Synapse IDs; -c takes the path to a schematic config.yml; -m is a boolean flag that will run the row merge function on each table
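
For reference, the three flags could be declared with argparse roughly as follows. This is a sketch, not the actual union_qc.py code: the `dest` names, help text, and the choice to make -l and -c required are assumptions, and `syn00000000` is a placeholder ID:

```python
import argparse


def build_parser():
    """Build a parser with the -l, -c and -m flags described above."""
    parser = argparse.ArgumentParser(
        description="Validate and clean UNION tables (illustrative sketch)."
    )
    # -l: one or more table Synapse IDs, separated by spaces
    parser.add_argument("-l", dest="table_ids", nargs="+", required=True,
                        help="space-separated list of table Synapse IDs")
    # -c: path to the schematic config.yml
    parser.add_argument("-c", dest="config", required=True,
                        help="path to a schematic config.yml")
    # -m: boolean flag, off unless given on the command line
    parser.add_argument("-m", dest="merge", action="store_true",
                        help="run the row merge function on each table")
    return parser


# Parse a fixed argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["-l", "syn52752399", "syn00000000", "-c", "config.yml", "-m"]
)
print(args.table_ids, args.config, args.merge)
```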

Processing steps:

1. runs a query to get all rows, columns from tables, as indicated by input -l
2. saves the downloaded table(s) as CSV
3. [if -m is provided at run time] combine rows in a downloaded table if the entries within a specific column match one another
   a. for grant number and entityId, the resulting cell in the combined row will be the comma-separated list of contents contained in all cells belonging to the group
   b. all other entries are assumed to be redundant, so the script takes the first entry in each group and uses that for the new row
   c. adds suffix "_merged" (e.g., "DatasetView_merged.csv")
4. run validation on the downloaded or merged manifest(s), depending on if -m was provided at runtime, using schematic and the MC2 data model, as indicated in config.yml provided via -c flag
5. store the error and output from validation as txt files
6. convert the validation output txt file to a CSV, where each error message is a separate row
7. extract row identifiers from the validation report CSVs, building a list of manifest rows that caused schematic to report an error
8. trim invalid rows from the manifest(s), adding suffix "_trimmed" (e.g., "DatasetView_trimmed.csv")
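
The merge and trim steps can be sketched in plain Python as below. This is an illustration of the logic described above, not the code in union_qc.py: the helper names, the choice of DatasetView_id as the grouping column, the ", " separator, the de-duplication of joined values, and the toy rows are all assumptions:

```python
def merge_rows(rows, key, joined=("Dataset Grant Number", "entityId")):
    """Combine rows that share the same value in the `key` column.

    Columns named in `joined` become a comma-separated list of the group's
    values; every other column takes the first entry in the group.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    merged = []
    for group in groups.values():
        combined = dict(group[0])  # first entry wins for redundant columns
        for col in joined:
            values = []
            for row in group:
                if row[col] not in values:  # assumption: drop repeated values
                    values.append(row[col])
            combined[col] = ", ".join(values)
        merged.append(combined)
    return merged


def trim_rows(rows, invalid_ids, id_column):
    """Drop rows whose identifier appears in the validation error report."""
    return [row for row in rows if row[id_column] not in invalid_ids]


# Toy rows with placeholder grant numbers and Synapse IDs.
rows = [
    {"DatasetView_id": "d1", "Dataset Grant Number": "CA000001",
     "entityId": "syn1", "Dataset Name": "Alpha"},
    {"DatasetView_id": "d1", "Dataset Grant Number": "CA000002",
     "entityId": "syn2", "Dataset Name": "Alpha"},
    {"DatasetView_id": "d2", "Dataset Grant Number": "CA000001",
     "entityId": "syn3", "Dataset Name": "Beta"},
]

merged = merge_rows(rows, key="DatasetView_id")
trimmed = trim_rows(merged, invalid_ids={"d2"}, id_column="DatasetView_id")
print(merged[0]["entityId"], "|", [row["DatasetView_id"] for row in trimmed])
```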

Bankso commented 8 months ago

One other thing I wanted to expand on a bit: the CSV outputs from the union_qc script will retain the entityId column. Instead of retaining a single entityId for merged rows, all entityIds from a merged group will be joined for the resulting entry.

The goal here was to provide the information necessary to directly access the source manifest tables in a grant-based Synapse project, which is something we've discussed in relation to CCKP entries.

I imagine the entityIds can be listed with the portal entries or can be incorporated into links that say "Edit this metadata" or something along those lines
Selecting the link will direct people to the grant Synapse project - only those with proper access will be able to see/modify the information
While I think it would be ideal for links to lead directly to the DCA, directing people to Synapse tables could be a useful intermediate/stepping stone towards that functionality
instructions for this metadata update process would differ from the community curation/review, since the metadata entries may not be contained in the most recently uploaded manifest (which is what gets pulled by the DCA)
contributors would access the table in their Synapse project, download the CSV, generate a manifest, add the modified entry to the manifest, validate and upload via the DCA
A different approach could be to establish a link that downloads the grant-specific table (or populates a schematic manifest with the table info?). Contributors could then edit and upload via the DCA. This is more straightforward and has fewer steps, but could be more difficult to implement (though I could be wrong)

aclayton555 commented 8 months ago

Takeaways from discussion on 24-2 sprint kick-off, on 2024.02.02

The errors we are picking up with Tools - there's a good chance that many of these are artifacts from when we did our tool cleanup a few years ago. Following that cleanup, there were a number of tools that were not annotate-able and/or were deemed unsuitable for sharing on the portal. We chose to keep them in our database, but turned off their portal visibility via the portalDisplay column of the Portal - Tools Merged v2 table (https://www.synapse.org/#!Synapse:syn26127427/tables/). This does mean, however, that these tool records were still backpopulated to their parent grant project, are now appearing in our union table, and thus flagging errors such as missing annotations. Suggested that Orion compare the validation report to the Portal - Tools Merged v2 to see which errors actually require any action.
Other errors we are seeing on curated resources - these can probably be divided into 1) quick fixes that we can make and push now, 2) errors that require a data model update, 3) errors that require no action (like the Tools artifacts mentioned above and flagged URLs that are actually fine), and 4) errors that we have a record of but will take no action on (e.g. things older than 1 year). For the current portal sync, suggested that we focus on fixes for resources curated in the last year. For older resources, these errors will continue to come up, and we can decide to poke away at these over time (but these have otherwise already been in the portal).
What about a QC report for educational resources? - this is something we will implement, but is not critical for this current portal sync. Orion has been encountering some weird errors in the QC validation for this table, so we can come back to this in our next round of QC checks.
A lot of tables!? So we now have our UNION tables, which brings in all of the entries of a given resource type into a single table (e.g. DatasetView_UNION https://www.synapse.org/#!Synapse:syn52752399/tables/) within our Admin project. Following the QC process, we will have a csv generated (e.g. DatasetView_merged.csv https://www.synapse.org/#!Synapse:syn53461903) which becomes a source of truth for what should be synced to the portal table (e.g. Portal - Datasets Merged https://www.synapse.org/#!Synapse:syn21897968/tables/). So the portal syncing scripts will need to point to the latest QC csv file for each release. Likely do this a bit manually for the current sync, but Verena would like to automate this and leverage the versioning features in Synapse to have the scripts communicate with the latest version.
Longer term, it will be good to think about if/how we can leverage the DFA to help with monitoring our release process, but the current sync is the first end-to-end pilot of the process with our new infrastructure setup. For now, Verena and Orion are mocking up an issue template to help track data release-related steps: https://github.com/mc2-center/data-models/issues/new/choose

did I miss anything?

Bankso commented 8 months ago

This looks great, thank you for writing this up, @aclayton555 !

Bankso commented 8 months ago

Changes to data model noted in this issue are included in PR https://github.com/mc2-center/data-models/pull/70

Metadata validation/database management scripts are addressed by PR #39

Detailed layout of steps is being worked on as part of Issue #36

mc2-center / mc2-center-dcc

Outline and integrate metadata validation processes for CCKP database #37