ncihtan / data-models

Schema.org Data Models for HTAN
MIT License

Bulk WES L3 issues #48

Closed adamjtaylor closed 2 years ago

adamjtaylor commented 2 years ago

Information

Proposed change:

A clear and concise description of the proposed data model amendments

In the Bulk WES level 3 metadata template, the germline, somatic, and structural variant workflows are all required for each sample. Each of our samples is analyzed by only one workflow, not all three. Leaving the other workflow fields empty, or entering NA or Not Applicable, is not allowed, which makes our metadata invalid. Any suggestions to solve this?

Discussion link:

Link to where the changes have been discussed/approved eg Closed RFC, Service desk ticket, DCC coordination/operations agenda HTAN-35


Additional context

Identified from publication review

adamjtaylor commented 2 years ago

The following attributes are required in the schema and have valid values set

Each has an "Other" value that enables submission of a custom workflow, but none has a "None"/"Not used"/"NA"/"Not Applicable" option.

Options to fix:

  1. Suggest that they use the custom "Other" value and enter "None". (+) No change in schema required. (-) Not intuitive; the string is uncontrolled.
  2. Add a "Workflow not used" option to the valid values. (+) Clarity. (-) Minor schema change. (?) Would need to agree on language.
  3. [SELECTED] Move to a single workflow URL / workflow type as detailed in the RFC table. (+) Clean. (-) Centers have to enter additional rows per workflow.
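To make the validation problem concrete, here is a minimal sketch of a controlled-vocabulary check. The value names `Workflow A` and `Workflow B` and the `is_valid` helper are hypothetical stand-ins, not the actual schema or validator; only "Other" and "Workflow not used" come from the discussion above.

```python
# Hypothetical valid-value list for one workflow attribute; the real
# attribute names and values are not reproduced in this thread.
CURRENT_VALUES = {"Workflow A", "Workflow B", "Other"}


def is_valid(value, valid_values, required=True):
    """Return True if a cell passes a simple controlled-vocabulary check."""
    if value is None or value.strip() == "":
        # Empty cells only pass when the attribute is optional.
        return not required
    return value in valid_values


# Today: a sample that skipped this workflow has no valid entry.
print(is_valid("", CURRENT_VALUES))                # False
print(is_valid("Not Applicable", CURRENT_VALUES))  # False

# Option 2: extend the controlled vocabulary.
OPTION_2_VALUES = CURRENT_VALUES | {"Workflow not used"}
print(is_valid("Workflow not used", OPTION_2_VALUES))  # True
```

Passing `required=False` also lets the empty cell through, which is roughly the effect of making the attribute optional instead of extending the vocabulary.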

@elv-sb what do you think?

adamjtaylor commented 2 years ago

I should actually read the source ticket! The proposed solution was actually more drastic:

Ok, thanks! Just like the google sheet. Ah I found a bug in the google sheet, there were two columns with same name. OK fixed. There should be one column called “Workflow Type” where they select the type of data (SNPs, germline, or CNVs), and one column named “Workflow Name” where they put a name to the tool-family used in producing the data. One “Workflow URL” column.

adamjtaylor commented 2 years ago

The solution proposed in the ticket is what is in the RFC spreadsheet, updated by @Gibbsdavidl in Feb.

Implementing this would assume that only one workflow is run (as per the ticket). Are there cases where multiple workflow types are run for a single Bulk WES L3 file?

Gibbsdavidl commented 2 years ago

Yes, they may run more than one workflow, but the thought was that each would produce a separate result and should get its own metadata. For example, a center could run a SNP calling workflow and a copy number workflow; those would be different results with different metadata, rather than a combined result set. In TCGA, copy number and SNP results are separate.
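As a rough illustration of the one-row-per-result layout, the sketch below models the three columns named in the ticket ("Workflow Type", "Workflow Name", "Workflow URL"). The `WorkflowRow` class, the `filename` key, the workflow names, and the example.org URLs are all hypothetical placeholders, not part of the HTAN data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowRow:
    """One metadata row describing one Bulk WES L3 result."""
    filename: str       # hypothetical key identifying the result file
    workflow_type: str  # "Workflow Type" column
    workflow_name: str  # "Workflow Name" column
    workflow_url: str   # "Workflow URL" column


def rows_for_sample(sample_id, runs):
    """Build one row per workflow run instead of one wide row per sample."""
    return [
        WorkflowRow(
            filename=f"{sample_id}.{wf_type.lower()}.result",
            workflow_type=wf_type,
            workflow_name=name,
            workflow_url=url,
        )
        for wf_type, name, url in runs
    ]


rows = rows_for_sample(
    "sample1",
    [
        ("SNPs", "placeholder-snp-caller", "https://example.org/snp-workflow"),
        ("CNVs", "placeholder-cnv-caller", "https://example.org/cnv-workflow"),
    ],
)
for row in rows:
    print(row.filename, row.workflow_type)
```

A sample analyzed by a single workflow yields a single row, so no field has to be filled with a placeholder value.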

adamjtaylor commented 2 years ago

Thanks @Gibbsdavidl. If we implement the changes per the RFC, we should do a quick review of what metadata has been submitted for these attributes in the existing template, to get a sense of how many centers will need to resubmit metadata to meet the updated schema.
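A review like this could be sketched as a count of submitted rows, per center, that populate any of the existing workflow attributes. The column names in `LEGACY_COLUMNS`, the `Center` key, and the sample rows are hypothetical; the real manifest columns are not listed in this thread.

```python
from collections import Counter

# Hypothetical names for the existing per-workflow attributes.
LEGACY_COLUMNS = (
    "Germline Workflow",
    "Somatic Workflow",
    "Structural Variant Workflow",
)


def centers_needing_resubmission(rows, legacy_columns=LEGACY_COLUMNS):
    """Count, per center, the rows with a value in any legacy column."""
    counts = Counter()
    for row in rows:
        # Treat missing, None, and whitespace-only cells as empty.
        if any((row.get(col) or "").strip() for col in legacy_columns):
            counts[row["Center"]] += 1
    return dict(counts)


submitted = [
    {"Center": "Center 1", "Germline Workflow": "Other"},
    {"Center": "Center 1"},
    {"Center": "Center 2", "Somatic Workflow": "  "},
]
print(centers_needing_resubmission(submitted))  # {'Center 1': 1}
```

An empty result means no center has populated these attributes, so no resubmission would be needed.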

adamjtaylor commented 2 years ago

Looks like we have no Bulk WES level 3 data yet, so we should be good to implement the changes per the RFC

adamjtaylor commented 2 years ago

@elv-sb and I agree that we will action this once the data model to DCA flow is unblocked

adamjtaylor commented 2 years ago

Closing as this has now been implemented by setting the workflows to not be required. Reported back to @vthorsson by email. Can re-open if Vanderbilt still have any template issues with metadata submission

adamjtaylor commented 2 years ago

Re-opening as per HTAN-35: we also need to add "none" to the valid workflow values.