Develop and implement HTAN/CDS seq template

aclayton555 commented 2 months ago

Historically, we have leveraged the "Other Assay" template as a catch-all for data types for which we do not yet have an assay-specific RFC and component available in our data model. This has allowed contributors to proceed with data submission and annotation under "Other Assay." Later, once the RFC has been completed and the assay-specific component has been implemented in the data model, the HTAN DCC has re-engaged with contributors to update their annotations from the "Other Assay" template to the respective assay-specific template.

As we approach the end of HTAN 1.0, we have data types that need to be submitted for which we currently do not have templates in place (e.g. bulk ATACseq). We could have these submitted using the "Other Assay" template; however, we are trying to move away from this template to the extent possible through the end of HTAN 1.0 and to have as much data as possible annotated according to an assay-specific template. Furthermore, the "Other Assay" template does not provide sufficient information to allow for mapping and transfer to CDS.

This ticket proposes developing a data-level-agnostic, minimal sequencing data template as a catch-all for the remaining expected sequencing data. This template would capture relevant metadata per the existing HTAN data model AND be readily mapped to the existing CDS seq metadata template, to enable transfer of data submitted under this template to CDS (through the remainder of HTAN 1.0).

Re: level-agnostic, the approach here is to create a low-lift template for contributors to complete for their various file types. Once received by the HTAN DCC, the DCC will determine how submitted files should be organized according to existing levels and tiers of controlled access (e.g. if a fastq is submitted, assume L1 and controlled access).
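
The triage step described above could be sketched as a small lookup. This is an illustrative sketch only: the `assign_level` helper, the `Placement` type, and the rule table are hypothetical, and the only rule taken from this thread is fastq to L1, controlled access. Anything unmatched returns `None` so it can be reviewed manually by the DCC.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Placement(NamedTuple):
    level: str    # HTAN data level, e.g. "L1"
    access: str   # access tier, e.g. "controlled"


# Hypothetical rule table keyed by file suffix. Only the fastq rule comes
# from this thread; other formats would be added by the DCC.
RULES = {
    ".fastq": Placement("L1", "controlled"),
    ".fq": Placement("L1", "controlled"),  # assumption: treated like .fastq
}


def assign_level(filename: str) -> Optional[Placement]:
    """Return the placement for a submitted file, or None if no rule matches."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore a trailing compression suffix so "reads.fastq.gz" matches ".fastq".
    if suffixes and suffixes[-1] in {".gz", ".bz2"}:
        suffixes = suffixes[:-1]
    if not suffixes:
        return None
    return RULES.get(suffixes[-1])
```

For example, `assign_level("sample_R1.fastq.gz")` yields the L1/controlled placement, while a format with no rule comes back as `None` instead of being guessed.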

aditigopalan commented 2 months ago

Keep it level-agnostic for contributors to submit; we will break it down after submission.
Clarisse has a mapping file (a YAML file in the data-release-tracker repo in ncihtan) that maps attributes from the HTAN model to the CDS model (a starting point for a template).
The template could be in a sheet-based format to begin with.

aclayton555 commented 2 months ago

Discussed on 2024.05.07 HTAN DCC Ops call:

Consider calling this "Other Sequencing Assay"
Build in attributes from the existing "Other Assay" template that provide assay descriptors for the portal. CDS does not currently distinguish single-cell vs bulk, so we can account for this in the template.
Overall, group supportive of this approach

@aditigopalan and @clarisse-lau can you please work together on this? Please let me know if it would be helpful to set up a time to work on this further. THANK YOU!

clarisse-lau commented 2 months ago

Thank you @aclayton555 and @aditigopalan! (Small correction: the mapping file is in the ncihtan/cds_dbgap repo.) I think it would be helpful to set up a call to talk through things. I have a couple of pending meetings, but this is my current availability this week:

Wednesday 10:30-11:30am, 12:30-1:30 PT
Thursday 9am or 12pm PT
Friday 8am, 10am-12pm PT

aditigopalan commented 1 month ago

Sorry, I missed this! Are you still available 10:30am PT tomorrow? @clarisse-lau

clarisse-lau commented 1 month ago

No worries! 10:30 tomorrow works

aditigopalan commented 1 month ago

Just sent you an invite!

clarisse-lau commented 1 month ago

A thought on this... Component is one of the source attributes used in the CDS mapping file (it is used to map library_strategy and library_source values; see the CDS template).

As we cannot have Component twice in a template, we could instead include the Data Type attribute, which should provide sufficient information to map to the above fields.
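
One way to picture the substitution: the lookup that would have been keyed on Component is keyed on Data Type instead. The sketch below is assumption-heavy. The `DATA_TYPE_TO_CDS` table, its keys, and its values are placeholders chosen for illustration, not entries copied from the actual mapping file, and `map_to_cds` is a hypothetical helper.

```python
from typing import Dict, Optional

# Placeholder table: keys and values are illustrative, not taken from the
# real mapping file in ncihtan/cds_dbgap.
DATA_TYPE_TO_CDS: Dict[str, Dict[str, str]] = {
    "Bulk ATAC-seq": {"library_strategy": "ATAC-seq", "library_source": "Genomic"},
    "Bulk RNA-seq": {"library_strategy": "RNA-Seq", "library_source": "Transcriptomic"},
}


def map_to_cds(row: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Look up CDS library fields from a manifest row's Data Type value."""
    data_type = row.get("Data Type", "").strip()
    # Unknown data types return None so they can be flagged, not guessed.
    return DATA_TYPE_TO_CDS.get(data_type)
```

Returning `None` for an unrecognized Data Type keeps the gap visible, which seems preferable to silently writing a default library_strategy.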

aditigopalan commented 1 month ago

Here is the template for now. I replaced one of the Component columns with "Data Type".

output.csv

Also here are the attributes: "Last Known Disease Status, Primary Diagnosis, Fixative Type, Treatment Outcome, SizeX, age_at_diagnosis_years, Genomic Reference, Race, NominalMagnification, Days to Recurrence, Morphology, Filename, SizeY, pi_last, Library Selection Method, Days to Last Known Disease Status, file_url_in_cds, pi_first, HTAN_Center, Biospecimen Type, Sequencing Platform, Tseries, Ethnicity, Tissue or Organ of Origin, File Format, SizeZ, PhysicalSizeY, channel_metadata_url, Microscope, Days to Last Follow up, HTAN Data File ID, LensNA, PhysicalSizeX, HTAN Participant ID, Library Layout, Vital Status, Software and Version, Treatment Type, Tumor Tissue Type, Objective, Pyramid, HTAN Biospecimen ID, SizeT, SizeC, Protocol Link, pi_email, Imaging Assay Type, Component, cancer_type, Site of Resection or Biopsy, WorkingDistance, md5, Immersion, Gender, File_Size, Zstack, Progression or Recurrence, Tumor Grade"

Should we rearrange the fields for clarity? Would it also help to have a definition of some fields (e.g. SizeZ), or would the users be familiar with these names?

@clarisse-lau let me know what you think!

clarisse-lau commented 1 month ago

Thank you @aditigopalan !

As this template is intended to be specific to sequencing data, we can remove imaging-related attributes. The genomics mapping goes up to Row 940 of the mapping file and is followed by mappings for various imaging metadata tables, which don't need to be pulled into this CDS template.
The attributes included in the CDS template can be subset to only those found in the HTAN Data Model. There are a few 'source' attributes included (e.g. HTAN_Center, file_size, file_url_in_cds) that are not actually part of the HTAN data model; they are instead added at various points in the data flow process, either to BigQuery tables or as part of the CDS manifest preparation script. Sorry for the confusion.

These changes should simplify the CDS template quite a bit, and as it would only include existing data model elements, users will have access to definitions for each field from the HTAN data model.

Some rearranging to align with HTAN template conventions can be done at the implementation stage, i.e. using DependsOn in the data model; keeping it as an unordered CSV for now is fine. Typically, HTAN manifests start with the following attributes in this order: Component, Filename, File Format, HTAN Parent Biospecimen ID, HTAN Data File ID, followed by all other metadata attributes.
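
For the draft CSV, the ordering convention above could be applied with a few lines of Python. This is only a sketch for reordering a flat column list; `order_columns` is a hypothetical helper, and in the data model itself the order would come from DependsOn as noted.

```python
from typing import Iterable, List

# Leading columns, in the order given in the comment above.
LEADING = [
    "Component",
    "Filename",
    "File Format",
    "HTAN Parent Biospecimen ID",
    "HTAN Data File ID",
]


def order_columns(columns: Iterable[str]) -> List[str]:
    """Put the conventional leading columns first, keep the rest in input order."""
    cols = list(dict.fromkeys(columns))  # de-duplicate, preserving order
    head = [c for c in LEADING if c in cols]
    tail = [c for c in cols if c not in LEADING]
    return head + tail
```

Leading columns that are absent from the input are simply skipped, so the helper can be run on partial drafts.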

clarisse-lau commented 1 month ago

Just had a chat with Ashley & Adam. We'd like to subset the attribute list even further to include only sequencing attributes (plus the descriptor columns: Component, Filename, File Format, HTAN Parent Biospecimen ID, HTAN Data File ID, Data Type).

Clinical/biospecimen fields will be annotated separately by the center and pulled in from those templates respectively (as is currently done in the metadata generation scripts).

aclayton555 commented 1 month ago

@aditigopalan just checking in on this, and whether there is anything you need the team to review at this stage. We are aiming to have this implemented and available for the Stanford center to test with the close-out of our 24-5 sprint.

aditigopalan commented 1 month ago

Thanks for checking in! Please let me know if this needs to be subsetted further @aclayton555 @adamjtaylor

output.csv

adamjtaylor commented 1 month ago

Thanks @aditigopalan! I think we only need the attributes that actually come from the sequencing technology, as the others will come from our Biospecimen and Clinical elements. So let's drop those and keep:

Genomic Reference
Library Layout
Data Type
Sequencing Platform
Library Selection Method

Plus the minimal HTAN columns for a component

HTAN Data File ID
HTAN Parent Biospecimen ID
Filename
File Format

adamjtaylor commented 1 month ago

@aditigopalan if you can open a draft PR and link to this issue that would be useful. Thank you!

aclayton555 commented 1 month ago

Add "CDS" prefix to all attributes for this template
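
A minimal sketch of the renaming, using the attribute list agreed earlier in this thread. The separator between the prefix and the name (a space here) is an assumption, as is the `add_prefix` helper itself; the names actually used in the data model may differ.

```python
from typing import List

# Attribute names as listed earlier in this thread.
ATTRIBUTES = [
    "HTAN Data File ID",
    "HTAN Parent Biospecimen ID",
    "Filename",
    "File Format",
    "Genomic Reference",
    "Library Layout",
    "Data Type",
    "Sequencing Platform",
    "Library Selection Method",
]


def add_prefix(names: List[str], prefix: str = "CDS", sep: str = " ") -> List[str]:
    """Prefix each name once; names that already carry the prefix are unchanged."""
    marker = prefix + sep  # assumption: prefix and name are joined by `sep`
    return [n if n.startswith(marker) else marker + n for n in names]
```

The check for an existing prefix makes the helper safe to re-run on a list that has already been renamed.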

adamjtaylor commented 1 month ago

Merged! @aditigopalan if you could generate the template using schematic or staging DCA and report if it looks sensible that would be great.

aditigopalan commented 1 month ago

@adamjtaylor tested using dca-staging! Looks alright to me.

aclayton555 commented 1 month ago

AMAZING (and radical) COLLABORATION ON THIS!

adamjtaylor commented 1 month ago

Not quite out of the woods yet! @aditigopalan is chasing down an errant loop in the DAG that is dragging some extra attributes into the template.
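
For anyone chasing a similar loop, a generic depth-first search can report one cycle in a dependency graph. This is a sketch under assumptions: it treats the dependencies as a plain dict from attribute name to the list of attributes it depends on, which may not match how schematic represents the graph internally, and `find_cycle` is a hypothetical helper, not part of any existing tool.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for nxt in depends_on.get(node, []):
            if state.get(nxt, WHITE) == GREY:
                # Back edge: the cycle is the path from nxt to node, closed.
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for start in depends_on:
        if state.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None
```

The returned list starts and ends on the same node, so the offending chain of dependencies can be read off directly.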

ncihtan / data-models

Develop and implement HTAN/CDS seq template #396