ncihtan / data-models

Schema.org Data Models for HTAN
MIT License

[BulkRNAseq Level 2] Updates to accommodate Salmon files #399

Closed: aclayton555 closed this issue 1 month ago

aclayton555 commented 1 month ago

This ticket emerges from HTAN-446: https://sagebionetworks.jira.com/browse/HTAN-446

Our bulk RNA-seq data consists of FASTQ Level 1 files (for which there is an existing assay, no issues there) and quant.sf Level 2 files from Salmon. The BulkRNAseq Level 2 assay doesn’t seem to work well for them: .sf is not a valid extension; Salmon is not an option for Alignment Workflow Type; and an additional index file is required. OtherAssay would also not work well for these files, as they are Level 2 and I don’t think OtherAssay has a parent column.

Actions required:

aclayton555 commented 1 month ago

@adamjtaylor do you have bandwidth to implement this update and push this interim release?

adamjtaylor commented 1 month ago

Reading through our current implementation of the data model, there are some relevant points to consider:

Alignment Workflow Type: Generic name for the workflow used to analyze a data set. Valid values do not include Salmon, but do include Other Workflow Type, which, when selected, adds additional options for a custom aligner. A None option is also available.

Pseudo Alignment Used: Pseudo aligners such as Kallisto or Salmon do not produce aligned-read BAM files; True indicates pseudoalignment was used. This takes a Yes/No answer and, if Yes is selected, requires additional columns about the pseudo aligner used. This is an attribute used by Bulk RNA-seq Level 3.

So I think the conclusion here is that Stanford should submit their .sf files as Bulk RNA-seq Level 3, not Level 2, and use the Pseudo Alignment Used option.

The linked PR therefore just adds sf as a valid value for File Format.
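
For anyone preparing a manifest along these lines, the recommendation can be sketched as a small pre-submission check. This is only an illustration, not part of the data model or its validator: the `check_salmon_row` helper, the `Filename` and `Component` column names, and the `BulkRNA-seqLevel3` component value are assumptions made for the example. Only `File Format`, `sf`, and `Pseudo Alignment Used` come from the discussion above.

```python
# Hypothetical pre-submission check for Salmon quant.sf rows.
# ASSUMPTIONS: the "Filename" and "Component" column names and the
# "BulkRNA-seqLevel3" value are illustrative and may not match the real template.

def check_salmon_row(row: dict) -> list:
    """Return a list of problems found in one manifest row for a .sf file."""
    problems = []
    if not row.get("Filename", "").endswith(".sf"):
        return problems  # not a Salmon quant file, so nothing to check here
    if row.get("Component") != "BulkRNA-seqLevel3":
        problems.append("quant.sf files should be submitted as Bulk RNA-seq Level 3")
    if row.get("File Format") != "sf":
        problems.append("File Format should be 'sf'")
    if row.get("Pseudo Alignment Used") != "Yes":
        problems.append("Pseudo Alignment Used should be 'Yes'")
    return problems


row = {
    "Filename": "sample1/quant.sf",
    "Component": "BulkRNA-seqLevel2",  # wrong level, flagged below
    "File Format": "sf",
    "Pseudo Alignment Used": "Yes",
}
for problem in check_salmon_row(row):
    print(problem)
```

Rows that do not end in `.sf` are passed through untouched, so the check could be run over a mixed manifest without flagging FASTQ or BAM entries.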