theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[SRA-Fetch] Detect SRA-Lite when it's low quality file #512

Closed cimendes closed 2 months ago

cimendes commented 3 months ago

This PR closes #480

🗑️ This dev branch should be deleted after merging to main.

🧠 Aim, Context and Functionality

In SRA-Lite format, rejected reads have a fixed quality score of 3, which is represented by the '$' character. Our previous attempt to auto-detect this format by checking the quality-encoding range did not take this into account.

This fix also reports SRA-Lite when only the ? or $ characters are detected in the first line of quality-encoding characters.

Note: If the read quality contains only ? and $ characters, it will be reported as SRA-Lite. Given that ? encodes Q30 and $ encodes Q3, it is extremely unlikely that this would occur naturally.
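For reviewers who want to see the reasoning spelled out, here is a minimal Python sketch of the check described above. It is not the code in this PR: the function names are hypothetical, it assumes Phred+33 encoding, and it uses a set comparison that should be equivalent to the regex for a single quality line.

```python
PHRED_OFFSET = 33  # assumption: qualities use Phred+33 encoding

# Derive the two SRA-Lite quality characters from their Q-scores.
PASS_CHAR = chr(30 + PHRED_OFFSET)    # Q30 -> '?'
REJECT_CHAR = chr(3 + PHRED_OFFSET)   # Q3  -> '$'
SRA_LITE_CHARS = {PASS_CHAR, REJECT_CHAR}


def first_quality_line(fastq_text: str) -> str:
    """Return the quality string of the first FASTQ record (4th line)."""
    lines = fastq_text.splitlines()
    return lines[3] if len(lines) >= 4 else ""


def looks_like_sra_lite(fastq_text: str) -> bool:
    """True when the first quality line is non-empty and holds only '?' or '$'."""
    quality = first_quality_line(fastq_text)
    return bool(quality) and set(quality) <= SRA_LITE_CHARS


if __name__ == "__main__":
    record = "@read1\nACGTACGT\n+\n$$$$$$$$\n"
    print(looks_like_sra_lite(record))
```

Only the first record is inspected, mirroring the "first line of quality-encoding characters" behaviour described above.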

🛠️ Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No

📋 Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: N/A

Databases or database versions changed: N/A

Data processing/commands changed: N/A

File processing changed: N/A

Compute resources changed: N/A

➡️ Inputs

Nothing changed

⬅️ Outputs

Nothing changed

🧪 Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

The ^[?$]+$ regex was tested individually:

[image: regex test result]
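The same kind of check can be reproduced locally. The sketch below runs the regex against a handful of made-up quality strings; the test strings are illustrative only and are not taken from the PR's test data.

```python
import re

# Inside a character class, '?' and '$' are literal characters.
pattern = re.compile(r"^[?$]+$")

# Hypothetical quality lines mapped to the expected SRA-Lite verdict.
cases = {
    "????????": True,   # all Q30
    "$$$$$$$$": True,   # all Q3 (rejected reads)
    "??$$??$$": True,   # mix of the two SRA-Lite characters
    "FFFF:FFF": False,  # ordinary quality characters
    "????F???": False,  # one non-SRA-Lite character is enough to fail
    "": False,          # an empty line does not match '+'
}

results = {quality: bool(pattern.match(quality)) for quality in cases}

if __name__ == "__main__":
    for quality, expected in cases.items():
        status = "ok" if results[quality] == expected else "MISMATCH"
        print(f"{quality!r:12} expected={expected} got={results[quality]} {status}")
```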

Terra Testing

Suggested Scenarios for Reviewer to Test

Samples with Q30, Q3, and normal quality encoding. To "force" the SRA-Lite format download, use "--sra-lite --provider sra --only-provider" as the fastq_dl_opts input argument on SRA_Fetch.

Theiagen Version Release Testing (optional)

🔬 Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)