theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Add a warning column to SRA_Fetch if SRALite file format is detected #358

Closed cimendes closed 6 months ago

cimendes commented 7 months ago

:cool:

:pushpin: Explain the Request

SRA has a pesky little format called SRALite in which every Phred quality score is set to 30 (the '?' character in Phred+33 encoding). We have set fastq-dl to fetch from SRA, but sometimes this fails and it falls back to ENA; in some of those cases, SRALite files are downloaded. Additionally, sometimes the original files in SRA have been replaced by SRALite at the source.

A really nice-to-have would be to warn the user, via a warning column, that the downloaded file has been detected as SRALite. To check, we'll have to parse and condense the quality scores found in the file; if they are all '?', we have a pesky little SRALite! The warning column would remain empty if the file is found to be "normal".
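A minimal sketch of that check, assuming a standard four-line-per-record FASTQ. The function names and the warning text are hypothetical and not part of SRA_Fetch or fastq-dl; `max_reads` is an optional cap for sampling only the first records instead of the whole file.

```python
import gzip
import io
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def quality_chars(handle, max_reads=None):
    """Condense all quality strings into the set of distinct characters seen."""
    seen = set()
    # In a 4-line FASTQ record, the quality string is every 4th line (index 3, 7, ...).
    qual_lines = islice(handle, 3, None, 4)
    if max_reads is not None:
        qual_lines = islice(qual_lines, max_reads)
    for line in qual_lines:
        seen.update(line.rstrip("\r\n"))
    return seen


def sralite_warning(handle, max_reads=None):
    """Return a warning string if every quality character is '?', else an empty string."""
    if quality_chars(handle, max_reads) == {"?"}:
        return "SRALite detected: all quality scores are Phred 30 ('?')"
    return ""


# Usage with in-memory data; a real call would pass open_fastq("reads.fastq.gz").
sralite = io.StringIO("@r1\nACGT\n+\n????\n@r2\nGGCA\n+\n????\n")
normal = io.StringIO("@r1\nACGT\n+\nIIF?\n@r2\nGGCA\n+\n??#5\n")
print(repr(sralite_warning(sralite)))
print(repr(sralite_warning(normal)))
```

Returning an empty string for "normal" files matches the idea of a warning column that stays empty unless SRALite is found.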

michellescribner commented 7 months ago

Note from discussion: could we determine the quality encoding before running Trimmomatic and pass it to the trimmomatic command? (The Trimmomatic failure with SRALite data is thought to be due to a failure to detect the encoding.)
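One way this could look, as a rough sketch only: guess the encoding from the quality characters and pick Trimmomatic's `-phred33` or `-phred64` flag up front. The function name and the cut-offs (59 and 74) are assumptions chosen for illustration, and treating an all-'?' file as Phred+33 follows the SRALite description above rather than anything confirmed about Trimmomatic's internals.

```python
def guess_phred_flag(quality_strings):
    """Guess a Trimmomatic encoding flag from an iterable of quality strings.

    Returns "-phred33", "-phred64", or None when the evidence is ambiguous.
    """
    chars = set().union(*quality_strings)
    if not chars:
        return None  # no reads, nothing to decide
    if chars == {"?"}:
        # Assumption: a file made only of '?' is SRALite, i.e. Phred 30 in Phred+33.
        return "-phred33"
    lowest, highest = ord(min(chars)), ord(max(chars))
    if lowest < 59:
        # Characters below ';' (ASCII 59) do not occur in the +64 encodings.
        return "-phred33"
    if lowest >= 64 and highest > 74:
        # Heuristic cut-off: nothing below '@' and something above 'J'.
        return "-phred64"
    return None


# Usage: build the argument list only when a flag could be determined.
# The file names and the trimming step are placeholders.
flag = guess_phred_flag(["????", "????"])
args = ["trimmomatic", "SE"]
if flag is not None:
    args.append(flag)
args += ["reads.fastq.gz", "trimmed.fastq.gz", "SLIDINGWINDOW:4:20"]
print(args)
```

Passing the flag explicitly means Trimmomatic would not need to infer the encoding from a file whose quality values carry no variation.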