theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[SRA_Fetch] Ensure that the SRA Lite format IS NOT being downloaded #269

Closed cimendes closed 7 months ago

cimendes commented 10 months ago

:bug:

:pencil: Describe the Issue

Ever since the introduction of the SRA Lite format, we've been plagued with TheiaProk failing in trimmomatic due to minimum quality errors. The SRA Lite format encodes all quality scores as 30 for all bases in the file.

When running SRA_Fetch, we must ensure that this file is not the one that is downloaded, and instead fetch the SRA normalized file, where all quality scores are as they should be.

@rpetit3 any suggestions on how to accomplish this with fastq-dl? :)

kapsakcj commented 10 months ago

This is a tricky one that won't have a perfect solution due to SRA occasionally hosting ONLY the SRA Lite formatted FASTQ files (and not the original FASTQ files). I try to report these to NCBI SRA whenever I come across them.

But in most cases (maybe 98% of the time?), SRA does provide the original FASTQ files, so we can tell fastq-dl to restrict its download source to NCBI SRA and not fall back on ENA (I've found that the SRA Lite formatted FASTQ files, rather than the original ones from NCBI, often get synced to ENA).

You can do this by setting fastq_dl_opts to "--only-provider --provider sra"

The current workflow default for fastq_dl_opts is "--provider sra", so it does have the ability to fall back upon ENA when SRA fails.
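To make the difference between the two settings concrete, here is a minimal Python sketch of how each fastq_dl_opts string would expand into a fastq-dl command line. The two option strings come from this thread. The helper name build_fastq_dl_command, the placeholder accession SRR0000000, and the use of the --accession flag are illustrative assumptions, not the workflow's actual task code. Nothing is executed; the sketch only assembles and prints the arguments.

```python
import shlex


def build_fastq_dl_command(accession, fastq_dl_opts="--provider sra"):
    """Return the argument list for a fastq-dl call (hypothetical helper).

    The default mirrors the workflow default quoted above; no process is started.
    """
    # shlex.split turns the free-text option string into separate arguments.
    return ["fastq-dl", "--accession", accession, *shlex.split(fastq_dl_opts)]


# "SRR0000000" is a placeholder accession, not a real run.
default_cmd = build_fastq_dl_command("SRR0000000")
strict_cmd = build_fastq_dl_command("SRR0000000", "--only-provider --provider sra")

print(shlex.join(default_cmd))  # SRA preferred, ENA fallback still possible
print(shlex.join(strict_cmd))   # SRA only, no fallback to ENA
```

With the strict setting, a run that SRA cannot serve fails at download time instead of falling back to ENA without notice.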

Another way to mitigate these failures is to tell trimmomatic what the quality score encoding is by setting trimmomatic_args to "-phred33". That way, even if SRA Lite formatted FASTQs are passed into TheiaProk, the workflow will run without failure. The downside is that the quality scores are "false" and may affect the assembly process and distort QC metrics.

P.S. It's my understanding that Illumina has been using phred33 encoding for a long time (10+ years or something?) so it's pretty safe to pass this option into trimmomatic. FYI this is not a workflow/task default for trimmomatic
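For reference, a small Python sketch of what phred33 encoding means: each quality character's ASCII code minus 33 is the Phred score, so a score of 30 is written as "?". The function name decode_phred33 and the example quality strings are made up for illustration.

```python
def decode_phred33(quality_line):
    """Convert a FASTQ quality string to Phred scores, assuming a +33 offset."""
    return [ord(ch) - 33 for ch in quality_line]


# An SRA Lite style read: every base carries the same score of 30.
print(decode_phred33("?????"))  # [30, 30, 30, 30, 30]

# A read with varied qualities, as expected from original FASTQ files.
print(decode_phred33("IIFF#"))  # [40, 40, 37, 37, 2]
```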

cimendes commented 7 months ago

Linked to #358

sage-wright commented 7 months ago

This issue cannot be resolved because there's no way to stop SRA Lite from being downloaded. Will resolve with a warning in issue #358
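As a rough illustration of what such a warning could key on (not necessarily what #358 implements), the Python sketch below flags FASTQ data in which every sampled quality character is identical, which is the pattern SRA Lite produces. The function name has_uniform_quality and the max_reads cutoff are hypothetical, and the check assumes standard four-line FASTQ records. A genuine dataset with constant qualities would also be flagged, so this is a heuristic only.

```python
import itertools


def has_uniform_quality(fastq_lines, max_reads=1000):
    """Return True if all sampled quality characters are the same.

    fastq_lines is any iterable of text lines (an open file handle works).
    Assumes four-line FASTQ records, so the quality line is every 4th line.
    """
    seen = set()
    # Take lines 4, 8, 12, ... (index 3, step 4) for up to max_reads records.
    for line in itertools.islice(fastq_lines, 3, max_reads * 4, 4):
        seen.update(line.rstrip("\r\n"))
        if len(seen) > 1:
            return False  # varied qualities: looks like original data
    return len(seen) == 1  # empty input yields False


lite_like = ["@r1\n", "ACGT\n", "+\n", "????\n", "@r2\n", "TTGA\n", "+\n", "????\n"]
normal = ["@r1\n", "ACGT\n", "+\n", "IIF#\n"]

print(has_uniform_quality(lite_like))  # True
print(has_uniform_quality(normal))     # False
```

Because the function takes any iterable of lines, it can be fed a handle from open() or gzip.open(path, "rt") without changes.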