nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
180 stars 113 forks source link

phred score mis-detection by dada2 #773

Open amakunin opened 3 weeks ago

amakunin commented 3 weeks ago

Description of the bug

I’ve been testing ampliseq on some Element Biosciences data using standard Illumina settings and discovered an interesting issue where DADA2_ERR fails with error that looks like this

  Error in dada(drps, err = NULL, errorEstimationFunction = errorEstimationFunction,  : 
    Invalid derep$quals matrix. Quality values must be positive integers.
  Calls: learnErrors -> dada
      Execution halted

Turns out, the readFastq function that is being called under the hood mis-interprets ElemBio Phred scores as Phred+64, not Phred+33 - probably because maximium score in ElemBio is 55, while for Illumina it is 41.

I was able to work around this by changing qualityType = "Auto" to qualityType = "FastqQuality" in DADA2_ERR module config.

In addition, I needed to adjust DADA2_DENOISING code to use appropriate qualityType in reading fastq files:

        filtFs <- sort(list.files("./filtered/", pattern = "_1.filt.fastq.gz", full.names = TRUE), method = "radix")
        filtRs <- sort(list.files("./filtered/", pattern = "_2.filt.fastq.gz", full.names = TRUE), method = "radix")

        #denoising
        sink(file = "${prefix}.dada.log")
        derepFs <- derepFastq(filtFs, qualityType="FastqQuality")
        dadaFs <- dada(derepFs, err = errF, $args, multithread = $task.cpus)
        saveRDS(dadaFs, "${prefix}_1.dada.rds")
        derepRs <- derepFastq(filtRs, qualityType="FastqQuality")
        dadaRs <- dada(derepRs, err = errR, $args, multithread = $task.cpus)
        saveRDS(dadaRs, "${prefix}_2.dada.rds")
        sink(file = NULL)

I would appreciate if it would be possible to explicitly set qualityType in DADA2_ERR and DADA2_DENOISING modules - though I’m not sure if this should be:

Command used and terminal output

nextflow run nf-core/ampliseq -r 2.11.0 --input $manifest \
        -c $config -profile singularity --outdir ampliseq --skip_fastqc \
        --FW_primer AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --RV_primer AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA \
        --retain_untrimmed --illumina_pe_its --skip_qiime --skip_taxonomy \
        --ignore_empty_input_files --ignore_failed_trimming --ignore_failed_filtering --skip_barrnap -resume

Relevant files

No response

System information

linux x86_64 nextflow =24.04.02 ampliseq =2.11.0 singularity =3.10.0 local executor

amakunin commented 3 weeks ago

UPD - another place where qualities should be specified is DADA2_FILTNTRIM - qualityType = "Auto" is specified in ext.args section of module config as of ampliseq 2.11.0.

I am also unsure if qualityType should be specified in DADA2_QUALITY - the main functionplotQualityProfile does not seem to accept qualityType as an argument thogh

erikrikarddaniel commented 3 weeks ago

Thanks for reporting, and thanks for the update. The main developer is on vacation, so nothing will happen for a few weeks. We might come back to you with further questions when he's back.

amakunin commented 2 weeks ago

And another small update - unfortunately I won't be able to finish the test run. The reverse reads in my dataset had very low qualities, so did not pass DADA2_DENOISING (1_2.dada.rds file was not generated). As a result DADA2_STATS module failed with dadaRs = readRDS("${denoised[1]}") being transformed into dadaRs = readRDS("null") - maybe it would be possible to add a check if two files are being generated by denoising for paired-end data here?

d4straub commented 1 week ago

Hi there,

a separate pipeline parameter setting phred to default (“Auto” ), 33 (“FastqQuality”) or 64 (“SFastqQuality”)

that sounds like a good addition to me. However, test data (that could also be uploaded to our test data repo https://github.com/nf-core/test-datasets/tree/ampliseq) would be extremely important.

DADA2_QUALITY is primarily for visualization and also aids to find length cutoffs for DADA2_FILTNTRIM, so it seems not surprising that DADA2_QUALITY has no quality type setting.

amakunin commented 1 week ago

Hi @d4straub - it seems that DADA2_QUALITY also auto-infers Phred score, but in a different way - plotErrors function calls qa function from ShortRead library - I could not figure out if it supports different Phred scores, or just auto-infers Phred+33. In my case it correctly interpreted the data as Phred+33.

I'd be happy to provide some test data, but for now I only have forward reads of decent quality, while all reverse reads have low quality - I'll need to either generate fake Phred scores or wait for better data to appear. Please also keep in mind that my data is not compatible with downstream applications like QIIME, so it might be easier to replace a few Phred chars corresponding to base qualities between 41-55 in your existing test datasets

d4straub commented 2 days ago

If only forward reads are ok, isnt that also sufficient for implementing phred score flexibility? I would think so.