nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
188 stars 118 forks source link

minor improvement of sort() before denoising with method = "radix #706

Closed ulrichmarkus closed 8 months ago

ulrichmarkus commented 9 months ago

Description of feature

Hi @d4straub,

it is maybe a problem in 0.001% of the cases but could save some resources on the user side. Especially when sampleIDs are not standardized, it could happen that the standard sort() algorithm from R is mixing up sampleIDs which is causing errors at a later point when it comes to mergePairs().

One example I had was a pair of sampleID that looked like this: [1] S1982 and [2] S1982_2. During the pipeline the sampleIDs are used as filenames and extends those with _1/_2 for R1/R2. What is now happening is that the standard sort algorithm values an underscore differently than a point and sorts the second sample before the first for the R2 reads. Thus the minor improvement for dada2_denoising.nf could be to add a sort method = "radix to this step. Radix sort seems to have other internal values for underscore and points than the standard sort().

Standard sort()

filtFs <- sort(list.files("./filtered/", pattern = "_1.filt.fastq.gz", full.names = TRUE))
filtRs <- sort(list.files("./filtered/", pattern = "_2.filt.fastq.gz", full.names = TRUE))
head(filtFs)
[1] "./filtered/M1982_1.filt.fastq.gz"
[2] "./filtered/M1982_2_1.filt.fastq.gz"
head(filtRs)
[1] "./filtered/M1982_2_2.filt.fastq.gz"
[2] "./filtered/M1982_2.filt.fastq.gz"

Radix sort()

filtFs <- sort(list.files("./filtered/", pattern = "_1.filt.fastq.gz", full.names = TRUE), method = "radix")
filtRs <- sort(list.files("./filtered/", pattern = "_2.filt.fastq.gz", full.names = TRUE), method = "radix")
head(filtFs)
[1] "./filtered/M1982_1.filt.fastq.gz"
[2] "./filtered/M1982_2_1.filt.fastq.gz"
head(filtRs)
[1] "./filtered/M1982_2.filt.fastq.gz"
[2] "./filtered/M1982_2_2.filt.fastq.gz"

Best, Markus

d4straub commented 9 months ago

Hi Markus,

thanks for the report and also the proposed solution! I agree, after some more testing that seems like a good addition to make the pipeline more robust.

Best, Daniel

d4straub commented 8 months ago

Thanks for the report again, the change is implemented into the dev version and will be in the next release.