nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Filtering low read count samples during QC #556

Closed d4straub closed 1 year ago

d4straub commented 1 year ago

Description of feature

Currently (2.5.0), empty input files can be ignored with --ignore_empty_input_files, and samples that fail trimming can be ignored with --ignore_failed_trimming. Whether an input file counts as "empty" is decided from the compressed fastq file size (< 1 KB) using file.size(); see subworkflows/local/parse_input.nf and subworkflows/local/cutadapt_workflow.nf. A better solution might be file.countFastq(); example from here:

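// Keep only fastq files that contain more than 25000 reads
channel
    .fromPath( 'data/yeast/reads/*.fq.gz' )
    // pair each file with its number of fastq records
    .map { file -> [file, file.countFastq()] }
    .filter { file, numreads -> numreads > 25000 }
    .view { file, numreads -> "file $file contains $numreads reads" }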

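This way an exact read count threshold could be defined, and even changed by the user if desired. The disadvantage might be the computational overhead: every fastq file has to be decompressed and read in full to count its records, whereas file.size() only looks at the file size.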
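
A minimal sketch of how a user-adjustable threshold could look, not a tested implementation for ampliseq. It assumes a hypothetical parameter params.min_reads (not an existing pipeline option) and reuses the example path from above. Instead of silently filtering, it uses branch so that samples below the threshold can be reported:

```nextflow
// Hypothetical parameter for illustration; NOT an existing ampliseq option
params.min_reads = 1000

workflow {
    channel
        .fromPath( 'data/yeast/reads/*.fq.gz' )
        // pair each file with its number of fastq records
        .map { fastq -> [fastq, fastq.countFastq()] }
        // split into samples that meet the threshold and samples that do not
        .branch { fastq, numreads ->
            pass: numreads >= params.min_reads
            fail: true
        }
        .set { ch_reads }

    // report samples that would be dropped
    ch_reads.fail.view { fastq, numreads ->
        "Ignoring ${fastq.name}: ${numreads} reads (< ${params.min_reads})"
    }

    // samples that would continue into the workflow
    ch_reads.pass.view { fastq, numreads ->
        "Keeping ${fastq.name}: ${numreads} reads"
    }
}
```

With this layout the threshold could be changed at run time with --min_reads on the command line. Note that countFastq() is called inside the map closure, so as far as I understand the counting happens in the main Nextflow process rather than in a submitted task, which is where the overhead mentioned above would show up.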