nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

splitFastq does not split beyond the second file in PE mode #4798

Open JohnMMa opened 8 months ago

JohnMMa commented 8 months ago

Bug report

Expected behavior and actual behavior

The current Nextflow docs for splitFastq state:

Finally the splitFastq operator is able to split paired-end read pair FASTQ files. It must be applied to a channel which emits tuples containing at least two elements that are the files to be split.

while the description for the pe argument states:

When true splits paired-end read files, therefore items emitted by the source channel must be tuples in which at least two elements are the read-pair files to be split.

This implies that when splitFastq is used with pe: true, it should split every FASTQ file in each channel entry, however many there are. However, as the output below shows, only the first two files are split. This wasn't a problem (yet) in 2019, but it is one now because some single-cell sequencing platforms require 3 FASTQ files as input.

Steps to reproduce the problem

Channel
    .fromFilePairs('test/test*_{R1,R2,I1}_[0-9][0-9][0-9].fastq.gz', size:3, flat:true)
    .splitFastq(by: 10, pe:true, file:true)
    .view()

Program output

N E X T F L O W  ~  version 23.10.1
Launching `splitFastq_report.nf` [trusting_picasso] DSL2 - revision: 58e150738f
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.1.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.1.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.2.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.2.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.3.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.3.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.4.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.4.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.5.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.5.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.6.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.6.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.7.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.7.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.8.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.8.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.9.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.9.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]
[test_S1_L001, /home/jma/Documents/work/35/01beb071fcd1260524d0f1b592a777/test_S1_L001_I1_001.10.fastq, /home/jma/Documents/work/00/11bcf5753d6e3968c95dc2829f7535/test_S1_L001_R1_001.10.fastq, /home/jma/Documents/test/test_S1_L001_R2_001.fastq.gz]

Environment

Additional context

I think the issue is in the following code block in SplitOp.groovy (currently lines 92-96), which hard-codes the indices:

        if( params.pe == true ) {
            indexes = [-1,-2]
            multiSplit = true
            pairedEnd = true
        }

A fix, however, would require the operator to read at least one entry from the source channel in order to determine indexes. I don't know enough Groovy/Java to say whether this is possible at all. If it is not, then the documentation should be changed instead.
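To illustrate what "determining indexes from an entry" could mean, here is a minimal plain-Groovy sketch. The helper name fileIndexes is hypothetical and not part of Nextflow. It assumes each entry is a list whose file elements are java.nio.file.Path or java.io.File objects, and it does not address how SplitOp would get hold of the first entry before the split is configured.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper (not a Nextflow API): return the positions of all
// file-like elements in one entry emitted by the source channel.
List<Integer> fileIndexes(List entry) {
    def result = []
    entry.eachWithIndex { item, i ->
        // Assumption: files arrive as Path or File objects, not plain strings
        if( item instanceof Path || item instanceof File )
            result << i
    }
    return result
}

// Example entry shaped like the fromFilePairs(..., size:3, flat:true) output above
def entry = [
    'test_S1_L001',
    Paths.get('test_S1_L001_I1_001.fastq.gz'),
    Paths.get('test_S1_L001_R1_001.fastq.gz'),
    Paths.get('test_S1_L001_R2_001.fastq.gz')
]

assert fileIndexes(entry) == [1, 2, 3]
```

With indexes computed this way, all three files in the entry would be candidates for splitting, rather than a fixed pair of positions.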

test_S1_L001_I1_001.fastq.gz test_S1_L001_R1_001.fastq.gz test_S1_L001_R2_001.fastq.gz .nextflow.log

(EDIT: Updated with possible cause.)

marcodelapierre commented 7 months ago

@ewels do you have any comments here? Thank you

ewels commented 7 months ago

Agree that it would be nice to be able to generate a tuple of an arbitrary number of files if possible (just taking the number of elements in the squiggly brackets)