nf-core / modules

Repository to host tool-specific module files for the Nextflow DSL2 community!
https://nf-co.re/modules
MIT License

fastq outputs are missed during cellranger mkfastq due to directory structure #6189

Open julicudini opened 1 month ago

julicudini commented 1 month ago

Have you checked the docs?

Description of the bug

I noticed when running the nf-core/demultiplex pipeline with cellranger_mkfastq as the demultiplexer module that the outputs do not match what I get when I run cellranger mkfastq (same version, and same version of bcl2fastq) on its own, independent of Nextflow. I tracked this down to the cellranger_mkfastq module rather than the demultiplex pipeline. What is missing is the outs/fastq_path/Sample_Project dir, which contains the sample fastqs; the outs/fastq_path dir itself only contains the Undetermined fastqs (explained here):

This example was produced with a sample sheet that included tiny-bcl as the Sample_Project, so the directory containing the sample folders is called tiny-bcl. If a Sample_Project was not specified, or if a simple layout CSV file was used (which does not have a Sample_Project column), the directory containing the sample folders would be named according to the flow cell ID instead.

ls -l tiny-bcl/outs/fastq_path/

drwxr-xr-x 3 jdoe jdoe 3 Nov 14 12:26 Reports
drwxr-xr-x 2 jdoe jdoe 8 Nov 14 12:26 Stats
drwxr-xr-x 3 jdoe jdoe 3 Nov 14 12:26 tiny-bcl (note this is the key dir where sample fastqs are)
-rw-r--r-- 1 jdoe jdoe 20615106 Nov 14 12:26 Undetermined_S0_L001_I1_001.fastq.gz
-rw-r--r-- 1 jdoe jdoe 20615106 Nov 14 12:26 Undetermined_S0_L001_I2_001.fastq.gz
-rw-r--r-- 1 jdoe jdoe 51499694 Nov 14 12:26 Undetermined_S0_L001_R1_001.fastq.gz
-rw-r--r-- 1 jdoe jdoe 152692701 Nov 14 12:26 Undetermined_S0_L001_R2_001.fastq.gz

This means that the line defining the fastq output of cellranger mkfastq in main.nf only captures the Undetermined files and misses the actual sample files, because its glob matches only files directly inside outs/fastq_path. Currently the line reads
path "**/outs/fastq_path/*.fastq.gz", emit: fastq
and I was able to fix this by changing the line to
path("*_outs/outs/fastq_path/{*.fastq.gz,**/*.fastq.gz}"), emit: fastq
which instead captures fastq files both directly in fastq_path and in any nested dir. I think this is better than trying to infer the flowcell id in order to search a directory that may or may not be created. I have a PR that I can submit to make this small fix.
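To illustrate why the depth of the glob matters, here is a minimal sketch in Python that mocks the directory shape from the listing above and compares a top-level-only match with a match that also descends into subdirectories. It uses Python's pathlib as a stand-in; it does not run Nextflow, and Python's `**` handling may differ in detail from Nextflow's glob matcher. The run directory name `run1_outs`, the sample folder `sample1`, and the sample file names are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; only the directory shape follows the listing above.
UNDETERMINED = [
    "Undetermined_S0_L001_R1_001.fastq.gz",
    "Undetermined_S0_L001_R2_001.fastq.gz",
]
SAMPLE_FILES = [
    "tiny-bcl/sample1/sample1_S1_L001_R1_001.fastq.gz",
    "tiny-bcl/sample1/sample1_S1_L001_R2_001.fastq.gz",
]


def build_layout(root: Path) -> Path:
    """Create empty placeholder files mimicking <id>_outs/outs/fastq_path."""
    fastq_path = root / "run1_outs" / "outs" / "fastq_path"  # assumed run dir name
    for rel in UNDETERMINED + SAMPLE_FILES:
        target = fastq_path / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return fastq_path


def top_level_only(fastq_path: Path) -> list[str]:
    """Files directly inside fastq_path, analogous to fastq_path/*.fastq.gz."""
    return sorted(p.name for p in fastq_path.glob("*.fastq.gz"))


def top_level_and_nested(fastq_path: Path) -> list[str]:
    """Files at any depth under fastq_path, including the top level."""
    return sorted(p.name for p in fastq_path.rglob("*.fastq.gz"))


with tempfile.TemporaryDirectory() as tmp:
    fastq_path = build_layout(Path(tmp))
    shallow = top_level_only(fastq_path)
    deep = top_level_and_nested(fastq_path)
    print("top level only:    ", shallow)
    print("top level + nested:", deep)
```

In this mock layout the shallow match returns only the Undetermined files, while the recursive match also returns the per-sample files under the Sample_Project directory, which mirrors the behaviour described in the report.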

Command used and terminal output

No response

Relevant files

No response

System information

No response

apeltzer commented 1 month ago

This should also be filed as a bug report in demultiplex; we might have to do a 1.5.1 release to fix this.

apeltzer commented 1 month ago

cc @atrigila / @nschcolnicov

nschcolnicov commented 1 month ago

@julicudini Thank you very much for the detailed description and proposed solution. I'll fix this ASAP.