nf-core / demultiplex

Demultiplexing pipeline for sequencing data
https://nf-co.re/demultiplex
MIT License

Fastq files missing from output / report, 10x single cell #135

Closed rifius closed 2 months ago

rifius commented 1 year ago

### Description of the bug

I ran the bclconvert demux on a 10x single-cell run, and some FASTQ files are not copied or linked to the output folder, their md5sums are not computed, and the falco and fastp processes are not launched on them. As a consequence, they are also missing from the QC reports.

Setup:

[BCLConvert_Settings]
CreateFastqForIndexReads,0

[BCLConvert_Data]
Sample_ID,index
CAM17_RoGr,ACGTCCCT
CAM17_RoGr,CGCATGTG
CAM17_RoGr,GAAGGAAC
CAM17_RoGr,TTTCATGA
CAM22_JoMc,AACCGTAA
CAM22_JoMc,CTAAACGG
CAM22_JoMc,GGTTTACT
CAM22_JoMc,TCGGCGTC
CAM27_ElBr,AACGTCAA
....

- Nextflow `demultiplex` sample sheet:

id,samplesheet,lane,flowcell
A00999,/full/path/to/bclcvt-sampleindex.csv,,/full/bcl/data/path

As per the docs, when lane is not given, all lanes will be processed.
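To make the layout of the BCL Convert sample sheet above concrete, here is a minimal sketch that groups the index sequences by `Sample_ID`. It assumes the sheet is plain CSV with bracketed section headers, and it uses only the rows visible in this report (the sheet is truncated there). The function name `indices_by_sample` is hypothetical and not part of the pipeline.

```python
import csv

# Only the rows quoted in the report; the real sheet continues.
SHEET = """[BCLConvert_Settings]
CreateFastqForIndexReads,0

[BCLConvert_Data]
Sample_ID,index
CAM17_RoGr,ACGTCCCT
CAM17_RoGr,CGCATGTG
CAM17_RoGr,GAAGGAAC
CAM17_RoGr,TTTCATGA
CAM22_JoMc,AACCGTAA
CAM22_JoMc,CTAAACGG
CAM22_JoMc,GGTTTACT
CAM22_JoMc,TCGGCGTC
CAM27_ElBr,AACGTCAA
"""


def indices_by_sample(sheet_text):
    """Return {Sample_ID: [index, ...]} from the [BCLConvert_Data] section."""
    in_data = False
    data_lines = []
    for raw in sheet_text.splitlines():
        # Some exported sheets pad section headers with trailing commas.
        line = raw.strip().rstrip(",")
        if line.startswith("["):
            in_data = line == "[BCLConvert_Data]"
            continue
        if in_data and line:
            data_lines.append(line)

    grouped = {}
    # The first collected line is the CSV header (Sample_ID,index).
    for row in csv.DictReader(data_lines):
        grouped.setdefault(row["Sample_ID"], []).append(row["index"])
    return grouped


grouped = indices_by_sample(SHEET)
for sample, indices in grouped.items():
    print(sample, len(indices))
```

Each fully listed sample carries four index sequences, which is why one `Sample_ID` appears on several rows.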

### Results

The Nextflow run completes successfully, with no errors listed. However, only 21 of the 31 samples are linked in the output directory and listed in the MultiQC report.

With `-dump-channels`, the output of the `BCLCONVERT` module tagged as `DEMULTIPLEX::Demultiplexed Fastq` contains 84 items (that is: 21 samples by 4 lanes).
The working folder of the `BCLCONVERT` task contains all 256 expected `.fastq.gz` files (that is: 31 samples x 4 lanes x R1/R2, plus 8 `Undetermined` files: 4 lanes x R1/R2), which means demultiplexing ran OK.
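The counts quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers in the report. The variable names are illustrative, and it assumes each channel item corresponds to one sample/lane combination, as the "21 samples by 4 lanes" note suggests.

```python
samples_total = 31      # samples in the BCL Convert sample sheet
samples_published = 21  # samples that reached the output dir / MultiQC
lanes = 4
reads = 2               # R1 and R2

# FASTQ files expected in the BCLCONVERT work dir, including Undetermined.
expected_fastq = samples_total * lanes * reads + lanes * reads

# Items seen in the dumped channel (one per sample and lane).
published_items = samples_published * lanes

# Items that would be present if nothing had been dropped.
missing_items = (samples_total - samples_published) * lanes

print(expected_fastq, published_items, missing_items)
```

The numbers agree with the report: 256 files in the work dir and 84 channel items, leaving 40 sample/lane items (10 samples) unaccounted for.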

I can't figure out what could be causing this behaviour, or how to quickly troubleshoot.  I will manually add the 10 missing sample links to the output folder and continue with my downstream analysis, but reporting of this stage is incomplete until this is solved.

On different runs, it is always the same samples that are missing (for instance, sample `CAM17_RoGr` above is always missing from output).
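One quick way to troubleshoot this kind of gap is to compare the sample names found in the `BCLCONVERT` work directory with those in the published output. The sketch below assumes the usual Illumina naming scheme `<SampleID>_S<n>_L<lane>_R<read>_001.fastq.gz`. The helper names and the `_S<n>` numbers are hypothetical, and the report does not say which samples were published, so the `published` list is made up for illustration.

```python
import re

# Illumina-style FASTQ name: <SampleID>_S<n>_L<lane>_R<read>_001.fastq.gz
FASTQ_RE = re.compile(r"^(?P<sample>.+)_S\d+_L\d{3}_R[12]_001\.fastq\.gz$")


def sample_ids(filenames):
    """Collect sample IDs from FASTQ file names, skipping Undetermined."""
    ids = set()
    for name in filenames:
        match = FASTQ_RE.match(name)
        if match and match.group("sample") != "Undetermined":
            ids.add(match.group("sample"))
    return ids


def missing_samples(work_files, published_files):
    """Samples present in the work dir but absent from the output dir."""
    return sorted(sample_ids(work_files) - sample_ids(published_files))


# Illustrative file names only; the S numbers are invented.
work = [
    "CAM17_RoGr_S1_L001_R1_001.fastq.gz",
    "CAM17_RoGr_S1_L001_R2_001.fastq.gz",
    "CAM22_JoMc_S2_L001_R1_001.fastq.gz",
    "CAM22_JoMc_S2_L001_R2_001.fastq.gz",
    "Undetermined_S0_L001_R1_001.fastq.gz",
]
published = [
    "CAM22_JoMc_S2_L001_R1_001.fastq.gz",
    "CAM22_JoMc_S2_L001_R2_001.fastq.gz",
]

missing = missing_samples(work, published)
print(missing)
```

In a real run, the two lists could be built with something like `[p.name for p in pathlib.Path(some_dir).rglob("*.fastq.gz")]`, where `some_dir` is the work directory or the `--outdir` folder.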

### Command used and terminal output

```console
$ nextflow run nf-core/demultiplex --input nf-samplesheet.csv --outdir DMUX --demultiplexer bclconvert --trim_fastq false -bg -profile podman -dump-channels
```

(Also tried with `-resume`.)

Cleaned .nextflow.log output included below.

### Relevant files

nf.log.gz

### System information

Version: 23.04.1 build 5866
Created: 15-04-2023 06:51 UTC (16:51 AEDT)
System: Linux 6.3.12-200.fc38.x86_64
Runtime: Groovy 3.0.16 on OpenJDK 64-Bit Server VM 17.0.6+10
Encoding: UTF-8 (UTF-8)
Process: 2401714@my-machine [10.x.x.x]
CPUs: 32 - Mem: 503.3 GB (6.2 GB) - Swap: 0 (0)

nf-core/demultiplex v1.3.2-g67b8465

Container engine: podman rootless
OS: Fedora Core OS

edmundmiller commented 1 year ago

@matthdsm any thoughts? I'm wondering if it's the publishing and the naming of the sample with an _ that's the issue.

matthdsm commented 1 year ago

I think it's the sample naming that's the issue here. The demux modules glob on `**[!Undetermined]_S*_R?_00?.fastq.gz` to find the output FASTQs, and the names with `_R*` cause some kind of collision.

@rifius, could you post an ls of the bclconvert work dir so we can check out the filenames?

https://github.com/nf-core/modules/blob/97b7dc798a002688b6304a453da932b2144727b1/modules/nf-core/bclconvert/main.nf#L11
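One thing that can be checked without the directory listing is how the glob itself behaves. In common glob dialects, `[!Undetermined]` is a negated character class: it matches any single character that is not one of `U`, `n`, `d`, `e`, `t`, `r`, `m`, `i`. It does not mean "anything except the word Undetermined". The sketch below uses Python's `fnmatch` as a stand-in. Treating it as equivalent to the glob matcher Nextflow uses is an assumption, and the file names follow the Illumina naming scheme with invented `_S<n>` numbers. This is a possible explanation, not something confirmed in this thread.

```python
from fnmatch import fnmatchcase

# Glob quoted in the comment above.
PATTERN = "**[!Undetermined]_S*_R?_00?.fastq.gz"

# Hypothetical file names built from sample IDs in the report.
names = [
    "CAM17_RoGr_S1_L001_R1_001.fastq.gz",    # ID ends in 'r'
    "CAM22_JoMc_S2_L001_R1_001.fastq.gz",    # ID ends in 'c'
    "Undetermined_S0_L001_R1_001.fastq.gz",  # ends in 'd'
]

# The bracket expression consumes exactly one character: the one
# immediately before "_S". If that character is in the set, no match.
matched = {name: fnmatchcase(name, PATTERN) for name in names}

for name, ok in matched.items():
    print(f"{'match' if ok else 'skip ':5}  {name}")
```

Under these assumptions, any sample ID whose last character is one of those eight letters would be skipped along with the `Undetermined` files. That would be consistent with `CAM17_RoGr` always being missing, but only the full list of missing samples can confirm it.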

apeltzer commented 2 months ago

Another option: 10X mkfastq is now available on dev and will soon be in 1.5.0 too.