nf-core / scrnaseq

A single-cell RNAseq pipeline for 10X genomics data
https://nf-co.re/scrnaseq
MIT License

cellranger multi => MTX_TO_H5AD: input file name collision #383

Open nick-youngblut opened 1 month ago

nick-youngblut commented 1 month ago

Description of the bug

I am running Cell Ranger (`--aligner cellrangermulti`) on GEX-only samples. Each sample has multiple probe barcodes, but they all map to the same sample (see the samples and sample barcodes tables below). This results in an input file name collision at the MTX_TO_H5AD step. I haven't been able to determine why from the pipeline code.

Command used and terminal output

The command:

nextflow run main.nf \
  -ansi-log false \
  -profile singularity \
  -process.executor slurm \
  -process.queue cpu_batch \
  -work-dir /scratch/$(id -gn)/$(whoami)/nextflow-work/scrnaseq \
  --aligner cellrangermulti \
  --skip_cellrangermulti_vdjref \
  --skip_emptydrops \
  --gex_frna_probe_set ${PROBE_REF_DIR}/Chromium_Human_Transcriptome_Probe_Set_v1.0.1_GRCh38-2020-A.csv \
  --cellranger_index ${GENOME_REF_DIR}/refdata-gex-GRCh38-2020-A/ \
  --cellranger_multi_barcodes tmp/sample_barcodes.csv \
  --input tmp/samples.csv \
  --outdir tmp/scrnaseq_output

The terminal output:

ERROR ~ Error executing process > 'NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (2)'

Caused by:
  Process `NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD` input file name collision -- There are multiple input files for each of the following file names: barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz

Relevant files

The samples table (full paths removed for clarity):

sample,fastq_1,fastq_2,feature_type
20240905_ADI_batch3_flex_1,20240905_ADI_batch3_flex_1_S1_L001_R1_001.fastq.gz,20240905_ADI_batch3_flex_1_S1_L001_R2_001.fastq.gz,gex
20240905_ADI_batch3_flex_2,20240905_ADI_batch3_flex_2_S1_L001_R1_001.fastq.gz,20240905_ADI_batch3_flex_2_S1_L001_R2_001.fastq.gz,gex
20240925_ADI_batch5_flex_1,/20240925_ADI_batch5_flex_1_R1_001.fastq.gz,/20240925_ADI_batch5_flex_1_R2_001.fastq.gz,gex
20240925_ADI_batch5_flex_2,/20240925_ADI_batch5_flex_2_R1_001.fastq.gz,/20240925_ADI_batch5_flex_2_R2_001.fastq.gz,gex
20240925_ADI_batch5_flex_3,/20240925_ADI_batch5_flex_3_R1_001.fastq.gz,/20240925_ADI_batch5_flex_3_R2_001.fastq.gz,gex
20240925_ADI_batch5_flex_4,/20240925_ADI_batch5_flex_4_R1_001.fastq.gz,/20240925_ADI_batch5_flex_4_R2_001.fastq.gz,gex

The sample barcodes table:

sample,multiplexed_sample_id,probe_barcode_ids,cmo_ids,description
20240905_ADI_batch3_flex_1,20240905_ADI_batch3_flex_1,BC001|BC002|BC003|BC004,,
20240905_ADI_batch3_flex_2,20240905_ADI_batch3_flex_2,BC001|BC002|BC003|BC004,,
20240925_ADI_batch5_flex_1,20240925_ADI_batch5_flex_1,BC001|BC002|BC003|BC004,,
20240925_ADI_batch5_flex_2,20240925_ADI_batch5_flex_2,BC001|BC002|BC003|BC004,,
20240925_ADI_batch5_flex_3,20240925_ADI_batch5_flex_3,BC001|BC002|BC003|BC004,,
20240925_ADI_batch5_flex_4,20240925_ADI_batch5_flex_4,BC001|BC002|BC003|BC004,,

System information

Nextflow: 24.04.4.5917
Hardware: HPC
Executor: SLURM
Engine: Apptainer
OS: Ubuntu
Pipeline: 2.7.1

nick-youngblut commented 1 month ago

Adding mtx_matrices.view() to MTX_CONVERSION shows that all of the samples have the same file names, which is causing the name collision:

[
  [id:20240925_ADI_batch5_flex_3, ...],
  [
    /path/to/sample1/barcodes.tsv.gz,
    /path/to/sample1/features.tsv.gz,
    /path/to/sample1/matrix.mtx.gz,
    ...
  ]
]
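For anyone debugging this, here is a minimal Python sketch (not pipeline code) of the condition behind the error. Nextflow stages a task's input files into its work directory by file name, so two inputs that share a base name cannot both be staged. The function name `find_name_collisions` and the `sample2` paths are hypothetical and made up for illustration. Only the `sample1` paths mirror the output above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(paths):
    """Return {basename: [paths]} for every basename shared by more than one path."""
    by_name = defaultdict(list)
    for p in paths:
        by_name[PurePosixPath(p).name].append(p)
    return {name: ps for name, ps in by_name.items() if len(ps) > 1}


# Hypothetical input list: the sample2 paths are an assumption about what
# a second set of matrices arriving in the same task might look like.
staged = [
    "/path/to/sample1/barcodes.tsv.gz",
    "/path/to/sample1/features.tsv.gz",
    "/path/to/sample1/matrix.mtx.gz",
    "/path/to/sample2/barcodes.tsv.gz",
    "/path/to/sample2/features.tsv.gz",
    "/path/to/sample2/matrix.mtx.gz",
]

collisions = find_name_collisions(staged)
for name in sorted(collisions):
    # Prints each colliding basename and how many inputs share it.
    print(name, len(collisions[name]))
```

If the list handed to one MTX_TO_H5AD task contains matrices from more than one output directory, this kind of check would flag exactly the three names in the error message.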

nick-youngblut commented 1 month ago

It might help if the usage docs explained how to specify multiple barcodes per sample in the sample barcodes table: https://nf-co.re/scrnaseq/2.7.1/docs/usage (section "Additional samplesheet for multiplexed samples").

For example:

sample,multiplexed_sample_id,probe_barcode_ids,cmo_ids,description
20240905_ADI_batch3_flex_1,20240905_ADI_batch3_flex_1,BC001|BC002|BC003|BC004,,

From the 10X docs:

If multiple Probe Barcodes were used for a sample, separate IDs with a pipe (e.g., BC001|BC002).
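To make the pipe convention concrete, here is a small Python sketch of how such a row could be read. It is only an illustration of the format, not how the pipeline or Cell Ranger parses the table. The function name `parse_probe_barcodes` is hypothetical, and the CSV text is the example row from this comment.

```python
import csv
import io

# Example row copied from the sample barcodes table above.
BARCODES_CSV = """sample,multiplexed_sample_id,probe_barcode_ids,cmo_ids,description
20240905_ADI_batch3_flex_1,20240905_ADI_batch3_flex_1,BC001|BC002|BC003|BC004,,
"""


def parse_probe_barcodes(text):
    """Map (sample, multiplexed_sample_id) to its list of probe barcode IDs."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Multiple probe barcodes for one sample are separated by a pipe.
        ids = [b for b in row["probe_barcode_ids"].split("|") if b]
        result[(row["sample"], row["multiplexed_sample_id"])] = ids
    return result


for key, ids in parse_probe_barcodes(BARCODES_CSV).items():
    print(key, ids)
```

Read this way, the row describes one multiplexed sample carrying four probe barcodes, rather than four separate samples.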