incorrect merging (CAT_FASTQ) of different samples

ojziff commented 2 years ago

Description of the bug

The CAT_FASTQ process is incorrectly merging different samples that share the same prefix but have different suffixes in the sample column of samplesheet.csv. For example, CONTROL_1 and CONTROL_2 are incorrectly merged, whereas TREATMENT_1 and CONTROL_1 are not. When I run nf-core/rnaseq with the same samplesheet.csv, this doesn't happen. My samplesheet contains 18 unique samples that should not be merged, but CAT_FASTQ is merging them into 4 samples: c9orf72, ctrl, fus and iso.

I think this is caused by the split on _ that is applied to meta.id: https://github.com/nf-core/rnavar/blob/3924aac34ce715414fad953f41d98e98d0981fb8/workflows/rnavar.nf#L142

Presumably the meta.id.split was copied over from an older version of the rnaseq pipeline, where it has since been removed. This is the equivalent in the latest rnaseq pipeline, for comparison: https://github.com/nf-core/rnaseq/blob/89bf536ce4faa98b4d50a8ec0a0343780bc62e0a/workflows/rnaseq.nf#L192
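To illustrate the suspected mechanism, here is a minimal Python sketch (not the pipeline's actual Groovy code) of what happens when the last underscore-delimited part of each sample ID is dropped before grouping. The function names `strip_last_token` and `group_samples` are hypothetical, and the sample IDs are the examples from the description above, not the real samplesheet.

```python
from collections import defaultdict


def strip_last_token(sample_id: str) -> str:
    """Drop the final underscore-delimited token, e.g. 'CONTROL_1' -> 'CONTROL'.

    Assumption: this mimics the suspected behaviour of the split on '_'.
    An ID with no underscore becomes an empty string in this sketch.
    """
    parts = sample_id.split("_")
    return "_".join(parts[:-1])


def group_samples(sample_ids, strip_suffix):
    """Group sample IDs by the key that would be used for merging."""
    groups = defaultdict(list)
    for sid in sample_ids:
        key = strip_last_token(sid) if strip_suffix else sid
        groups[key].append(sid)
    return dict(groups)


# Example IDs taken from the description above (illustrative only).
samples = ["CONTROL_1", "CONTROL_2", "TREATMENT_1"]

merged = group_samples(samples, strip_suffix=True)
separate = group_samples(samples, strip_suffix=False)

print(merged)    # CONTROL_1 and CONTROL_2 share the key 'CONTROL'
print(separate)  # each sample keeps its own key
```

With the suffix stripped, CONTROL_1 and CONTROL_2 fall into the same group while TREATMENT_1 stays on its own, which matches the merging I am seeing. Grouping on the full ID keeps all three samples separate.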

Command used and terminal output

nextflow run nf-core/rnavar \
--input samplesheet.csv \
--outdir rnavar \
-c rnavar.config \
--igenomes_ignore true \
--read_length 150 \
--email oliver.ziff@crick.ac.uk -profile crick -r dev

output

[28/8f289e] process > NFCORE_RNAVAR:RNAVAR:PREPARE_GENOME:GTF2BED (genes.gtf)                                [100%] 1 of 1, cached: 1 ✔
[1b/cf281c] process > NFCORE_RNAVAR:RNAVAR:PREPARE_GENOME:STAR_GENOMEGENERATE (genome.fa)                    [100%] 1 of 1, cached: 1 ✔
executor >  slurm (6)
[28/8f289e] process > NFCORE_RNAVAR:RNAVAR:PREPARE_GENOME:GTF2BED (genes.gtf)                                [100%] 1 of 1, cached: 1 ✔
[1b/cf281c] process > NFCORE_RNAVAR:RNAVAR:PREPARE_GENOME:STAR_GENOMEGENERATE (genome.fa)                    [100%] 1 of 1, cached: 1 ✔
[90/5cf789] process > NFCORE_RNAVAR:RNAVAR:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)                   [100%] 1 of 1, cached: 1 ✔
[8a/a5098c] process > NFCORE_RNAVAR:RNAVAR:CAT_FASTQ (ctrl)                                                  [100%] 4 of 4, cached: 4 ✔

Relevant files

samplesheet.csv

rnavar.config:

params {
    bwa                   = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/version0.6.0/'
    chr_dir               = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/Chromosomes"
    dbsnp                 = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Homo_sapiens_assembly38.dbsnp138.vcf'
    dbsnp_tbi             = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Homo_sapiens_assembly38.dbsnp138.vcf.idx'
    known_indels          = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
    known_indels_tbi      = '/camp/lab/luscomben/home/users/ziffo/genomes/rnavar/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi'
    dict                  = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.dict"
    fasta                 = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
    fasta_fai             = "/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa.fai"
    snpeff_db             = 'GRCh38.99'
    snpeff_genome         = 'GRCh38'
    vep_cache_version     = '104'
    vep_genome            = 'GRCh38'
    vep_species           = 'homo_sapiens'
    bowtie2               = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/'
    star                  = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/'
    bismark               = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/'
    gtf                   = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf'
    bed12                 = '/camp/svc/reference/Genomics/aws-igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed'
}

System information

N E X T F L O W ~ version 21.10.3
HPC at Crick
Executor: slurm
Container: singularity
OS: Linux
version: dev

praveenraj2018 commented 2 years ago

This needs to be discussed internally within the nf-core team. With the default logic, the last part of the sample name is stripped off. I believe this was intended to combine technical replicates for RNA-seq, but it can fail for use cases like yours. Let me see what the core team's response is.

praveenraj2018 commented 2 years ago

Fixed in PR - https://github.com/nf-core/rnavar/pull/53

ojziff commented 2 years ago

I can confirm this is now fixed in the updated dev branch! Thanks very much @praveenraj2018
