nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
404 stars 405 forks source link

MarkDuplicates step fails with UMIs from read names and 4 lanes #802

Open sci-kai opened 2 years ago

sci-kai commented 2 years ago

Description of the bug

I am using Sarek with a configuration that reads UMIs from the FASTQ read names. The workflow fails at the "MarkDuplicates" step with the error message: htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGHD789:806050 According to the documentation of this error, it can result from reads having the same read name in the BAM file. I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. During that process, the read names are changed to a scheme containing the sample name HD789 and a continuous number. When BAMs for all four lanes are merged, this results into different reads having the same name (4x readpairs since I have 4 lanes) that results into this error in the MarkDuplicates step.

In general. I was wondering if it is intended to call UMI consensus for each lane separately or whether it should be called on a merged file, as the same UMI could be distributed across multiple lanes.

Command used and terminal output

nextflow run nf-core-sarek-3.0.2/workflow/ --input run1_samplesheet.csv --outdir run1_full --genome GATK.GRCh38 --igenomes_base igenomes/references -profile singularity -c TSO500.config --tools freebayes,mpileup,mutect2,strelka,manta,tiddit,merge -resume --umi_read_structure NA --group_by_umi_strategy paired --snpeff_cache snpeff/ --vep_cache vep/

Error executing process > 'NFCORE_SAREK:SAREK:MARKDUPLICATES:GATK4_MARKDUPLICATES (HD789)'

Caused by:
  Process `NFCORE_SAREK:SAREK:MARKDUPLICATES:GATK4_MARKDUPLICATES (HD789)` terminated with an error exit status (3)

Command executed:

  gatk --java-options "-Xmx30g" MarkDuplicates \
      --INPUT HD789-L002.0006.bam --INPUT HD789-L002.0002.bam --INPUT HD789-L002.0003.bam --INPUT HD789-L002.0001.bam --INPUT HD789-L002.0005.bam --INPUT HD789-L002.0004.bam --INPUT HD789-L004.0001.bam --INPUT HD789-L004.0003.bam --INPUT HD789-L004.0006.bam --INPUT HD789-L004.0004.bam --INPUT HD789-L004.0002.bam --INPUT HD789-L004.0005.bam --INPUT HD789-L001.0006.bam --INPUT HD789-L001.0003.bam --INPUT HD789-L001.0002.bam --INPUT HD789-L001.0005.bam --INPUT HD789-L001.0004.bam --INPUT HD789-L001.0001.bam --INPUT HD789-L003.0001.bam --INPUT HD789-L003.0002.bam --INPUT HD789-L003.0004.bam --INPUT HD789-L003.0003.bam --INPUT HD789-L003.0006.bam --INPUT HD789-L003.0005.bam \
      --OUTPUT HD789.md.bam \
      --METRICS_FILE HD789.md.metrics \
      --TMP_DIR . \
      -REMOVE_DUPLICATES false -VALIDATION_STRINGENCY LENIENT --CREATE_INDEX true

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:MARKDUPLICATES:GATK4_MARKDUPLICATES":
      gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
  END_VERSIONS

Command exit status:
  3

Command output:
  (empty)

Command error:
  INFO  2022-10-18 10:53:55 MarkDuplicates  Tracking 9051 as yet unmatched pairs. 546 records in RAM.
  INFO  2022-10-18 10:53:59 MarkDuplicates  Read     3,000,000 records.  Elapsed time: 00:00:12s.  Time for last 1,000,000:    3s.  Last read position: chr1:113,104,534
  INFO  2022-10-18 10:53:59 MarkDuplicates  Tracking 16635 as yet unmatched pairs. 708 records in RAM.
  INFO  2022-10-18 10:54:02 MarkDuplicates  Read     4,000,000 records.  Elapsed time: 00:00:16s.  Time for last 1,000,000:    3s.  Last read position: chr1:156,880,088
  INFO  2022-10-18 10:54:02 MarkDuplicates  Tracking 20851 as yet unmatched pairs. 895 records in RAM.
  INFO  2022-10-18 10:54:07 MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:21s.  Time for last 1,000,000:    4s.  Last read position: chr1:193,212,181
  INFO  2022-10-18 10:54:07 MarkDup Elapsed time: 00:00:16s.  Time for last 1,000,000:    3s.  Last read position: chr1:156,880,088
  INFO  2022-10-18 10:54:02 MarkDuplicates  Tracking 20851 as yet unmatched pairs. 895 records in RAM.
  INFO  2022-10-18 10:54:07 MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:21s.  Time for last 1,000,000:    4s.  Last read position: chr1:193,212,181
  INFO  2022-10-18 10:54:07 MarkDuplicates  Tracking 25749 as yet unmatched pairs. 558 records in RAM.
  INFO  2022-10-18 10:54:11 MarkDuplicates  Read     6,000,000 records.  Elapsed time: 00:00:25s.  Time for last 1,000,000:    4s.  Last read position: chr1:231,436,969
  INFO  2022-10-18 10:54:11 MarkDuplicates  Tracking 30831 as yet unmatched pairs. 193 records in RAM.
  INFO  2022-10-18 10:54:17 MarkDuplicates  Read     7,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    5s.  Last read position: chr2:15,946,332
  INFO  2022-10-18 10:54:17 MarkDuplicates  Tracking 39249 as yet unmatched pairs. 6622 records in RAM.
  INFO  2022-10-18 10:54:23 MarkDuplicates  Read     8,000,000 records.  Elapsed time: 00:00:37s.  Time for last 1,000,000:    5s.  Last read position: chr2:29,511,003
  INFO  2022-10-18 10:54:23 MarkDuplicates  Tracking 43958 as yet unmatched pairs. 8035 records in RAM.
  INFO  2022-10-18 10:54:28 MarkDuplicates  Read     9,000,000 records.  Elapsed time: 00:00:42s.  Time for last 1,000,000:    5s.  Last read position: chr2:74,090,879
  INFO  2022-10-18 10:54:28 MarkDuplicates  Tracking 49514 as yet unmatched pairs. 2513 records in RAM.
  INFO  2022-10-18 10:54:32 MarkDuplicates  Read    10,000,000 records.  Elapsed time: 00:00:46s.  Time for last 1,000,000:    3s.  Last read position: chr2:113,223,748
  INFO  2022-10-18 10:54:32 MarkDuplicates  Tracking 54431 as yet unmatched pairs. 3001 records in RAM.
  INFO  2022-10-18 10:54:37 MarkDuplicates  Read    11,000,000 records.  Elapsed time: 00:00:51s.  Time for last 1,000,000:    4s.  Last read position: chr2:141,169,200
  INFO  2022-10-18 10:54:37 MarkDuplicates  Tracking 57016 as yet unmatched pairs. 1670 records in RAM.
  INFO  2022-10-18 10:54:40 MarkDuplicates  Read    12,000,000 records.  Elapsed time: 00:00:54s.  Time for last 1,000,000:    3s.  Last read position: chr2:211,630,484
  INFO  2022-10-18 10:54:40 MarkDuplicates  Tracking 63059 as yet unmatched pairs. 1454 records in RAM.
  INFO  2022-10-18 10:54:45 MarkDuplicates  Read    13,000,000 records.  Elapsed time: 00:00:58s.  Time for last 1,000,000:    4s.  Last read position: chr3:10,039,325
  INFO  2022-10-18 10:54:45 MarkDuplicates  Tracking 66844 as yet unmatched pairs. 5403 records in RAM.
  INFO  2022-10-18 10:54:49 MarkDuplicates  Read    14,000,000 records.  Elapsed time: 00:01:03s.  Time for last 1,000,000:    4s.  Last read position: chr3:24,097,025
  INFO  2022-10-18 10:54:49 MarkDuplicates  Tracking 69255 as yet unmatched pairs. 4605 records in RAM.
  INFO  2022-10-18 10:54:54 MarkDuplicates  Read    15,000,000 records.  Elapsed time: 00:01:07s.  Time for last 1,000,000:    4s.  Last read position: chr3:52,561,677
  INFO  2022-10-18 10:54:54 MarkDuplicates  Tracking 72825 as yet unmatched pairs. 4569 records in RAM.
  INFO  2022-10-18 10:54:58 MarkDuplicates  Read    16,000,000 records.  Elapsed time: 00:01:12s.  Time for last 1,000,000:    4s.  Last read position: chr3:122,076,905
  INFO  2022-10-18 10:54:58 MarkDuplicates  Tracking 76914 as yet unmatched pairs. 2482 records in RAM.
  INFO  2022-10-18 10:55:03 MarkDuplicates  Read    17,000,000 records.  Elapsed time: 00:01:16s.  Time for last 1,000,000:    4s.  Last read position: chr3:169,764,978
  INFO  2022-10-18 10:55:03 MarkDuplicates  Tracking 80808 as yet unmatched pairs. 1357 records in RAM.
  INFO  2022-10-18 10:55:09 MarkDuplicates  Read    18,000,000 records.  Elapsed time: 00:01:23s.  Time for last 1,000,000:    6s.  Last read position: chr3:195,674,026
  INFO  2022-10-18 10:55:09 MarkDuplicates  Tracking 83573 as yet unmatched pairs. 336 records in RAM.
  [Tue Oct 18 10:55:11 GMT 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 1.44 minutes.
  Runtime.totalMemory()=5460983808
  To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
  htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  3: RGHD789:1546605/A
    at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
    at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:558)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

Work dir:
  /home/kai/ukd/projects/TSO500/project-dev-TSO500-extwf-test/nxf_sarek/run1_full/work/0a/63585815a3434e8db1c0ed8675d2d8

Relevant files

My config file looks like this:

params {
  max_memory = 44.GB
  max_cpus = 6
  max_time = 72.h
  // usage of targeted panel
  wes = true
  intervals = "TSO500_manifest_GRCh38.bed"
  //keep some intermediate files for testing
  save_bam_mapped = true
  save_trimmed = true
  save_split = true
  save_reference = true
}

singularity {
  enabled = true
  cacheDir = "/home/kai/software/nextflow_singularity_cache"  
}

// extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
process {
    withName: 'FASTQTOBAM' {
        ext.args         = '--extract-umis-from-read-names'
    }
}

And my samplesheet for this one sample X like this:

patient,sample,lane,status,fastq_1,fastq_2
HD789,HD789,L001,1, HD789-DNA_S2_L001_R1_001.fastq.gz,HD789-DNA_S2_L001_R2_001.fastq.gz
HD789,HD789,L002,1, HD789-DNA_S2_L002_R1_001.fastq.gz,HD789-DNA_S2_L002_R2_001.fastq.gz
HD789,HD789,L003,1,HD789-DNA_S2_L003_R1_001.fastq.gz,HD789-DNA_S2_L003_R2_001.fastq.gz
HD789,HD789,L004,1,HD789-DNA_S2_L004_R1_001.fastq.gz,HD789-DNA_S2_L004_R2_001.fastq.gz

System information

Nextflow: 22.04.5 Hardware: Desktop Executor: local Container engine: Singularity OS: Ubuntu 22.04 Sarek: 3.0.2

FriederikeHanssen commented 2 years ago

Hi @sci-kai ! It might not be necessary to mark duplicates depending on your setup https://nf-co.re/sarek/3.0.2/usage#how-to-handle-umis . As to if the fastq files should be merged beforehand maybe @lescai can comment, please also note that the whole umi subworkflow is currently being overhauled to match the latest recommendations by fgbio

sci-kai commented 2 years ago

Thanks Friederike for the hint! I tried running it with "skip_tools = 'baserecalibrator,markduplicates'" but it reports an error at the samblaster step within the CREATE_UMI_CONSENSUS module: Can you help me with that error?

Error executing process > 'NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)'

Caused by:
  Process `NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)` terminated with an error exit status (1)

Command executed:

  samtools view -h  HD789-L004.bam | \
  samblaster -M --addMateTags | \
  samtools view  -Sb - >HD789-L004_unsorted_tagged.bam

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER":
      samblaster: $( samblaster -h 2>&1 | head -n 1 | sed 's/^samblaster: Version //' )
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  samblaster: Version 0.1.26
  samblaster: Inputting from stdin
  samblaster: Outputting to stdout
  samblaster: Loaded 3366 header sequence entries.
  samblaster: Can't find first and/or second of pair in sam block of length 2 for id: NB552271:97:HWFCVBGXL:4:11401:6367:1054:NCTAGAT+TCGGAGA
  samblaster:    At location: *:0
  samblaster:    Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
  samblaster: Marked           0 of         51 (0.000%) total read ids as duplicates using 3288k memory in 0.000S CPU seconds and 0S wall time.
  samblaster: Premature exit (return code 1).
lescai commented 2 years ago

apologies I'll try to have a look as soon as possible

FriederikeHanssen commented 1 year ago

Hi @sci-kai ! Apologies for the delayed response, I was on vacation. This issue is a bit older but looks like your read files might be unmated: https://github.com/GregoryFaust/samblaster/issues/37 . Could you try name sorting the read files as suggested in the linked issue?

sci-kai commented 1 year ago

Hi @FriederikeHanssen. I tried to sort the input FASTQ files by read name using zcat reads.fq.gz \ | paste - - - - \ | sort -k1,1 -S 3G \ | tr '\t' '\n' \ | gzip > sorted/reads.fq.gz But still gives the same error message. In the local work directory the BAM file is also not sorted: samtools stats HD200-L002.bam | grep "is sorted:": SN is sorted: 0 However, when I run the command within the local work directory with samblaster v0.1.24 and samtools v1.6 (locally installed on our HPC) it successfully finishes:

> samtools view -h  HD200-L002.bam | \
> samblaster -M --addMateTags | \
> samtools view  -Sb - >HD200-L002_unsorted_tagged.bam ```
samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Marked 5957935 of 10699537 (55.68%) read ids as duplicates using 85164k memory in 19.209S CPU seconds and 4M40S(280S) wall time.
FriederikeHanssen commented 1 year ago

Reading over the issue again I am confused now. So in the original run (where MD failed), samblaster ran fine? But now it fails?

sci-kai commented 1 year ago

Hi @FriederikeHanssen,

sorry for the confusion and delay. Between the two runs I switched parameters from the command line into a separate configuration file. When using the configuration as described in the first post with "--skip_tools baserecalibrator, markduplicates" the UMI pipelines now runs successfully. However, if I move most parameters from the command line into a separate config file, I got the error in the samblaster step. I do not know which parameter causes this error, but it shouldn't make a difference when parameters are specified in config file or command line, right?

command line:

nextflow run ../nf-core-sarek-3.0.2/workflow/ \
--igenomes_base ../db/igenomes/references \
-profile singularity \
-c config/TSO500-hilbert-UMI.config \
-resume 

config file:

params {  
  //Workflow arguments
  input = 'config/samplesheet.csv' 
  outdir = 'results/' 
  tools = 'freebayes,mpileup,mutect2,strelka,manta,tiddit,merge'
  skip_tools = 'baserecalibrator,markduplicates'
  step = 'mapping'

  //References and cache
  genome = 'GATK.GRCh38'
  vep_cache = '../db/vep'
  snpeff_cache = '../db/snpeff'

  // UMI handling
  umi_read_structure = 'NA'
  group_by_umi_strategy = 'adjacency'

  // usage of targeted panel
  wes = true
  intervals = "../db/TSO500_manifest_GRCh38.bed"

  // Resources
  max_memory = 200.GB
  max_cpus = 32
  max_time = 72.h  
}

process {
        executor = 'pbspro'
        module = 'Singularity/3.7.1'
        queuesize = 100
        clusterOptions = '-A "ZPM"'

        // extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
        withName: 'FASTQTOBAM' {
        ext.args         = '--extract-umis-from-read-names'
    }
}

error

Error executing process > 'NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)'

Caused by:
  Process `NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)` terminated with an error exit status (1)

Command executed:

  samtools view -h  HD789-L004.bam | \
  samblaster -M --addMateTags | \
  samtools view  -Sb - >HD789-L004_unsorted_tagged.bam

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER":
      samblaster: $( samblaster -h 2>&1 | head -n 1 | sed 's/^samblaster: Version //' )
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  samblaster: Version 0.1.26
  samblaster: Inputting from stdin
  samblaster: Outputting to stdout
  samblaster: Loaded 3366 header sequence entries.
  samblaster: Can't find first and/or second of pair in sam block of length 2 for id: NB552271:97:HWFCVBGXL:4:11401:6367:1054:NCTAGAT+TCGGAGA
  samblaster:    At location: *:0
  samblaster:    Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
  samblaster: Marked           0 of         51 (0.000%) total read ids as duplicates using 1932k memory in 0.002S CPU seconds and 0S wall time.
  samblaster: Premature exit (return code 1).
  INFO:    Cleaning up image...
FriederikeHanssen commented 1 year ago

Hi! Just looking at this issue again. Actually might be completely unrelated to UMI steps, but params should never be provided with a config file but with a params file instead. There is some prioritization magic, where params provided in a config are not overwritten.

sci-kai commented 1 year ago

Hi, sorry for the late answer on this topic. My error messages are now completely solved, as I skipped the markduplicates step and also use a proper params file for starting the workflow (thanks for the hint). So the only open question left here is about the UMI consensus calling across the four files for @lescai :

I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. [...] In general. I was wondering if it is intended to call UMI consensus for each lane separately or whether it should be called on a merged file, as the same UMI could be distributed across multiple lanes.

gianfilippo commented 6 months ago

Hi, I am having the same issue. I have FASTQ files spread on multiple lanes, but, I believe, originating from the same library prep. I think it is ok to combine the FASTQ before processing in this instance. In general, would it not be better to add another variable to the sample sheet, like library preparation, and add a merge step for FASTQ files from the same sample, same library preparation and different lanes ? Best

lauren-tjoeka commented 3 months ago

@sci-kai Hi! Did you arrive at a solution? For a sample I was thinking of merging the umi-consensus bam files from e.g. 4 lanes and re-doing the workflow by using FGBIO_GROUPREADSBYUMI and FGBIO_CALLMOLECULARCONSENSUSREADS.