Open sci-kai opened 2 years ago
Hi @sci-kai ! It might not be necessary to mark duplicates depending on your setup https://nf-co.re/sarek/3.0.2/usage#how-to-handle-umis . As to if the fastq files should be merged beforehand maybe @lescai can comment, please also note that the whole umi subworkflow is currently being overhauled to match the latest recommendations by fgbio
Thanks Friederike for the hint! I tried running it with "skip_tools = 'baserecalibrator,markduplicates'" but it reports an error at the samblaster step within the CREATE_UMI_CONSENSUS module: Can you help me with that error?
Error executing process > 'NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)'
Caused by:
Process `NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)` terminated with an error exit status (1)
Command executed:
samtools view -h HD789-L004.bam | \
samblaster -M --addMateTags | \
samtools view -Sb - >HD789-L004_unsorted_tagged.bam
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER":
samblaster: $( samblaster -h 2>&1 | head -n 1 | sed 's/^samblaster: Version //' )
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
samblaster: Version 0.1.26
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 2 for id: NB552271:97:HWFCVBGXL:4:11401:6367:1054:NCTAGAT+TCGGAGA
samblaster: At location: *:0
samblaster: Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Marked 0 of 51 (0.000%) total read ids as duplicates using 3288k memory in 0.000S CPU seconds and 0S wall time.
samblaster: Premature exit (return code 1).
apologies I'll try to have a look as soon as possible
Hi @sci-kai ! Apologies for the delayed response, I was on vacation. This issue is a bit older but looks like your read files might be unmated: https://github.com/GregoryFaust/samblaster/issues/37 . Could you try name sorting the read files as suggested in the linked issue?
Hi @FriederikeHanssen.
I tried to sort the input FASTQ files by read name using
zcat reads.fq.gz \ | paste - - - - \ | sort -k1,1 -S 3G \ | tr '\t' '\n' \ | gzip > sorted/reads.fq.gz
But still gives the same error message.
In the local work directory the BAM file is also not sorted:
samtools stats HD200-L002.bam | grep "is sorted:": SN is sorted: 0
However, when I run the command within the local work directory with samblaster v0.1.24 and samtools v1.6 (locally installed on our HPC) it successfully finishes:
> samtools view -h HD200-L002.bam | \
> samblaster -M --addMateTags | \
> samtools view -Sb - >HD200-L002_unsorted_tagged.bam ```
samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Marked 5957935 of 10699537 (55.68%) read ids as duplicates using 85164k memory in 19.209S CPU seconds and 4M40S(280S) wall time.
Reading over the issue again I am confused now. So in the original run (where MD failed), samblaster ran fine? But now it fails?
Hi @FriederikeHanssen,
sorry for the confusion and delay. Between the two runs I switched parameters from the command line into a separate configuration file. When using the configuration as described in the first post with "--skip_tools baserecalibrator, markduplicates" the UMI pipelines now runs successfully. However, if I move most parameters from the command line into a separate config file, I got the error in the samblaster step. I do not know which parameter causes this error, but it shouldn't make a difference when parameters are specified in config file or command line, right?
command line:
nextflow run ../nf-core-sarek-3.0.2/workflow/ \
--igenomes_base ../db/igenomes/references \
-profile singularity \
-c config/TSO500-hilbert-UMI.config \
-resume
config file:
params {
//Workflow arguments
input = 'config/samplesheet.csv'
outdir = 'results/'
tools = 'freebayes,mpileup,mutect2,strelka,manta,tiddit,merge'
skip_tools = 'baserecalibrator,markduplicates'
step = 'mapping'
//References and cache
genome = 'GATK.GRCh38'
vep_cache = '../db/vep'
snpeff_cache = '../db/snpeff'
// UMI handling
umi_read_structure = 'NA'
group_by_umi_strategy = 'adjacency'
// usage of targeted panel
wes = true
intervals = "../db/TSO500_manifest_GRCh38.bed"
// Resources
max_memory = 200.GB
max_cpus = 32
max_time = 72.h
}
process {
executor = 'pbspro'
module = 'Singularity/3.7.1'
queuesize = 100
clusterOptions = '-A "ZPM"'
// extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
withName: 'FASTQTOBAM' {
ext.args = '--extract-umis-from-read-names'
}
}
error
Error executing process > 'NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)'
Caused by:
Process `NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER (HD789-L004)` terminated with an error exit status (1)
Command executed:
samtools view -h HD789-L004.bam | \
samblaster -M --addMateTags | \
samtools view -Sb - >HD789-L004_unsorted_tagged.bam
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:CREATE_UMI_CONSENSUS:SAMBLASTER":
samblaster: $( samblaster -h 2>&1 | head -n 1 | sed 's/^samblaster: Version //' )
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
INFO: Converting SIF file to temporary sandbox...
samblaster: Version 0.1.26
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 3366 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 2 for id: NB552271:97:HWFCVBGXL:4:11401:6367:1054:NCTAGAT+TCGGAGA
samblaster: At location: *:0
samblaster: Are you sure the input is sorted by read ids?samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Marked 0 of 51 (0.000%) total read ids as duplicates using 1932k memory in 0.002S CPU seconds and 0S wall time.
samblaster: Premature exit (return code 1).
INFO: Cleaning up image...
Hi! Just looking at this issue again. Actually might be completely unrelated to UMI steps, but params should never be provided with a config file but with a params file instead. There is some prioritization magic, where params provided in a config are not overwritten.
Hi, sorry for the late answer on this topic.
My error messages are now completely solved, as I skipped the markduplicates
step and also use a proper params file for starting the workflow (thanks for the hint).
So the only open question left here is about the UMI consensus calling across the four files for @lescai :
I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. [...] In general. I was wondering if it is intended to call UMI consensus for each lane separately or whether it should be called on a merged file, as the same UMI could be distributed across multiple lanes.
Hi, I am having the same issue. I have FASTQ files spread on multiple lanes, but, I believe, originating from the same library prep. I think it is ok to combine the FASTQ before processing in this instance. In general, would it not be better to add another variable to the sample sheet, like library preparation, and add a merge step for FASTQ files from the same sample, same library preparation and different lanes ? Best
@sci-kai Hi! Did you arrive at a solution? For a sample I was thinking of merging the umi-consensus bam files from e.g. 4 lanes and re-doing the workflow by using FGBIO_GROUPREADSBYUMI and FGBIO_CALLMOLECULARCONSENSUSREADS.
Description of the bug
I am using Sarek with a configuration that reads UMIs from the FASTQ read names. The workflow fails at the "MarkDuplicates" step with the error message:
htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: RGHD789:806050
According to the documentation of this error, it can result from reads having the same read name in the BAM file. I have four lanes for one sample and realized that UMI consensus calling is performed for each lane separately. During that process, the read names are changed to a scheme containing the sample name HD789 and a continuous number. When BAMs for all four lanes are merged, this results into different reads having the same name (4x readpairs since I have 4 lanes) that results into this error in the MarkDuplicates step.In general. I was wondering if it is intended to call UMI consensus for each lane separately or whether it should be called on a merged file, as the same UMI could be distributed across multiple lanes.
Command used and terminal output
Relevant files
My config file looks like this:
And my samplesheet for this one sample X like this:
System information
Nextflow: 22.04.5 Hardware: Desktop Executor: local Container engine: Singularity OS: Ubuntu 22.04 Sarek: 3.0.2