Open berguner opened 1 year ago
What was your command line when you ran sarek?
Below is the part of my wrapper script setting up parameters and the command line:
PARAMS_YAML=./params.yml
echo "Creating parameters file: $PARAMS_YAML"
cat << EOF > $PARAMS_YAML
outdir : "${OUTDIR}"
tracedir : "${OUTDIR}/pipeline_info/"
input : "s3://input-bucket/preprocess_fastq_lists/${PROJECT}_${NGS_BATCH}.fastq_list.nf-core_sarek.csv"
multiqc_title : "${PROJECT} - ${NGS_BATCH}"
genome : "GATK.GRCh38"
nucleotides_per_second : 5000
split_fastq : 0
vep_cache : "s3://ref-bucket/ref/VEP/"
cf_chrom_len : "s3://out-bucket/ref/intervals/Homo_sapiens_assembly38.chromosomes_only_length.txt"
tools : "mutect2,cnvkit,vep,manta,tiddit"
skip_tools : "vcftools"
EOF
nextflow run nf-core/sarek -r 3.1.2 -c $NF_CONFIG -params-file $PARAMS_YAML -name "${PROJECT}_${NGS_BATCH}-${DATE_NOW}" -resume
Since you didn't use the `--wes` flag, cnvkit should have used "--method wgs --diagram --scatter", cf. https://github.com/nf-core/sarek/blob/c87f4eb694a7183e4f99c70fca0f1d4e91750b33/conf/modules/cnvkit.config#L42
Can you check the `.command.sh` from the cnvkit process in the dedicated work folder?
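One way to locate those scripts, as a minimal sketch: Nextflow writes a `.command.sh` into each task's work directory, so searching the work directory for scripts that mention `cnvkit.py` lists the candidates. The function name `find_cnvkit_scripts` and the default `work` path are assumptions for illustration.

```shell
#!/usr/bin/env bash
# List the .command.sh scripts under a Nextflow work directory that call cnvkit.py.
# The function name and the default "work" path are illustrative assumptions.
find_cnvkit_scripts() {
    local workdir="${1:-work}"
    # grep -l prints only the names of matching files; -F treats the pattern as a fixed string.
    find "$workdir" -type f -name '.command.sh' -exec grep -lF 'cnvkit.py' {} + | sort
}
```

Something like `find_cnvkit_scripts work | xargs grep -H -- '--method'` would then show which method each call was given.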
I checked and saw that `--method wgs` was used in the `cnvkit.py batch` call; however, the `reference.cnn` file has only `356` target regions. I dug a little deeper and found out that there are only `356` target regions in the `wgs_calling_regions_noseconds.hg38.bed` file, which is the default bed file for WGS runs.
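A check along these lines can reproduce such counts. This is only a sketch: it assumes the `.cnn` file is tab-separated with a header row and the bin label in the fourth column, and that antitarget bins carry the label "Antitarget" (or "Background" in older files). The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Count target and antitarget bins in a CNVkit reference (.cnn) file.
# Assumes a tab-separated file with a header row and the bin label in column 4,
# where antitarget bins are labelled "Antitarget" (or "Background" in older files).
count_cnn_bins() {
    awk -F'\t' '
        NR == 1 { next }                                          # skip the header row
        $4 == "Antitarget" || $4 == "Background" { anti++; next } # antitarget bin
        { target++ }                                              # everything else is a target bin
        END { printf "target=%d antitarget=%d\n", target, anti }
    ' "$1"
}

# Count the regions in a BED file, ignoring blank lines and track/browser/comment lines.
count_bed_regions() {
    grep -cvE '^(#|track|browser|[[:space:]]*$)' "$1"
}
```

Usage would be `count_cnn_bins cnvkit.reference.cnn` and `count_bed_regions wgs_calling_regions_noseconds.hg38.bed`.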
Below are the contents of the `.command.sh` scripts of the CNVkit tasks for one of the samples. Compared to the `cnvkit batch` steps listed in the documentation, I think the missing steps are `cnvkit.py access` and `cnvkit.py autobin`.
#!/bin/bash -euo pipefail
cnvkit.py \
antitarget \
wgs_calling_regions_noseconds.hg38.bed \
--output igenomes.antitarget.bed \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_ANTITARGET":
cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
#!/bin/bash -euo pipefail
cnvkit.py \
reference \
--fasta Homo_sapiens_assembly38.fasta \
--targets wgs_calling_regions_noseconds.hg38.bed \
--antitargets igenomes.antitarget.bed \
--output cnvkit.reference.cnn \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_REFERENCE":
cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai WGS2139_1.recal.cram -@ 2 -o WGS2139_1.recal.bam
cnvkit.py \
batch \
WGS2139_1.recal.bam \
\
\
--reference cnvkit.reference.cnn \
\
--processes 2 \
--method wgs --diagram --scatter
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_TUMOR_ONLY_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
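As a rough sketch of what adding the two missing steps could look like: `cnvkit.py access` and `cnvkit.py autobin` run ahead of the reference step, so the flat reference is built over WGS-sized bins instead of the 356 calling regions. The output file names, the `run` helper, and the `build_wgs_reference` function are assumptions for illustration. The commands are only printed unless `RUN=1` is set, and this has not been verified inside sarek.

```shell
#!/usr/bin/env bash
# Sketch of a WGS reference build with explicit access and autobin steps.
# File names are placeholders; set RUN=1 to execute instead of only printing.
FASTA="${FASTA:-Homo_sapiens_assembly38.fasta}"
BAM="${BAM:-WGS2139_1.recal.bam}"

run() {
    printf '%s\n' "$*"                          # show the command
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi     # execute only on request
}

build_wgs_reference() {
    # 1. Sequence-accessible regions of the genome.
    run cnvkit.py access "$FASTA" -o access.bed
    # 2. Bin the accessible regions; autobin estimates a bin size from the BAM.
    run cnvkit.py autobin "$BAM" -m wgs -g access.bed \
        --target-output-bed wgs.target.bed \
        --antitarget-output-bed wgs.antitarget.bed
    # 3. Flat reference over the new bins, mirroring the existing reference call.
    run cnvkit.py reference --fasta "$FASTA" \
        --targets wgs.target.bed --antitargets wgs.antitarget.bed \
        --output cnvkit.reference.cnn
}

build_wgs_reference
```

The resulting `cnvkit.reference.cnn` would then be passed to `cnvkit.py batch` with `--reference`, as in the last script above.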
Hi @berguner! Good to know. My understanding from this line in the docs, "The pipeline executed by the batch command is equivalent to:", was that `batch` includes the mentioned steps and acts as a wrapper around them. In your experience, should the rest of the steps rather be run manually?
Hi @FriederikeHanssen, thanks for looking into this!
I just use `cnvkit.py batch` to do all the steps in one go, so I don't think it's necessary to separate these steps into multiple tasks. I also save the reference file using `--output-reference my_reference.cnn` for later use, i.e. with the --cnvkit_reference parameter.
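For reuse, the saved reference could be passed to sarek through the same parameters file as in the wrapper script above. A minimal sketch, where the `my_reference.cnn` path and the helper name `add_cnvkit_reference` are placeholders:

```shell
#!/usr/bin/env bash
# Add a precomputed CNVkit reference to the sarek parameters file.
# Both paths are placeholders for illustration.
PARAMS_YAML="${PARAMS_YAML:-./params.yml}"
CNVKIT_REFERENCE="${CNVKIT_REFERENCE:-./my_reference.cnn}"

add_cnvkit_reference() {
    # Skip the append if the key is already present, so reruns do not duplicate it.
    if ! grep -q '^cnvkit_reference' "$PARAMS_YAML" 2>/dev/null; then
        printf 'cnvkit_reference : "%s"\n' "$CNVKIT_REFERENCE" >> "$PARAMS_YAML"
    fi
}
```

Calling `add_cnvkit_reference` before `nextflow run ... -params-file $PARAMS_YAML` would have the same effect as giving `--cnvkit_reference` on the command line.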
great, so then we can close the issue? :) Or is there anything open from your side?
Sorry for the late response @FriederikeHanssen. The issue still persists in the pipeline when it's run without the --cnvkit_reference parameter. I think this should be fixed at some point for completeness' sake.
Hi, I am also experiencing this issue, how did you end up creating your reference @berguner?
Hi @grantn5, See my comment above: https://github.com/nf-core/sarek/issues/924#issuecomment-1477950531
It is slightly unclear what the exact issue is for me, it's probably been too long. Can you refresh my memory?
When the pipeline was run in WGS mode (without intervals), the CNVkit reference was generated with the default bed file, which only has the large genomic scaffolds. Running `cnvkit.py access` and `cnvkit.py autobin` in the CNVkit subworkflow might solve the issue.
Hey! Circling back to this, also in the context of the createpanelref pipeline. My understanding from the docs now is that we can actually omit the reference building in general, and `batch` should take care of everything depending on how the input is provided. Are you also understanding it like that? I would still vote for having the reference computation in the separate pipeline to keep sarek somewhat contained :D
But reading this, I would interpret it to mean that for WGS tumor-only data we can run:
cnvkit.py batch -n *cram --method wgs --fasta fasta.fasta [--annotate refFlat.txt]
which should output reference.cnn.
And in general in sarek we can omit the precomputation of the target and antitarget.cnn, and batch should actually take care of it all.
Hi @FriederikeHanssen,
Yes, `cnvkit.py batch` can generate the reference file, but you need to run it with the `--output-reference reference.cnn` parameter. You probably also need to provide at least one tumor bam/cram for it to run through.
Hi @berguner, to build the reference using the `batch` command you need to either A) provide a tumor bam/cram and a normal bam/cram, or B) provide just a normal bam/cram, i.e. `batch` will not create a reference in tumor-only mode.
You are right @grantn5 , I meant to say that at least one tumor bam file is needed in addition to the normal bam file(s) like you pointed out in scenario A. I am not sure about the scenario B as I don't usually run CNVkit without normals.
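The two scenarios could be written roughly as below. This is a sketch under assumptions: the BAM names, the `run` helper, and both function names are placeholders, and scenario B follows the normals-only form that the CNVkit documentation appears to describe for building a reference, so it is worth verifying before relying on it. Commands are only printed unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Print (or, with RUN=1, execute) the two reference-building forms discussed above.
# All file names are placeholders.
FASTA="${FASTA:-Homo_sapiens_assembly38.fasta}"

run() {
    printf '%s\n' "$*"                          # show the command
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi     # execute only on request
}

# Scenario A: tumor plus normal; processes the tumor and saves the reference.
reference_with_tumor() {
    run cnvkit.py batch tumor.bam --normal normal.bam \
        --method wgs --fasta "$FASTA" --output-reference reference.cnn
}

# Scenario B: normal sample(s) only; builds the reference without a tumor sample.
reference_normals_only() {
    run cnvkit.py batch --normal normal.bam \
        --method wgs --fasta "$FASTA" --output-reference reference.cnn
}
```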
Description of the bug
Hi, I ran the pipeline (`3.1.2`) for a bunch of tumor-only WGS samples. CNVkit reference bed files have both target and antitarget with improper bin counts:
There should be many more (~1 million) target bins and no antitarget bins since these aren't WES samples and `--intervals` wasn't set. As a result, the copy number estimates are completely inaccurate. I suspect the reason is that CNVkit was run without `--method wgs`.
Command used and terminal output
No response
Relevant files
No response
System information
No response