Open berguner opened 1 year ago
What was your command line when you run sarek?
Below is the part of my wrapper script setting up parameters and the command line:
PARAMS_YAML=./params.yml
echo "Creating parameters file: $PARAMS_YAML"
cat << EOF > $PARAMS_YAML
outdir : "${OUTDIR}"
tracedir : "${OUTDIR}/pipeline_info/"
input : "s3://input-bucket/preprocess_fastq_lists/${PROJECT}_${NGS_BATCH}.fastq_list.nf-core_sarek.csv"
multiqc_title : "${PROJECT} - ${NGS_BATCH}"
genome : "GATK.GRCh38"
nucleotides_per_second : 5000
split_fastq : 0
vep_cache : "s3://ref-bucket/ref/VEP/"
cf_chrom_len : "s3://out-bucket/ref/intervals/Homo_sapiens_assembly38.chromosomes_only_length.txt"
tools : "mutect2,cnvkit,vep,manta,tiddit"
skip_tools : "vcftools"
EOF
nextflow run nf-core/sarek -r 3.1.2 -c $NF_CONFIG -params-file $PARAMS_YAML -name "${PROJECT}_${NGS_BATCH}-${DATE_NOW}" -resume
since you didn't use the --wes
flag, cnvkit should have used "--method wgs --diagram --scatter"
cf:
https://github.com/nf-core/sarek/blob/c87f4eb694a7183e4f99c70fca0f1d4e91750b33/conf/modules/cnvkit.config#L42
Can you check the .command.sh
from the cnvkit process in the dedicated work folder?
I checked and saw that --method wgs
was used in the cnvkit.py batch
, however the reference.cnn
file has only 356
target regions. I dug a little deeper and found out that there are only 356
target regions in the wgs_calling_regions_noseconds.hg38.bed
file which is the default bed file for WGS runs.
Below are the contents of .command.sh
scripts of CNVkit tasks for one of the samples. I think the missing steps are cnvkit.py access
and cnvkit.py autobin
compared to cnvkit batch
steps listed in the documentation
#!/bin/bash -euo pipefail
cnvkit.py \
antitarget \
wgs_calling_regions_noseconds.hg38.bed \
--output igenomes.antitarget.bed \
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_ANTITARGET":
cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
#!/bin/bash -euo pipefail
cnvkit.py \
reference \
--fasta Homo_sapiens_assembly38.fasta \
--targets wgs_calling_regions_noseconds.hg38.bed \
--antitargets igenomes.antitarget.bed \
--output cnvkit.reference.cnn \
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_REFERENCE":
cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai WGS2139_1.recal.cram -@ 2 -o WGS2139_1.recal.bam
cnvkit.py \
batch \
WGS2139_1.recal.bam \
\
\
--reference cnvkit.reference.cnn \
\
--processes 2 \
--method wgs --diagram --scatter
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_TUMOR_ONLY_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
Hi @berguner ! Good to know, my understanding from this line in the docs "The pipeline executed by the batch command is equivalent to:" was that batch
includes the mentioned steps and acts as a wrapper around it. In your experience, should the rest of the steps also rather be run manually?
Hi @FriederikeHanssen , thanks for looking into this!
I just use cnvkit.py batch
to do all the steps in one go, so I don't think it's necessary to separate these steps into multiple tasks. I also save the reference file using --output-reference my_reference.cnn
file for later use, i.e with --cnvkit_reference parameter.
great, so then we can close the issue? :) Or is there anything open from your side?
Sorry for the late response @FriederikeHanssen. The issue still persists in the pipeline when it's run without the --cnvkit_reference parameter. I think this should be fixed at some point for completeness sake.
Hi, I am also experiencing this issue, how did you end up creating your reference @berguner?
Hi @grantn5, See my comment above: https://github.com/nf-core/sarek/issues/924#issuecomment-1477950531
It is slightly unclear what the exact issue is for me, it's probably been too long. Can you refresh my memory?
When the pipeline was run in WGS mode (without intervals), CNVkit reference was generated with the default bed file which only has the large genomic scaffolds. Running cnvkit.py access
and cnvkit.py autobin
in the CNVkit subworkflow might solve the issue.
Hey! Circling back to this also in the context of createpanelref pipeline. So my understanding from the docs now is, that we can actually omit the reference building in general and batch
should take care of everything depending on how the input is provided. are you also understanding it like that? I would still vote for havig the reference computation in the separate pipeline to somewhat keep the sarek contained :D
But reading this, I would interpret it, that for WGS tumor-only data we can run:
cnvkit.py -n *cram --method wgs --fasta fasta.fasta [--annotate refFlat.txt]
should output reference.cnn
And in general in sarek we can omit the precomputation of the target and antitarget.cnn and batch shoul dactually take care of it all
Hi @FriederikeHanssen ,
Yes, cnvkit.py batch
can generate the reference file but you need to run it with --output-reference reference.cnn
parameter. You probably need to provide at least one tumor bam/cram also for it to run through.
Hi @berguner to build the reference using on the batch
command you need to either A) provide a tumor bam/cram and normal bam/cram or b) just a normal bam/cram i.e batch
will not create a reference in tumor only mode
You are right @grantn5 , I meant to say that at least one tumor bam file is needed in addition to the normal bam file(s) like you pointed out in scenario A. I am not sure about the scenario B as I don't usually run CNVkit without normals.
Hey! @lescai did a lot of work recently on cnvkit. Can you confirm if this issue persists?
Hi @FriederikeHanssen I will look at it, can you please link the PR?
https://github.com/nf-core/sarek/pull/1502/files 🙏 Thank you @grantn5
Hi @FriederikeHanssen,
I just tested version 3.4.3 and it generates ~500k bins when a tumor-normal pair is provided which makes sense. However, it only generates ~350 bins when run in tumor-only mode meaning that the issue persists. I dug into the code a little bit and found that in tumor-only mode, cnvkit.py reference
is run with the option --targets wgs_calling_regions_noseconds.hg38.bed
prior to running cnvkit.py batch
and that bed file isn't split into bins. We can replace this bed file with a new one split into 1000bp bins as a quick fix or just warn the users that they should generate their own .cnn
file and feed it to --cnvkit_reference
if they have tumor-only WGS samples.
Hi there, I've read all the thread again (you've done a lot of checking - thanks!). Besides having a look into the use of a default bed when running tumour-only or germline mode WGS analysis, I believe the main issue here is a somehow not optimal use of the pipeline. Meaning: if I were to run a tumour-only or germline analysis, even when it's WGS and not WES, I'd create a background from normal samples first, and then feed it to sarek with the appropriate parameter, recently introduced into those two subworkflows. We still should review why the default bed is passed as target for those subworkflows, but probably that issue would not appear anyway if reference was created separately as @FriederikeHanssen mentioned. Perhaps we should highlight this in the docs, although it might be obvious for one working on CNVs.
Yeah, users should provide a reference created from a pool of protocol-matched normal samples when they don't have matched normal samples. I would even say that we should enforce this and disable CNVkit in tumor-only mode if a --cnvkit_reference
wasn't provided. This would probably save inexperienced users from compute time and downstream analysis headaches. Of course we should explain this clearly in the documentation.
Yeah, users should provide a reference created from a pool of protocol-matched normal samples when they don't have matched normal samples. I would even say that we should enforce this and disable CNVkit in tumor-only mode if a
--cnvkit_reference
wasn't provided. This would probably save inexperienced users from compute time and downstream analysis headaches. Of course we should explain this clearly in the documentation.
I agree with @berguner and @lescai Users running CNVkit in tumor-only mode should always provide a reference even if it isn't matched as per the CNVkit documentation. As the segmentation will be far more accurate and as pointed out will not end up with spurious bins
Description of the bug
Hi, I ran the pipeline (
3.1.2
) for a bunch of tumor-only WGS samples. CNVkit reference bed files have both target and antitarget with improper bin counts:There should be many more (~1 million) target bins and no antitarget bins since these aren't WES samples and
--intervals
wasn't set. As a result, the copy number estimates are completely inaccurate. I suspect the reason is that CNVkit was run without--method wgs
.Command used and terminal output
No response
Relevant files
No response
System information
No response