CNVkit reference issue for WGS tumor-only samples

berguner commented 1 year ago

Description of the bug

Hi, I ran the pipeline (3.1.2) for a bunch of tumor-only WGS samples. CNVkit reference bed files have both target and antitarget with improper bin counts:

wc -l *.bed
 1187 cnvkit.reference.antitarget-tmp.bed
  356 cnvkit.reference.target-tmp.bed

There should be many more (~1 million) target bins and no antitarget bins since these aren't WES samples and --intervals wasn't set. As a result, the copy number estimates are completely inaccurate. I suspect the reason is that CNVkit was run without --method wgs.

Command used and terminal output

No response

Relevant files

No response

System information

No response

maxulysse commented 1 year ago

What was your command line when you run sarek?

berguner commented 1 year ago

Below is the part of my wrapper script setting up parameters and the command line:

PARAMS_YAML=./params.yml
echo "Creating parameters file: $PARAMS_YAML"

cat << EOF > $PARAMS_YAML
outdir : "${OUTDIR}"
tracedir : "${OUTDIR}/pipeline_info/"
input : "s3://input-bucket/preprocess_fastq_lists/${PROJECT}_${NGS_BATCH}.fastq_list.nf-core_sarek.csv"
multiqc_title : "${PROJECT} - ${NGS_BATCH}"
genome : "GATK.GRCh38"
nucleotides_per_second : 5000
split_fastq : 0
vep_cache : "s3://ref-bucket/ref/VEP/"
cf_chrom_len : "s3://out-bucket/ref/intervals/Homo_sapiens_assembly38.chromosomes_only_length.txt"
tools : "mutect2,cnvkit,vep,manta,tiddit"
skip_tools : "vcftools"
EOF

nextflow run nf-core/sarek -r 3.1.2 -c $NF_CONFIG -params-file $PARAMS_YAML -name "${PROJECT}_${NGS_BATCH}-${DATE_NOW}" -resume

maxulysse commented 1 year ago

since you didn't use the --wes flag, cnvkit should have used "--method wgs --diagram --scatter" cf: https://github.com/nf-core/sarek/blob/c87f4eb694a7183e4f99c70fca0f1d4e91750b33/conf/modules/cnvkit.config#L42 Can you check the .command.sh from the cnvkit process in the dedicated work folder?

berguner commented 1 year ago

I checked and saw that --method wgs was used in the cnvkit.py batch, however the reference.cnn file has only 356 target regions. I dug a little deeper and found out that there are only 356 target regions in the wgs_calling_regions_noseconds.hg38.bed file which is the default bed file for WGS runs.

Below are the contents of .command.sh scripts of CNVkit tasks for one of the samples. I think the missing steps are cnvkit.py access and cnvkit.py autobin compared to cnvkit batch steps listed in the documentation

#!/bin/bash -euo pipefail
cnvkit.py \
    antitarget \
    wgs_calling_regions_noseconds.hg38.bed \
    --output igenomes.antitarget.bed \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_ANTITARGET":
    cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS

#!/bin/bash -euo pipefail
cnvkit.py \
    reference \
    --fasta Homo_sapiens_assembly38.fasta \
    --targets wgs_calling_regions_noseconds.hg38.bed \
    --antitargets igenomes.antitarget.bed \
    --output cnvkit.reference.cnn \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_REFERENCE":
    cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS

#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai WGS2139_1.recal.cram -@ 2 -o WGS2139_1.recal.bam

cnvkit.py \
    batch \
    WGS2139_1.recal.bam \
     \
     \
    --reference cnvkit.reference.cnn \
     \
    --processes 2 \
    --method wgs --diagram --scatter

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_TUMOR_ONLY_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
    cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS

FriederikeHanssen commented 1 year ago

Hi @berguner ! Good to know, my understanding from this line in the docs "The pipeline executed by the batch command is equivalent to:" was that batch includes the mentioned steps and acts as a wrapper around it. In your experience, should the rest of the steps also rather be run manually?

berguner commented 1 year ago

Hi @FriederikeHanssen , thanks for looking into this! I just use cnvkit.py batch to do all the steps in one go, so I don't think it's necessary to separate these steps into multiple tasks. I also save the reference file using --output-reference my_reference.cnn file for later use, i.e with --cnvkit_reference parameter.

FriederikeHanssen commented 1 year ago

great, so then we can close the issue? :) Or is there anything open from your side?

berguner commented 1 year ago

Sorry for the late response @FriederikeHanssen. The issue still persists in the pipeline when it's run without the --cnvkit_reference parameter. I think this should be fixed at some point for completeness sake.

grantn5 commented 1 year ago

Hi, I am also experiencing this issue, how did you end up creating your reference @berguner?

berguner commented 1 year ago

Hi @grantn5, See my comment above: https://github.com/nf-core/sarek/issues/924#issuecomment-1477950531

FriederikeHanssen commented 1 year ago

It is slightly unclear what the exact issue is for me, it's probably been too long. Can you refresh my memory?

berguner commented 1 year ago

When the pipeline was run in WGS mode (without intervals), CNVkit reference was generated with the default bed file which only has the large genomic scaffolds. Running cnvkit.py access and cnvkit.py autobin in the CNVkit subworkflow might solve the issue.

FriederikeHanssen commented 1 year ago

Hey! Circling back to this also in the context of createpanelref pipeline. So my understanding from the docs now is, that we can actually omit the reference building in general and batch should take care of everything depending on how the input is provided. are you also understanding it like that? I would still vote for havig the reference computation in the separate pipeline to somewhat keep the sarek contained :D

But reading this, I would interpret it, that for WGS tumor-only data we can run:

cnvkit.py -n *cram --method wgs --fasta fasta.fasta [--annotate refFlat.txt]

should output reference.cnn

And in general in sarek we can omit the precomputation of the target and antitarget.cnn and batch shoul dactually take care of it all

berguner commented 1 year ago

Hi @FriederikeHanssen , Yes, cnvkit.py batch can generate the reference file but you need to run it with --output-reference reference.cnn parameter. You probably need to provide at least one tumor bam/cram also for it to run through.

grantn5 commented 1 year ago

Hi @berguner to build the reference using on the batch command you need to either A) provide a tumor bam/cram and normal bam/cram or b) just a normal bam/cram i.e batch will not create a reference in tumor only mode

berguner commented 1 year ago

You are right @grantn5 , I meant to say that at least one tumor bam file is needed in addition to the normal bam file(s) like you pointed out in scenario A. I am not sure about the scenario B as I don't usually run CNVkit without normals.

FriederikeHanssen commented 3 months ago

Hey! @lescai did a lot of work recently on cnvkit. Can you confirm if this issue persists?

grantn5 commented 3 months ago

Hi @FriederikeHanssen I will look at it, can you please link the PR?

FriederikeHanssen commented 3 months ago

https://github.com/nf-core/sarek/pull/1502/files 🙏 Thank you @grantn5

berguner commented 3 months ago

Hi @FriederikeHanssen, I just tested version 3.4.3 and it generates ~500k bins when a tumor-normal pair is provided which makes sense. However, it only generates ~350 bins when run in tumor-only mode meaning that the issue persists. I dug into the code a little bit and found that in tumor-only mode, cnvkit.py reference is run with the option --targets wgs_calling_regions_noseconds.hg38.bed prior to running cnvkit.py batch and that bed file isn't split into bins. We can replace this bed file with a new one split into 1000bp bins as a quick fix or just warn the users that they should generate their own .cnn file and feed it to --cnvkit_reference if they have tumor-only WGS samples.

lescai commented 3 months ago

Hi there, I've read all the thread again (you've done a lot of checking - thanks!). Besides having a look into the use of a default bed when running tumour-only or germline mode WGS analysis, I believe the main issue here is a somehow not optimal use of the pipeline. Meaning: if I were to run a tumour-only or germline analysis, even when it's WGS and not WES, I'd create a background from normal samples first, and then feed it to sarek with the appropriate parameter, recently introduced into those two subworkflows. We still should review why the default bed is passed as target for those subworkflows, but probably that issue would not appear anyway if reference was created separately as @FriederikeHanssen mentioned. Perhaps we should highlight this in the docs, although it might be obvious for one working on CNVs.

berguner commented 3 months ago

Yeah, users should provide a reference created from a pool of protocol-matched normal samples when they don't have matched normal samples. I would even say that we should enforce this and disable CNVkit in tumor-only mode if a --cnvkit_reference wasn't provided. This would probably save inexperienced users from compute time and downstream analysis headaches. Of course we should explain this clearly in the documentation.

grantn5 commented 2 months ago

Yeah, users should provide a reference created from a pool of protocol-matched normal samples when they don't have matched normal samples. I would even say that we should enforce this and disable CNVkit in tumor-only mode if a --cnvkit_reference wasn't provided. This would probably save inexperienced users from compute time and downstream analysis headaches. Of course we should explain this clearly in the documentation.

I agree with @berguner and @lescai Users running CNVkit in tumor-only mode should always provide a reference even if it isn't matched as per the CNVkit documentation. As the segmentation will be far more accurate and as pointed out will not end up with spurious bins

nf-core / sarek