nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
351 stars 388 forks source link

CNVkit reference issue for WGS tumor-only samples #924

Open berguner opened 1 year ago

berguner commented 1 year ago

Description of the bug

Hi, I ran the pipeline (3.1.2) for a bunch of tumor-only WGS samples. CNVkit reference bed files have both target and antitarget with improper bin counts:

wc -l *.bed
 1187 cnvkit.reference.antitarget-tmp.bed
  356 cnvkit.reference.target-tmp.bed

There should be many more (~1 million) target bins and no antitarget bins since these aren't WES samples and --intervals wasn't set. As a result, the copy number estimates are completely inaccurate. I suspect the reason is that CNVkit was run without --method wgs.

Command used and terminal output

No response

Relevant files

No response

System information

No response

maxulysse commented 1 year ago

What was your command line when you run sarek?

berguner commented 1 year ago

Below is the part of my wrapper script setting up parameters and the command line:

PARAMS_YAML=./params.yml
echo "Creating parameters file: $PARAMS_YAML"

cat << EOF > $PARAMS_YAML
outdir : "${OUTDIR}"
tracedir : "${OUTDIR}/pipeline_info/"
input : "s3://input-bucket/preprocess_fastq_lists/${PROJECT}_${NGS_BATCH}.fastq_list.nf-core_sarek.csv"
multiqc_title : "${PROJECT} - ${NGS_BATCH}"
genome : "GATK.GRCh38"
nucleotides_per_second : 5000
split_fastq : 0
vep_cache : "s3://ref-bucket/ref/VEP/"
cf_chrom_len : "s3://out-bucket/ref/intervals/Homo_sapiens_assembly38.chromosomes_only_length.txt"
tools : "mutect2,cnvkit,vep,manta,tiddit"
skip_tools : "vcftools"
EOF

nextflow run nf-core/sarek -r 3.1.2 -c $NF_CONFIG -params-file $PARAMS_YAML -name "${PROJECT}_${NGS_BATCH}-${DATE_NOW}" -resume
maxulysse commented 1 year ago

since you didn't use the --wes flag, cnvkit should have used "--method wgs --diagram --scatter" cf: https://github.com/nf-core/sarek/blob/c87f4eb694a7183e4f99c70fca0f1d4e91750b33/conf/modules/cnvkit.config#L42 Can you check the .command.sh from the cnvkit process in the dedicated work folder?

berguner commented 1 year ago

I checked and saw that --method wgs was used in the cnvkit.py batch, however the reference.cnn file has only 356 target regions. I dug a little deeper and found out that there are only 356 target regions in the wgs_calling_regions_noseconds.hg38.bed file which is the default bed file for WGS runs.

Below are the contents of .command.sh scripts of CNVkit tasks for one of the samples. I think the missing steps are cnvkit.py access and cnvkit.py autobin compared to cnvkit batch steps listed in the documentation

#!/bin/bash -euo pipefail
cnvkit.py \
    antitarget \
    wgs_calling_regions_noseconds.hg38.bed \
    --output igenomes.antitarget.bed \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_ANTITARGET":
    cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
#!/bin/bash -euo pipefail
cnvkit.py \
    reference \
    --fasta Homo_sapiens_assembly38.fasta \
    --targets wgs_calling_regions_noseconds.hg38.bed \
    --antitargets igenomes.antitarget.bed \
    --output cnvkit.reference.cnn \

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:PREPARE_REFERENCE_CNVKIT:CNVKIT_REFERENCE":
    cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai WGS2139_1.recal.cram -@ 2 -o WGS2139_1.recal.bam

cnvkit.py \
    batch \
    WGS2139_1.recal.bam \
     \
     \
    --reference cnvkit.reference.cnn \
     \
    --processes 2 \
    --method wgs --diagram --scatter

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_TUMOR_ONLY_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
    cnvkit: $(cnvkit.py version | sed -e "s/cnvkit v//g")
END_VERSIONS
FriederikeHanssen commented 1 year ago

Hi @berguner ! Good to know, my understanding from this line in the docs "The pipeline executed by the batch command is equivalent to:" was that batch includes the mentioned steps and acts as a wrapper around it. In your experience, should the rest of the steps also rather be run manually?

berguner commented 1 year ago

Hi @FriederikeHanssen , thanks for looking into this! I just use cnvkit.py batch to do all the steps in one go, so I don't think it's necessary to separate these steps into multiple tasks. I also save the reference file using --output-reference my_reference.cnn file for later use, i.e with --cnvkit_reference parameter.

FriederikeHanssen commented 1 year ago

great, so then we can close the issue? :) Or is there anything open from your side?

berguner commented 1 year ago

Sorry for the late response @FriederikeHanssen. The issue still persists in the pipeline when it's run without the --cnvkit_reference parameter. I think this should be fixed at some point for completeness sake.

grantn5 commented 1 year ago

Hi, I am also experiencing this issue, how did you end up creating your reference @berguner?

berguner commented 1 year ago

Hi @grantn5, See my comment above: https://github.com/nf-core/sarek/issues/924#issuecomment-1477950531

FriederikeHanssen commented 1 year ago

It is slightly unclear what the exact issue is for me, it's probably been too long. Can you refresh my memory?

berguner commented 1 year ago

When the pipeline was run in WGS mode (without intervals), CNVkit reference was generated with the default bed file which only has the large genomic scaffolds. Running cnvkit.py access and cnvkit.py autobin in the CNVkit subworkflow might solve the issue.

FriederikeHanssen commented 11 months ago

Hey! Circling back to this also in the context of createpanelref pipeline. So my understanding from the docs now is, that we can actually omit the reference building in general and batch should take care of everything depending on how the input is provided. are you also understanding it like that? I would still vote for havig the reference computation in the separate pipeline to somewhat keep the sarek contained :D

But reading this, I would interpret it, that for WGS tumor-only data we can run:

cnvkit.py -n *cram --method wgs --fasta fasta.fasta [--annotate refFlat.txt]

should output reference.cnn

And in general in sarek we can omit the precomputation of the target and antitarget.cnn and batch shoul dactually take care of it all

berguner commented 11 months ago

Hi @FriederikeHanssen , Yes, cnvkit.py batch can generate the reference file but you need to run it with --output-reference reference.cnn parameter. You probably need to provide at least one tumor bam/cram also for it to run through.

grantn5 commented 10 months ago

Hi @berguner to build the reference using on the batch command you need to either A) provide a tumor bam/cram and normal bam/cram or b) just a normal bam/cram i.e batch will not create a reference in tumor only mode

berguner commented 10 months ago

You are right @grantn5 , I meant to say that at least one tumor bam file is needed in addition to the normal bam file(s) like you pointed out in scenario A. I am not sure about the scenario B as I don't usually run CNVkit without normals.