nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
351 stars 388 forks source link

Output files for CNVkit incorrect #1251

Open parth2608 opened 9 months ago

parth2608 commented 9 months ago

Description of the bug

Hello, I am a newbie to nf-core sarek and I am running it on WES data for tumor-only, using CNVkit tool. Even though the pipeline completes successfully, the CNVkit tool directory in the variant_calling directory does not produce all the required outputs.

The output files generated in my case are:

I have provided the command that I run.

Command used and terminal output

nextflow run nf-core-sarek-3.2.3/workflow --input design.csv --outdir output/ --genome GATK.GRCh38 -profile singularity --wes -c nfcore-RT-wes.config --tools cnvkit --intervals S03723314_Padded_hg38_Agi_v4_PDMR_3_column_only_primary.bed

Relevant files

No response

System information

achakroun commented 8 months ago

Hello, I have the same issue with a couple of missing cnvkit output files knowing that the pipeline has completed successfully. Currently, I am processing the cnvkit manually to end up with the cnvkit calls but it don't really know how to make things done by sarek. Any help?

mhjiang97 commented 8 months ago

Hi, Parth and achakroun!

I am quite sure that this issue you've encountered with is related to the bed file you provide for sarek. So, in a nutshell, provide for sarek a bed file with gene symbols. Hope this can help you to obtain full CNVkit output files.

FriederikeHanssen commented 8 months ago

we need more info. Can you please add the .nextflow.log, any custom config files, any parameter files, an overview of the published files, and what files are missing?

achakroun commented 8 months ago

[Uploading .nextflow.log…]()

Sure: Please find attachent .nextflow.log knowing that no custom config files were used. The command was as follow:

nextflow run nf-core/sarek -profile singularity -resume --max_cpus 10 --max_memory 40.GB --input ./EN00003548_HN00205087.csv --trim_fastq --genome GATK.GRCh38 --save_reference --outdir ./results/EN00003548_HN00205087 --wes --intervals /home/ahmed/NGS/Twist_HumanCoreExome-RefSeq_hg38/Twist_Exome_RefSeq_targets_hg38_200-pad.bed --tools deepvariant,haplotypecaller,freebayes,strelka,manta,tiddit,cnvkit

Output files: sample.antitargetcoverage.cnn sample.targetcoverage.cnn sample.cnr reference.cnn Twist_Exome_RefSeq_targets_hg38_200-pad.antitarget.bed Twist_Exome_RefSeq_targets_hg38_200-pad.target.bed

Missing files: sample-diagram.pdf sample-scatter.png sample.bintest.cns sample.cns sample.call.cns

The last one is the most important because it is just one step before vcf.

Thanks a lot

achakroun commented 8 months ago

.nextflow.log I am not sure that the log was correctly uploaded so here it is again

achakroun commented 8 months ago

Hello everyone, @mhjiang97 I added gene symbols to my interval list bed file and it didn't work. The pipeline completed successfully but expected cnvkit output files still missing. Attached the new .nextflow.log .nextflow.log

achakroun commented 8 months ago

Hello,

I am thinking to use my own targets and reference files specifically with cnvkit (maybe it will resolve the issue of the missing outputs). However, as these are paths and not params, how can I pass them to the pipeline? I tried -params-file with the following yaml:

process { withName: CNVKIT_BATCH { ext.args = "targets /path/to/my_cnvkit_targets.bed" ext.args = "reference /path/to/my_reference.cnn" } }

and it ended up with:

Cannot parse params file: cnvkit-twist-GRCh38.yaml

Since I am not familiar with yaml, I am sure my params file is incorrectly set. I am really sorry.

maxulysse commented 8 months ago

can you try passing this to the sarek command line: --cnvkit_reference /path/to/my_reference.cnn

achakroun commented 8 months ago

Ok, please what about the targets. I have tried:

--cnvkit_targets /path/to/my_cnvkit_targets.bed

and it ended up with:

WARN: The following invalid input values have been detected:

* --cnvkit_targets: /home/ahmed/NGS/Twist_HumanCoreExome-RefSeq_hg38/cnvkit_targets.bed
maxulysse commented 8 months ago

try with --intervals cf: https://nf-co.re/sarek/3.3.2/parameters#intervals

achakroun commented 8 months ago

Hello,

@maxulysse thanks a lot for the hint. It worked but I still need to tweak a bit using the cnvkit parameters. Please, could you let me know where can I find more information about the usage of cnvkit with sarek (e.g. where was the "--cnv_reference" argument described and is there other similar arguments we can use for cnvkit through sarek?)

Regards.

maxulysse commented 8 months ago

Check https://nf-co.re/sarek/dev/parameters, and be sure to check the show hidden on the right hand side to see all parameters.

achakroun commented 4 months ago

Hello, Sorry for posting again but unfortunately I thought the issue was solved but it turns out that it doesn't. Actually, I still have the same missing files with cnvkit, meaning that the output files I have are as follow:

sample.antitargetcoverage.cnn

sample.targetcoverage.cnn
sample.cnr
reference.cnn
Twist_Exome_RefSeq_targets_hg38_200-pad.antitarget.bed
Twist_Exome_RefSeq_targets_hg38_200-pad.target.bed

And the following files are missing:

sample-diagram.pdf
sample-scatter.png
sample.bintest.cns
sample.cns
sample.call.cns

Here is the command I run:

nextflow run nf-core/sarek \
    -profile singularity \
    -resume \
    --wes \
    --intervals cnvkit_targets.bed \
    --step variant_calling \
    --input ./015-24.csv \
    --genome GATK.GRCh38 \
    --outdir ./results/015-24_testing \
    --tools cnvkit \
    --cnvkit_reference my_reference.cnn

When I look into .command.sh corresponding to the process, it looks like the my_reference.cnn I am passing through --cnvkit_reference arg wasn't taken into account!

#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai 015-24.recal.cram -@ 2 -o 015-24.recal.bam
samtools index 015-24.recal.bam

cnvkit.py \
    batch \
    015-24.recal.bam \
    --normal  \
    --fasta Homo_sapiens_assembly38.fasta \
     \
    --targets cnvkit_targets.bed \
    --processes 2 \
    --method hybrid --diagram --scatter

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
    cnvkit: $(cnvkit.py version | sed -e 's/cnvkit v//g')
END_VERSIONS

Also, when I run the cnvkit.py batch command described in the .command.sh manually, I end up with the same missing files and the following warning in the STDOUT:

WARNING: Most antitarget bins (100.00%, 39620/39620) have low or no coverage; is this amplicon/WGS?
Antitargets are nan x more variable than targets

Do you have any suggestions? Thanks. A

FriederikeHanssen commented 4 months ago

Hey! Good news we are working on it @lescai has put in some time to clean this up and fix it :)

achakroun commented 4 months ago

Thanks a lot for the feedback @FriederikeHanssen. As I am already using cnvkit "manually", may be I can share the tools sequence that works perfectly for my germline wes-based cnv analysis, if you think it could be helpful. FYI, cnvkit batch doesn't work correctly in my case.

lescai commented 4 months ago

Hi @achakroun would you be able to share the log file you find in the work folder of your cnvkit batch process? I do have all the files you mentioned using sarek as is,

mysample.converted-diagram.pdf
mysample.converted-scatter.png
mysample.converted.antitargetcoverage.cnn
mysample.converted.bam
mysample.converted.bam.bai
mysample.converted.bintest.cns
mysample.converted.call.cns
mysample.converted.cnr
mysample.converted.cns
mysample.converted.cram -> [..]]
mysample.converted.targetcoverage.cnn
Homo_sapiens_assembly38.fasta -> [..]]
Homo_sapiens_assembly38.fasta.fai -> [..]]
reference.cnn
mytarget.antitarget.bed
mytarget.bed -> [..]]
mytarget.target.bed
versions.yml

my .command.sh looks exactly like the one you referenced above. so I'd need to have a look at the actual process execution. would be great if you could also share your .command.log content to make sure? cheers!

achakroun commented 4 months ago

Hi @lescai Please, find attached the .command.log from the cnvkit batch work folder and the .nextflow.log of the corresponding run:

.command.log .nextflow.log

Please, are you using your own intervals and cnn reference with cnvkit through the sarek argument --intervals and --cnvkit_reference? If so, is it OK that the cnvkit reference is not mentioned in .command.sh?

FYI, I am processing exclusively WES data for germline "normal" samples.

Please, do not hesitate to ask me for any further details. Looking forward to your feedback. Regards.

lescai commented 4 months ago

Thanks! I'll have a look and compare as soon as possible. the cnvkit_reference is precisely the issue I was meant to address in this fix, as in the current version the germline analysis does not use it.

achakroun commented 4 months ago

Actually, the workflow that works for us perfectly when using the standalone cnvkit is as follow:

1- cnvkit coverage with our _targets.bed 2- cnvkit coverage with our _antitargets.bed 3- cnvkit fix using our reference.cnn 4- cnvkit segment 5- cnvkit call with method set to cbs (which is default) 6- cnvkit export vcf

For the files _targets.bed, _antitargets.bed and reference.cnn, they are prepared once for all using the exome capture kit padded intervals as set by the manufacturer and a set (N=~60) of assumed "normal" samples as reference

Cheers.