nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Customize Preprocessing based on each tool #830

Open berguner opened 2 years ago

berguner commented 2 years ago

Description of feature

Hi, it seems that the CNVkit workflow uses cram_recalibrated files as input here: https://github.com/nf-core/sarek/blob/bcd7bf9cb98cddec27bb54fb47ee122c09388c02/subworkflows/nf-core/variantcalling/cnvkit/main.nf#L8-L12. As far as I remember, recalibrated files of WES or panel samples don't contain off-target reads because base recalibration is applied over the intervals only. It would be better to use CRAM files containing all the reads (cram_markduplicates?) for the CNVkit analysis so that off-target reads can be used. This is especially important for custom panels, which have fewer target regions than WES.

FriederikeHanssen commented 2 years ago

Hi! You can always achieve this by setting the parameter --skip_tools baserecalibrator. I will add some docs on this.

berguner commented 2 years ago

But wouldn't that also make the pipeline skip recalibration for SNV/indel calling? I usually run the pipeline with --tools "mutect2,vep,cnvkit".

FriederikeHanssen commented 1 year ago

Yes, currently it is only possible to do one "type" of pre-processing.

I would turn this into a bigger feature request:

For scenarios such as the one above, it would be nice to allow different types of preprocessing. This would require tool-based preprocessing steps that ideally would still be customizable.

Such as:

md + bqsr + haplotypecaller
no md + bqsr + deepvariant
md + no bqsr + cnvkit

(examples are completely made up)

This would likely entail quite a massive change in how we manage data flow at the moment.

FriederikeHanssen commented 1 year ago

Another option that currently works as a workaround:

Use the --step parameter to run only the tool that needs different preprocessing, starting from the respective CSV file available in results/csv. This avoids repeating mapping, for example, and saves time and resources.
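
A minimal sketch of how this two-pass workaround could look for the case discussed in this thread (mutect2 and vep on recalibrated data, cnvkit on non-recalibrated data). The samplesheet name, the docker profile, the output directory and the CSV file name markduplicates_no_table.csv are assumptions; check which CSV files your Sarek version actually writes to results/csv. The script only prints the two commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: paths, profile and CSV file name below are assumptions.
set -eu

outdir="results"               # assumed output directory of the first run
samplesheet="samplesheet.csv"  # hypothetical input samplesheet

# Pass 1: default preprocessing (including base recalibration),
# followed by SNV/indel calling and annotation.
pass1="nextflow run nf-core/sarek -profile docker \
  --input $samplesheet --outdir $outdir \
  --tools mutect2,vep"

# Pass 2: start directly at variant calling, using the CSV from pass 1 that
# points to the duplicate-marked, non-recalibrated CRAM files, and run only cnvkit.
pass2="nextflow run nf-core/sarek -profile docker \
  --step variant_calling \
  --input $outdir/csv/markduplicates_no_table.csv --outdir $outdir \
  --tools cnvkit"

# Print the commands for review; run them manually once they look right.
echo "$pass1"
echo "$pass2"
```

Mapping and duplicate marking happen only in the first pass; the second pass reuses those files, so only the CNVkit step is computed again.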