nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
383 stars 401 forks source link

Add option for outputting gvcfs instead of vcfs #908

Open asp8200 opened 1 year ago

asp8200 commented 1 year ago

Description of feature

It might be useful to add a CLI-option for outputting gvcf-files instead of vcf-files. It seems that with the current version that is only possible when running --joint_germline. (However, I could be wrong).

FriederikeHanssen commented 1 year ago

No that is correct, mostly based on some discussions we had with @nickhsmith that for most the final ones would be the most interesting . as well as the instructions for the single sample workflow by GATK here

We would have to see how we process with single sample filtering then. However, since that step seems to cause quite some issues for some, because of the time requirements it has, it might be good to also allow skipping it with a parameter. we could probably combine then both new parameters here.

asp8200 commented 1 year ago

Some work has now been done in this direction.

The new sentieon-haplotyper-flow is able to output vcf, gvcf or both:

https://github.com/nf-core/sarek/pull/1007

amizeranschi commented 1 year ago

Is there any update on this regarding GATK? There was some recent progress on a related issue, but that was also related to joint_germline, and not the single sample workflow: https://github.com/nf-core/sarek/issues/764

FriederikeHanssen commented 1 year ago

No, when running GATK in single sample it is not using -ERC GVCF, so no gvcf are produced as according to the gatk docs:https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels- . You could still force it with a custom config, but not sure if we should at it in by default

IsmailM commented 1 year ago

Thanks for the update.

I agree it probably shouldn't be the default, but it would be nice if it could be created if using a flag (--generate_gvcf etc.).

Would you mind providing details (or pointing me in the right direction/docs) on how to create a custom config to force the gvcf generation?

FriederikeHanssen commented 1 year ago

sure, this should work:

process {

     withName: GATK4_HAPLOTYPECALLER {
       ext.args = "-ERC GVCF"
      ext.prefix = { meta.num_intervals <= 1 ? "${meta.id}.haplotypecaller.g" : "${meta.id}.haplotypecaller.${intervals.simpleName}.g"  }
     }

      withName: 'MERGE_HAPLOTYPECALLER' {
        ext.prefix       = { "${meta.id}.haplotypecaller.g" }
    }
}

should work. single sample is by default enabled (CNNScoreVariants/FilterVarianttranches), but you can skip it if you'd like

IsmailM commented 11 months ago

Hi Friederike,

Many thanks for your help here.

We managed to produce the gVCF successfully with your custom config.

However, even though the CNNScoreVariants finishes running (after 2 days), it does not seem to doesn't actually work, so the next step fails with a pretty telling error:

02:13:39.663 WARN  IntelInflater - Zero Bytes Written : 0
02:13:39.676 INFO  FilterVariantTranches - Finished pass 0 through the variants
02:13:39.676 INFO  FilterVariantTranches - Found 0 SNPs and 0 indels with INFO score key:CNN_1D.
02:13:39.676 INFO  FilterVariantTranches - Found 0 SNPs and 4894774 indels in the resources.
02:13:39.677 INFO  FilterVariantTranches - Filtered 0 SNPs out of 0 and filtered 0 indels out of 0 with INFO score: CNN_1D.
02:13:39.681 INFO  FilterVariantTranches - Shutting down engine
[October 6, 2023 at 2:13:39 AM GMT] org.broadinstitute.hellbender.tools.walkers.vqsr.FilterVariantTranches done. Elapsed time: 29.85 minutes.
Runtime.totalMemory()=2281701376
***********************************************************************

A USER ERROR has occurred: Bad input: VCF contains no variants or no variants with INFO score key "CNN_1D"

Any recommendations/hints here would be really appreciated.

(I have had a look at the documentation/google for CNNScoreVariants and there doesn't seem to be an option for gVcf (or any hints on how to run with gvcf).

(or should we simply skip the CNNScoreVariants / FilterVarianttranches steps)?

FriederikeHanssen commented 11 months ago

I guess CNNScoreVariants doesn't like gvcf. Yes you can skip by setting --skip_tools haplotypecaller_filter. You could ask on GATk forum if there is such an optionm would be happy to add it .