statgen / topmed_variant_calling

Apache License 2.0

Please provide a way to run without requiring all CRAMs to be input at the same time #1

Closed wshands closed 5 years ago

wshands commented 6 years ago

In freeze3 of the variant caller, the joint genotyping step required all CRAMs to be available on disk (or available via gcsfuse or s3fuse). This requirement is incompatible with some cloud platforms, such as Broad's FireCloud, and with workflows written in the Workflow Description Language (WDL): they do not support gcsfuse or s3fuse, and they run each task on its own VM.

In particular, when the joint genotyping step is implemented as a WDL task, a VM is started on GCP and all inputs, including every CRAM file, are localized to the VM's disk. In our case, this meant localizing over 600 CRAM files to a single VM, which required a very large disk and a long localization time. This does not scale to large numbers of CRAMs.

Instead, it would be helpful to have a mode in which variant calling can be run on one CRAM at a time to produce a per-sample intermediate such as a gVCF (https://software.broadinstitute.org/gatk/documentation/article.php?id=3893). The calling step could then be scattered across multiple VMs, each of which only has to localize one CRAM and produce a single gVCF.
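To make the request concrete, here is a rough sketch of the scatter/gather shape I have in mind. It is only an illustration: it does not run the actual variant caller, the function names (`localize`, `call_one_sample`, `scatter_gather`) and the bucket paths are hypothetical, the per-sample output is a placeholder text file rather than a real gVCF, and threads stand in for the separate VMs a WDL scatter would use.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def localize(cram_uri: str, workdir: Path) -> Path:
    """Hypothetical stand-in for copying ONE CRAM onto a worker's local disk."""
    local = workdir / cram_uri.rsplit("/", 1)[-1]
    local.write_text(f"placeholder for {cram_uri}\n")
    return local


def call_one_sample(cram_uri: str, outdir: Path) -> Path:
    """Process a single CRAM in its own scratch space.

    Only the small per-sample output is kept; the localized CRAM is
    discarded when the scratch directory is cleaned up.
    """
    with tempfile.TemporaryDirectory() as scratch:
        local_cram = localize(cram_uri, Path(scratch))
        out = outdir / (local_cram.stem + ".per_sample.txt")
        # Placeholder for the real per-sample calling step (assumption).
        out.write_text(f"per-sample calls derived from {local_cram.name}\n")
    return out


def scatter_gather(cram_uris, outdir: Path, max_workers: int = 4):
    """Scatter: one task per CRAM. Gather: merge the per-sample outputs."""
    outdir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_sample = list(pool.map(lambda u: call_one_sample(u, outdir), cram_uris))
    merged = outdir / "merged.txt"
    # The gather step reads only per-sample outputs, never the CRAMs.
    merged.write_text("".join(p.read_text() for p in per_sample))
    return per_sample, merged


if __name__ == "__main__":
    uris = [f"gs://example-bucket/sample{i}.cram" for i in range(1, 4)]
    with tempfile.TemporaryDirectory() as d:
        outputs, merged = scatter_gather(uris, Path(d) / "out")
        print(len(outputs), "per-sample outputs merged into", merged.name)
```

The point of this shape is that no single worker ever needs more than one CRAM on disk, and the final merge step only needs the much smaller per-sample outputs.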

hyunminkang commented 6 years ago

You don't need all CRAMs in the same location for this version of the pipeline. Examples will be posted.

(Sent by email in reply to Walter Shands's notification of Wed, Oct 3, 2018 at 5:42 PM.)