googlegenomics / gcp-deepvariant-runner

This repository contains a docker container that runs DeepVariant on the Google Cloud Platform.
Apache License 2.0

Running Worker Machines Without External IP Address #26

Closed obsh closed 5 years ago

obsh commented 5 years ago

Hi,

Is there an option to create worker machines without external IP addresses? I'm trying to run a large number of pipelines in GCP and am stuck on the IP address quota.

Regards.

samanvp commented 5 years ago

Unfortunately, each worker needs an external IP to communicate back with the main runner. What I suggest is:

Unfortunately, there is no perfect solution; you need to compromise on either cost or time.
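
To make the trade-off concrete, here is a rough sketch of how the number of workers per pipeline limits how many pipelines can run at once. It assumes each worker holds exactly one external IP while it runs, ignores any address used by the runner itself, and uses a made-up quota of 64 addresses; the function name and the quota value are illustrative, not part of the runner.

```python
def max_concurrent_pipelines(ip_quota: int, workers_per_pipeline: int) -> int:
    """Pipelines that fit under the quota if every worker holds one external IP."""
    if workers_per_pipeline <= 0:
        raise ValueError("workers_per_pipeline must be positive")
    return ip_quota // workers_per_pipeline


# Assumption: hypothetical quota; look up the real value for your project/region.
HYPOTHETICAL_IP_QUOTA = 64

for workers in (16, 8, 4):
    fit = max_concurrent_pipelines(HYPOTHETICAL_IP_QUOTA, workers)
    print(f"{workers} workers per pipeline -> {fit} concurrent pipelines")
```

Fewer, larger workers free up addresses for more pipelines in parallel, but each individual pipeline is then likely to take longer or cost more per machine.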

Please let me know if you need help with setting the input arguments to optimize the cost based on the size of the BAM file and the type of analysis.

obsh commented 5 years ago

Thank you for the recommendations! I'll try to run with larger worker machines.

Sure, I'd appreciate any suggestions on the run configuration. I'm working on a cannabis variants project with a Googler, @allenday, and I think the goal is to minimize overall running time. We have 16,000 BAM files ranging in size from 60 MB to 17 GB, and reference .fa files from 300 MB to 1.2 GB. We need to produce VCF files. Based on our experience running a couple of pipelines, we selected make_examples worker machines with 60 GB of RAM and 10 CPUs, because VMs were failing with an "out of memory" error when given less RAM.

All arguments to the runner:

cmd: |
  ./opt/deepvariant_runner/bin/gcp_deepvariant_runner \
    --project "${PROJECT_ID}" \
    --zones "${ZONES}" \
    --docker_image "${DOCKER_IMAGE}" \
    --docker_image_gpu "${DOCKER_IMAGE_GPU}" \
    --gpu \
    --outfile "${OUTPUT_BUCKET}"/"${OUTPUT_FILE_NAME}" \
    --staging "${OUTPUT_BUCKET}"/"${STAGING_FOLDER_NAME}" \
    --model "${MODEL}" \
    --ref "${INPUT_REF}" \
    --bam "${INPUT_BAM}" \
    --shards 512 \
    --make_examples_workers 16 \
    --make_examples_cores_per_worker 10 \
    --make_examples_ram_per_worker_gb 60 \
    --make_examples_disk_per_worker_gb 200 \
    --call_variants_workers 16 \
    --call_variants_cores_per_worker 8 \
    --call_variants_ram_per_worker_gb 30 \
    --call_variants_disk_per_worker_gb 50
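
For reference, a back-of-the-envelope sketch of what the flags above add up to for a single pipeline. It assumes the shards are split evenly across the make_examples workers and that each worker holds one external IP while its stage runs; the `Stage` helper is hypothetical and only does the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Per-stage worker settings, copied from the runner flags above."""
    workers: int
    cores_per_worker: int
    ram_per_worker_gb: int
    disk_per_worker_gb: int

    def totals(self) -> dict:
        # Assumption: one external IP per worker while the stage is running.
        return {
            "external_ips": self.workers,
            "cores": self.workers * self.cores_per_worker,
            "ram_gb": self.workers * self.ram_per_worker_gb,
            "disk_gb": self.workers * self.disk_per_worker_gb,
        }


SHARDS = 512
make_examples = Stage(workers=16, cores_per_worker=10,
                      ram_per_worker_gb=60, disk_per_worker_gb=200)
call_variants = Stage(workers=16, cores_per_worker=8,
                      ram_per_worker_gb=30, disk_per_worker_gb=50)

# Assumption: shards are divided evenly among make_examples workers.
shards_per_worker = SHARDS // make_examples.workers

print("shards per make_examples worker:", shards_per_worker)
print("make_examples totals:", make_examples.totals())
print("call_variants totals:", call_variants.totals())
```

Under these assumptions, each make_examples worker handles 32 shards on 10 cores, and each stage occupies 16 external IP addresses per pipeline.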

obsh commented 5 years ago

With the following model and images:

MODEL=gs://deepvariant/models/DeepVariant/0.6.0/DeepVariant-inception_v3-0.6.0+cl-191676894.data-wgs_standard
IMAGE_VERSION=0.6.1
DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
DOCKER_IMAGE_GPU=gcr.io/deepvariant-docker/deepvariant_gpu:"${IMAGE_VERSION}"

samanvp commented 5 years ago

Here are a couple of small changes that will definitely make your run more efficient:

I just want to mention that all my experience optimizing these flags is with human-sample BAM files. I am not sure what the density of variants in cannabis is, so you might want to apply some fine-tuning on top of what I suggested.

Please let me know if there is anything else I can help with.

obsh commented 5 years ago

Thank you very much for the recommendations and for explaining the logic behind them! I'll try to run the new setup this week.