googlegenomics / pipelines-api-examples

Examples for the Google Genomics Pipelines API.
BSD 3-Clause "New" or "Revised" License

Increasing java heap space in cromwell driver #54

Closed vdauwera closed 7 years ago

vdauwera commented 7 years ago

I'm running into some memory issues where Cromwell itself is running out of heap space. One of our engineers tells me they sometimes ran out of memory when running large workflows (lots of inputs/outputs - large scatters) because the default is quite low.

I'm trying to get past this by tweaking cromwell_driver.py, adding a heap space flag in

  def start(self):
    """Start the Cromwell service."""
    if self.cromwell_proc:
      logging.info("Request to start Cromwell: already running")
      return

    self.cromwell_proc = subprocess.Popen([
        'java',
        '-Dconfig.file=' + self.cromwell_conf,
        '-Xmx4g',                                                    # <- line I added
        '-jar', self.cromwell_jar,
        'server'])

At the moment it's running so at least I know I didn't break the code... will confirm whether this gets past my memory issue.

In any case I thought it would be useful to document how one might be able to tweak Cromwell's memory settings. It might also be useful to expose this as a parameter of the wdl_runner in some form.
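Following up on that suggestion, one way to expose the heap size as a parameter could be a small command-builder helper like the sketch below. The `build_cromwell_command` name and the `max_heap` parameter are illustrative, not part of the actual wdl_runner code; the only load-bearing detail is that JVM options such as `-Xmx` must appear before `-jar`.

```python
# Hypothetical sketch: build the Cromwell launch command with a
# configurable JVM heap size instead of a hard-coded '-Xmx4g'.
# The function and parameter names are illustrative only.

def build_cromwell_command(cromwell_conf, cromwell_jar, max_heap='4g'):
  """Return the argv list used to start the Cromwell server.

  JVM options such as -Xmx must come before '-jar' so they are
  consumed by the JVM rather than passed to Cromwell itself.
  """
  return [
      'java',
      '-Dconfig.file=' + cromwell_conf,
      '-Xmx' + max_heap,
      '-jar', cromwell_jar,
      'server',
  ]
```

The `start()` method above could then call `subprocess.Popen(build_cromwell_command(self.cromwell_conf, self.cromwell_jar, max_heap))`, with `max_heap` plumbed through from a wdl_runner argument.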

vdauwera commented 7 years ago

Job still fails with the same error. I'm not sure how to check whether my modification actually took effect; would be nice to document that.
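One way to check whether the flag actually reached the JVM (a sketch, not something wdl_runner provides) is to inspect the running java process's command line, e.g. via `ps -o args= -p <pid>` or `/proc/<pid>/cmdline` on Linux, and pull out the heap setting:

```python
# Illustrative helper: scan a process command line for the first -Xmx
# flag and return its value (e.g. '4g'), or None if no heap was set.

def extract_max_heap(cmdline):
  """Return the value of the first -Xmx flag in a command line."""
  for token in cmdline.split():
    if token.startswith('-Xmx'):
      return token[len('-Xmx'):]
  return None
```

For a process started by cromwell_driver.py, the command line could be read from `/proc/<pid>/cmdline` using the pid of `self.cromwell_proc`.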

mbookman commented 7 years ago

Hi Geraldine,

The command that we provide to launch the wdl runner:

https://github.com/googlegenomics/pipelines-api-examples/tree/master/wdl_runner#run-the-following-command

points to the wdl_pipeline.yaml file:

https://github.com/googlegenomics/pipelines-api-examples/blob/master/wdl_runner/workflows/wdl_pipeline.yaml

which only includes the following requirement for the Cromwell node:

resources:
  minimumRamGb: 1

I believe that means the machine type that gets picked up is a g1-small (1.70 GB), which, at least when testing Cromwell 0.19, seemed fine for our WDL-based workflows. Perhaps memory requirements are higher with Cromwell 24.

For workflows that require more of Cromwell, users can either update the YAML file or pass the --memory flag to gcloud alpha genomics pipelines run.

I don't know if increasing the VM memory is sufficient or if it will need to be combined with updates to the amount of memory given to the JVM.
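If the two do need to be sized together, one simple policy is to give the JVM all of the VM's memory minus a fixed overhead. The helper below is a hedged sketch; the 1 GB overhead figure is an assumption that happens to match the `--memory 5` plus `-Xmx4g` combination discussed later in this thread.

```python
# Illustrative policy: derive an -Xmx value from the VM's memory,
# reserving a fixed overhead (1 GB by default) for the OS and other
# processes. Both the function name and the overhead are assumptions.

def heap_for_vm(vm_memory_gb, overhead_gb=1):
  """Suggest an -Xmx value (in whole GB) for a VM of the given size."""
  heap = max(1, int(vm_memory_gb) - overhead_gb)
  return '-Xmx%dg' % heap
```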

vdauwera commented 7 years ago

Thanks Matt, this is very helpful. I'll test requesting a beefier machine with the memory flag and see how far that gets me.

Is there anything in the logs that records the specs of the VM that was provisioned?

mbookman commented 7 years ago

You can get information about the VM from the operation record:

gcloud alpha genomics operations describe <op-id> \
   --format='value(metadata.runtimeMetadata.computeEngine)'

To specifically get just the machine type:

gcloud alpha genomics operations describe <op-id> \
   --format='value(metadata.runtimeMetadata.computeEngine.machineType)'

vdauwera commented 7 years ago

Ah sweet, that works. And indeed, specifying --memory 4 made the machine type switch from g1-small to n1-standard-2.

mbookman commented 7 years ago

This should be fixed by https://github.com/googlegenomics/pipelines-api-examples/commit/64f375595347d88493d4557ee28906d305775b87 and https://github.com/googlegenomics/pipelines-api-examples/commit/f7aae674397e6d2bfa536330d93093c96ba319ab.

Does the published example at https://cloud.google.com/genomics/v1alpha2/gatk need to be updated to include the --memory argument, or is this only needed for large inputs or different workflows?

vdauwera commented 7 years ago

Yes, the example should be updated; it seems the increase in memory consumption is related to the scatter size, independently of dataset size.


mbookman commented 7 years ago

Thanks Geraldine.

I have updated the GATK on Google Genomics documentation:

I used --memory 5, which still gives you an n1-standard-2. Since Cromwell requests a maximum heap size of 4 GB, allowing 1 GB for other overhead seems right.

mbookman commented 7 years ago

I updated the memory requirement in wdl_pipeline.yaml to 3.75 such that an n1-standard-1 is used instead of a g1-small. I believe that for most pipelines (and certainly for the published vcf_chr_count example) this should be sufficient.