populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Mt to es outside dataproc #817

Closed by MattWellie 4 months ago

MattWellie commented 4 months ago

Closes #800

Tested here with a small Exome MT -> ES (2GB): https://batch.hail.populationgenomics.org.au/batches/461296

~Untested~

Process:

  1. Copy the target MT into the VM
  2. Start a Hail local Spark instance
  3. Generate the ES password from secrets in a config file, which removes the need to pass a secret in plain text
  4. Flatten the MT into an HT using the same method as the existing script
  5. Create an ElasticSearchClient, modelled on the Hail version. It contains all the method calls we previously executed, instead of importing those methods from seqr-loading-pipelines
  6. Remove the ES index if it already exists (this is not expected to happen)
  7. Push a new index by name to the ES instance, clean up, and write a 'DONE' file
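
Steps 6 and 7 can be sketched roughly as below. This is an illustration of the control flow only, not the code in this PR: `IndexClient`, `InMemoryClient`, `load_index` and the `rows` argument are all hypothetical names, and the in-memory client stands in for a real ES connection so the sketch runs without a cluster.

```python
from pathlib import Path
from typing import Dict, List, Protocol


class IndexClient(Protocol):
    """Hypothetical minimal interface, not the API of any real ES client."""

    def index_exists(self, name: str) -> bool: ...

    def delete_index(self, name: str) -> None: ...

    def export_rows(self, name: str, rows: List[dict]) -> None: ...


class InMemoryClient:
    """Stand-in client (assumption) so the flow can run without an ES cluster."""

    def __init__(self) -> None:
        self.indices: Dict[str, List[dict]] = {}

    def index_exists(self, name: str) -> bool:
        return name in self.indices

    def delete_index(self, name: str) -> None:
        del self.indices[name]

    def export_rows(self, name: str, rows: List[dict]) -> None:
        self.indices[name] = list(rows)


def load_index(client: IndexClient, index_name: str, rows: List[dict], done_flag: Path) -> bool:
    """Replace any existing index, push the new one, then write the marker file.

    Returns True if an index with this name already existed and was removed.
    """
    replaced = client.index_exists(index_name)
    if replaced:
        # Step 6: an existing index is not expected, but remove it if present
        client.delete_index(index_name)

    # Step 7: push the new index by name
    client.export_rows(index_name, rows)

    # The marker is written only after the export returns without raising
    done_flag.write_text('DONE\n')
    return replaced
```

Writing the marker last means a failed export leaves no 'DONE' file behind, so a later run can tell the stage did not complete.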

This is complete theft:

some optimisation params:
https://github.com/broadinstitute/seqr-loading-pipelines/blob/c113106204165e22b7a8c629054e94533615e7d2/hail_scripts/elasticsearch/hail_elasticsearch_client.py#L196-L206
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
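
For context, the second link documents the ES-Hadoop connector settings. A minimal sketch of assembling such a settings dictionary is below. The keys are taken from the ES-Hadoop configuration reference, but `build_es_config`, its defaults, and the choice of which keys to set are assumptions for illustration, not what this PR or seqr-loading-pipelines actually uses. The batch sizes are placeholders, not tuned values.

```python
from typing import Dict, Optional


def build_es_config(
    host: str,
    port: int,
    password: Optional[str] = None,
    username: str = 'elastic',
    use_ssl: bool = True,
    wan_only: bool = True,
    batch_size_entries: int = 1000,
    batch_size_bytes: str = '1mb',
) -> Dict[str, str]:
    """Build an ES-Hadoop style settings dict (hypothetical helper).

    All values are strings, since the connector reads them as string properties.
    """
    config = {
        'es.nodes': host,
        'es.port': str(port),
        # Bulk-write sizing: a batch is flushed when either limit is reached
        'es.batch.size.entries': str(batch_size_entries),
        'es.batch.size.bytes': batch_size_bytes,
    }
    if use_ssl:
        config['es.net.ssl'] = 'true'
    if wan_only:
        # Connect only through the declared node(s), e.g. for a hosted cluster
        config['es.nodes.wan.only'] = 'true'
    if password is not None:
        # Credentials are added only when a password was supplied by the caller
        config['es.net.http.auth.user'] = username
        config['es.net.http.auth.pass'] = password
    return config
```

A dictionary like this is the kind of thing that could be passed as the `config` argument of Hail's `hl.export_elasticsearch`, with the password coming from the secret lookup rather than a plain-text argument.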

MattWellie commented 4 months ago

Driver job: https://batch.hail.populationgenomics.org.au/batches/461265/jobs/1
Worker Batch: https://batch.hail.populationgenomics.org.au/batches/461266

This run uses a different approach:

This test run was successful, though only with a tiny MT (~2MB total). It is just a proof of concept, but a success.

MattWellie commented 4 months ago

Closing as superseded by #829