populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Replace MtToEs stages Dataproc cluster usage with Hail Batch #800

Open EddieLF opened 1 week ago


There are several good reasons to reduce our reliance on Dataproc clusters. The most important are that they are less transparent than Hail Batch jobs, they require dedicated "start cluster", "submit to cluster", and "stop cluster" jobs in every workflow that uses them, and they are generally more cumbersome and less user friendly.

Dataproc clusters have largely been eliminated from CPG workflows where possible, but one last holdout is the MtToEs stage, along with the related MtToEsSv and MtToEsCnv stages.

In each of these cases, the dataproc_scripts/mt_to_es.py script is invoked in the same way, transforming an existing Hail MatrixTable (mt) into a seqr-ready Elasticsearch index:

script = (
    f'cpg_workflows/dataproc_scripts/mt_to_es.py '
    f'--mt-path {dataset_mt_path} '
    f'--es-index {index_name} '
    f'--done-flag-path {done_flag_path} '
    f'--es-password {es_password_string}'
)
...
j = dataproc.hail_dataproc_job(
    get_batch(),
    script,
    ...,
    depends_on=inputs.get_jobs(dataset),
)

We should improve this process by eliminating the Dataproc cluster: use a Hail Batch job to connect to the Elasticsearch cluster and write to it directly. Passing the Elasticsearch password as a plaintext command-line argument should also be removed.
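For the password, one option is to resolve it at job runtime from Google Secret Manager rather than interpolating it into the command. A minimal sketch, assuming hypothetical project and secret names (the real secret location would come from the workflow config):

```python
def secret_version_name(project: str, secret: str, version: str = 'latest') -> str:
    """Build the fully qualified Secret Manager resource name for a secret version."""
    return f'projects/{project}/secrets/{secret}/versions/{version}'


def get_es_password(project: str = 'cpg-common', secret: str = 'seqr-es-password') -> str:
    """Fetch the Elasticsearch password inside the Batch job.

    The project and secret names here are placeholders. Requires the
    google-cloud-secret-manager package and appropriate IAM permissions
    on the job's service account.
    """
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        name=secret_version_name(project, secret)
    )
    return response.payload.data.decode('utf-8')
```

This keeps the password out of job definitions, logs, and command lines entirely; only the job's service account needs accessor rights on the secret.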