There are several good reasons to reduce our reliance on Dataproc clusters: they are less transparent than Hail Batch jobs, they require dedicated "start cluster", "submit to cluster", and "stop cluster" jobs to run their workflows, and they are generally more cumbersome and less user-friendly.
Dataproc clusters have been eliminated from cpg workflows wherever possible; the last holdouts are the MtToEs, MtToEsSv, and MtToEsCnv stages.
Each of these stages invokes the dataproc_scripts/mt_to_es.py script in the same way, to transform an existing Hail MatrixTable (mt) into a seqr-ready Elasticsearch index.
We should improve this process by eliminating the Dataproc cluster and instead using a Hail Batch job that connects to the Elasticsearch cluster and writes to it directly. The plaintext es-password should also be removed.
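
As a starting point for removing the plaintext es-password, here is a minimal sketch of how a Hail Batch job might assemble its Elasticsearch connection settings from the environment instead of a command-line argument. The function name `build_es_config`, the `ES_PASSWORD` variable name, and the `seqr` username are hypothetical and chosen only for illustration; the `es.*` keys are standard elasticsearch-hadoop settings. How the secret reaches the job's environment (for example, from a secret manager) is assumed and not shown.

```python
from __future__ import annotations

import os


def build_es_config(username: str, password_env_var: str = "ES_PASSWORD") -> dict[str, str]:
    """Build elasticsearch-hadoop settings without taking the password as an argument.

    `password_env_var` is a hypothetical variable name; the job environment is
    assumed to have been populated with the secret before this runs.
    """
    password = os.environ.get(password_env_var)
    if not password:
        # Fail early rather than attempting an unauthenticated write.
        raise RuntimeError(f"Environment variable {password_env_var} is not set")

    return {
        # Basic-auth credentials, as understood by elasticsearch-hadoop.
        "es.net.http.auth.user": username,
        "es.net.http.auth.pass": password,
        # Assumption: the cluster is reached through a single external address,
        # so node discovery is disabled.
        "es.nodes.wan.only": "true",
    }


if __name__ == "__main__":
    # Placeholder value for demonstration only; never hard-code a real secret.
    os.environ.setdefault("ES_PASSWORD", "example-only")
    config = build_es_config("seqr")
    # Print the keys only, so the password never reaches the job log.
    print(sorted(config))
```

A dictionary like this could then be passed as the `config` argument of `hl.export_elasticsearch` from within the Batch job, so the password never appears in the submitted command line. Whether that call fits the seqr index layout that mt_to_es.py currently produces would need to be checked.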