
Hail configuration is incompatible with latest Dataproc version #3088

Closed ireneisdoomed closed 11 months ago

ireneisdoomed commented 1 year ago

Describe the bug
Steps from the Genetics ETL fail after the latest Dataproc release. Specifically, the simple step of reading a Parquet directory results in:

py4j.protocol.Py4JJavaError: An error occurred while calling o159.parquet.
: java.lang.NoClassDefFoundError: org/codehaus/janino/InternalCompilerException

Full traceback here.

Observed behaviour

I think this has to do with the Hail configuration we provide whenever we submit a job. We define specific parameters to configure Hail here. Something is overwriting or misconfiguring Spark when we provide them; this used to work before the image update.
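For reference, a sketch of what these properties amount to when applied to a Spark session directly (the jar path is the one from the job submission below; the Parquet path is a hypothetical stand-in):

    from pyspark.sql import SparkSession

    # A sketch of the Hail properties we pass with every job, applied
    # in-session; the jar path is the one baked into the Dataproc image.
    HAIL_JAR = "/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar"

    spark = (
        SparkSession.builder.appName("hail-config-repro")
        .config("spark.jars", HAIL_JAR)
        .config("spark.driver.extraClassPath", HAIL_JAR)
        .config("spark.executor.extraClassPath", "./hail-all-spark.jar")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.registrator", "is.hail.kryo.HailKryoRegistrator")
        .getOrCreate()
    )

    # With these settings, even a plain Parquet read now fails with the
    # Janino NoClassDefFoundError; the path here is a hypothetical example.
    df = spark.read.parquet("gs://some-bucket/some-dataset.parquet")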

Expected behaviour
I am testing with the GWAS Catalog step, which was working up until the image was updated.

To Reproduce
Steps to reproduce the behaviour:

  1. Create dev cluster with make create-dev-cluster (after changing the code version in the TOML)
  2. Send a PySpark job through the command line (this emulates the instructions we provide with the Python API in workflow_template.py; a sketch of the same submission through the Python client follows this list):
    gcloud dataproc jobs submit pyspark \
    gs://genetics_etl_python_playground/initialisation/cli.py \
    --cluster=${YOUR_CLUSTER} \
    --project=open-targets-genetics-dev \
    --region=europe-west1 \
    --properties='spark.jars=/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar,spark.driver.extraClassPath=/opt/conda/miniconda3/lib/python3.10/site-packages/hail/backend/hail-all-spark.jar,spark.executor.extraClassPath=./hail-all-spark.jar,spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator' \
    --py-files='gs://genetics_etl_python_playground/initialisation/1.0.0+il.dataproc/otgenetics-1.0.0+il.dataproc-py3-none-any.whl' \
    -- 'step=my_gwas_catalog' '--config-dir=/config' '--config-name=my_config'
  3. See the error saying that Janino is missing.
  4. Send a PySpark job without overriding the Spark properties for Hail:
    gcloud dataproc jobs submit pyspark \
    gs://genetics_etl_python_playground/initialisation/cli.py \
    --cluster=${YOUR_CLUSTER} \
    --project=open-targets-genetics-dev \
    --region=europe-west1 \
    --py-files='gs://genetics_etl_python_playground/initialisation/1.0.0+il.dataproc/otgenetics-1.0.0+il.dataproc-py3-none-any.whl' \
    -- 'step=my_gwas_catalog' '--config-dir=/config' '--config-name=my_config'
  5. See that the script no longer gets stuck reading data. We now run into a different problem, also related to Hail, when reading a CSV, and I don't yet know how to fix it. Full trace here.
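For context, a sketch of what the equivalent submission looks like through the Dataproc Python client, which is roughly what workflow_template.py automates (the cluster name is a placeholder):

    from google.cloud import dataproc_v1

    # A sketch of the step-2 submission through the Python client; cluster
    # name is a placeholder, everything else is copied from the gcloud call.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "europe-west1-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-dev-cluster"},  # placeholder
        "pyspark_job": {
            "main_python_file_uri": "gs://genetics_etl_python_playground/initialisation/cli.py",
            "python_file_uris": [
                "gs://genetics_etl_python_playground/initialisation/1.0.0+il.dataproc/otgenetics-1.0.0+il.dataproc-py3-none-any.whl"
            ],
            "args": ["step=my_gwas_catalog", "--config-dir=/config", "--config-name=my_config"],
            # Omitting "properties" here is step 4; adding the Hail
            # properties from step 2 reproduces the Janino error of step 3.
        },
    }

    client.submit_job(
        request={
            "project_id": "open-targets-genetics-dev",
            "region": "europe-west1",
            "job": job,
        }
    )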

Additional context
WIP

DSuveges commented 1 year ago

Is this incompatibility inherent to Dataproc vs Hail? If so, we should report it to the Hail team; they are quite fast to address issues. At this point, I don't see any reported bugs in this space.

ireneisdoomed commented 1 year ago

Partially, I think. I tested running hail.init() in an interactive session and it worked. The Java error appeared when initialising Hail on an already active session and providing it the context. So I am assuming the incompatibility comes from our Spark configuration + Hail.
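To make the two scenarios concrete, a sketch (the builder call stands in for whatever session our step has already created):

    import hail as hl
    from pyspark.sql import SparkSession

    # Scenario 1: letting Hail build its own context works in an
    # interactive session.
    # hl.init()

    # Scenario 2: initialising Hail on top of an already active session,
    # as our steps do, is where the Java error appears.
    spark = SparkSession.builder.getOrCreate()
    hl.init(sc=spark.sparkContext)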

I was testing mainly on the GWASCat step, which unnecessarily initialised Hail. If I removed this line, we could read Parquet files, but reading CSVs got stuck on something about the KryoSerializer, which I believe is a Hail dependency too. This is what I report in step 5 above, and where I got stuck.
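One way to check, on a live cluster, whether the Hail serializer settings are actually in effect when the CSV read gets stuck (a debugging sketch):

    from pyspark.sql import SparkSession

    # Dump the serializer-related settings from the live Spark context to
    # confirm whether Hail's Kryo registrator is what the executors see.
    spark = SparkSession.builder.getOrCreate()
    conf = dict(spark.sparkContext.getConf().getAll())
    print(conf.get("spark.serializer"))        # KryoSerializer expected
    print(conf.get("spark.kryo.registrator"))  # is.hail.kryo.HailKryoRegistrator?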

DSuveges commented 1 year ago

So, we have had issues with Hail from the beginning; the way we got it to work with our session was to implement some of Hail's initialisation actions (this is probably where the KryoSerializer issue comes from). I think there might have been some changes in this file that we could take a look at.

As far as I remember, this is the flag when using hailctl: --initialization-actions="gs://hail-common/hailctl/dataproc/0.2.95/init_notebook.py"

There might be some important updates to this file in more recent versions.
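If we want to try a newer init action, a sketch of pinning it explicitly when creating the cluster through the Python client (the cluster name is a placeholder and most cluster settings are omitted; make create-dev-cluster is what actually builds our clusters):

    from google.cloud import dataproc_v1

    # A sketch of pinning Hail's initialisation action to a specific
    # release; the 0.2.95 path is the one mentioned above and would be
    # bumped to match whatever Hail version the image ships.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "europe-west1-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "open-targets-genetics-dev",
        "cluster_name": "my-dev-cluster",  # placeholder
        "config": {
            "initialization_actions": [
                {"executable_file": "gs://hail-common/hailctl/dataproc/0.2.95/init_notebook.py"}
            ],
        },
    }

    client.create_cluster(
        request={
            "project_id": "open-targets-genetics-dev",
            "region": "europe-west1",
            "cluster": cluster,
        }
    )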

ireneisdoomed commented 11 months ago

I've come across this issue again, this time when trying to run the variant annotation step in a Jupyter notebook. This is a different scenario from the one reported originally, because I hadn't initialised a Spark context before calling Hail.

A very simple case reproduces the Janino error: the code crashes when moving from Hail to Spark.

import hail as hl

hl.init()
ht = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht",
    _load_refs=False,
)
ht.select_globals().head(2).to_spark(flatten=False)
>>> ...
>>> Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException

I've opened a ticket in their forum hoping they can provide support; you can see the full traceback there: https://discuss.hail.is/t/incompatibility-between-hail-and-spark-3-3-2/3616

ireneisdoomed commented 11 months ago

The Hail team made me aware of an issue that was introduced in version 0.2.123 and suggested downgrading to 0.2.122. I no longer see any Janino errors and I have been able to run the Variant Annotation step šŸ„³ Hail 0.2.125 is expected to be out this week and will solve the issue. I'll open a PR with the changes.
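Until 0.2.125 is out, a minimal guard sketch, assuming the project pins hail==0.2.122 in its dependencies:

    import hail as hl

    # Fail fast if one of the releases with the Janino regression sneaks
    # back in; 0.2.123 introduced the bug and 0.2.125 is expected to fix
    # it, so 0.2.124 is assumed affected too. hl.version() returns a
    # string like "0.2.122-<commit>".
    BROKEN = {"0.2.123", "0.2.124"}
    assert hl.version().split("-")[0] not in BROKEN, (
        f"Hail {hl.version()} has the Janino regression"
    )
    hl.init()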