Closed. ireneisdoomed closed this issue 11 months ago.
Is this incompatibility inherent to Dataproc vs. Hail? In that case we should report it to the Hail team; they are quite fast at addressing issues. At this point, I don't see any reported bugs in this space.
Partially, I think. I tested running `hail.init()` in an interactive session and it worked.
The Java error appeared when initialising Hail on an active session and providing the Spark context. So I am assuming the incompatibility comes from our Spark configuration plus Hail.
I was testing mainly on the GWAS Catalog step, which unnecessarily initialised Hail. If I removed that line, we could read Parquets, but it got stuck reading CSVs due to something about the KryoSerializer, which I believe is a Hail dependency too. This is what I report in step 6 above, and where I got stuck.
So we were having issues with Hail from the beginning; the way we got it to work with our session was to implement some of Hail's initialization actions (this is probably where the issue with the KryoSerializer comes from). I think there might have been some changes in that file that we can take a look at.
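For reference, a minimal sketch of the Spark properties that Hail documents as required; that our initialization action sets exactly these (and that an override of them causes the errors) is my assumption:

```python
# Spark properties Hail requires (property names per the Hail docs).
# If our session config overrides spark.serializer, Hail's Kryo
# registrator would never be installed, which could explain the
# KryoSerializer errors we saw when reading CSVs.
hail_spark_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
}

for key, value in hail_spark_conf.items():
    print(f"{key}={value}")
```

Comparing these two keys against what our session actually sets would confirm or rule out the serializer-override theory.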
As far as I remember, when using hailctl: `--initialization-actions="gs://hail-common/hailctl/dataproc/0.2.95/init_notebook.py"`
There might be some important updates to this file in more recent versions.
I've come across this issue again, this time when trying to run the variant annotation step in a Jupyter notebook. This is a different scenario from the one reported originally, because I haven't initialised a Spark context before calling Hail.
A very simple case reproduces the Janino error: the code crashes when moving from Hail to Spark.
```python
import hail as hl

hl.init()
ht = hl.read_table(
    "gs://gcp-public-data--gnomad/release/3.1.2/ht/genomes/gnomad.genomes.v3.1.2.sites.ht",
    _load_refs=False,
)
ht.select_globals().head(2).to_spark(flatten=False)
```

```
...
Error summary: ClassNotFoundException: org.codehaus.janino.InternalCompilerException
```
I've opened a ticket in their forum hoping they can provide support; you can see the full traceback there: https://discuss.hail.is/t/incompatibility-between-hail-and-spark-3-3-2/3616
The Hail team made me aware of an issue that was introduced with version 0.2.123 and suggested downgrading to 0.2.122. Now I don't see any Janino errors, and I have been able to run the variant annotation step 🥳 Hail 0.2.125 is expected to be out this week and will solve the issue. I'll open a PR with the changes.
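For anyone hitting the same error before the fix lands, pinning Hail below the regression (version numbers as reported above) is the interim workaround; a hypothetical requirements fragment:

```
# Pin Hail to 0.2.122: 0.2.123 introduced the Janino regression,
# and the fix is expected in 0.2.125.
hail==0.2.122
```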
**Describe the bug**
Running steps from the Genetics ETL fails after the latest Dataproc release. Specifically, the simple step of reading a Parquet directory results in:

Full traceback here.
**Observed behaviour**
I think this has to do with the Hail configuration we provide when we send any job. We define specific parameters to configure Hail here. Something is overwriting or misconfiguring Spark when we provide them. This used to work before.
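As an illustration of how the misconfiguration could happen, a sketch that merges a job's Spark config with the properties Hail documents; the property names come from the Hail docs, while the specific override shown is hypothetical:

```python
# Illustrative sketch: merge a job's Spark config with the properties
# Hail needs. Dict merge order decides who wins; if our config is
# applied last and sets a different serializer, Hail breaks.
our_job_conf = {
    # hypothetical override coming from our job submission
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
}
hail_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
}

# our config wins here, so Hail's required serializer is silently lost
merged = {**hail_conf, **our_job_conf}
properties_flag = ",".join(f"{k}={v}" for k, v in sorted(merged.items()))
print(properties_flag)
```

This is only a model of the suspected failure mode, not our actual submission code.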
**Expected behaviour**
I am testing with the GWAS Catalog step, which was working up until the image was updated.
**To Reproduce**
Steps to reproduce the behaviour:
1. `make create-dev-cluster` (after changing the code version in the TOML)
2. `workflow_template.py`

**Additional context**
WIP