hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

hl.init should accept a conf parameter #7080

Closed: danking closed this issue 4 years ago

danking commented 5 years ago

When Hail is pip-installed, you cannot use it with an already constructed SparkContext unless that SparkContext's classpath includes the Hail JAR file. It is somewhat annoying to find the Hail JAR location and add it to the classpath.

The primary reason users want to pass an already constructed SparkContext is to specify configuration parameters. hl.init should accept a conf parameter, as either a SparkConf or a dict.
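
For illustration, the proposed call might look like this (a sketch only; the conf parameter does not exist yet, and the keys shown are ordinary Spark properties):

import hail as hl

# hypothetical: pass Spark properties directly, without constructing a SparkContext
hl.init(conf={'spark.driver.memory': '8g',
              'spark.executor.memory': '4g'})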

tennex-adam commented 5 years ago

Is there a possibility of adding an exception for users who need to pass a SparkContext to hl.init?

In our use case, our notebooks use the sparkmagic kernel to communicate with Livy running on multiple clusters. Sparkmagic automatically creates a SparkContext when connecting to Livy on the cluster master node (AWS EMR), and in releases prior to 0.2.20 we passed that SparkContext to hl.init.

In our case, the JAR is on our classpath.

tpoterba commented 5 years ago

it's already there!

hl.init(sc=sc)

tpoterba commented 5 years ago

ah, I understand

tpoterba commented 5 years ago

essentially you need to add the following configuration in Spark:

# location of the JAR bundled with the pip-installed hail package
HAIL_JAR_LOCATION=/path/to/python/site-packages/hail/hail-all-spark.jar
# ship the JAR to the cluster and put it on the driver's classpath
spark.jars=${HAIL_JAR_LOCATION}
spark.driver.extraClassPath=${HAIL_JAR_LOCATION}
# executors receive shipped jars in their working directory, hence the relative path
spark.executor.extraClassPath=./hail-all-spark.jar
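
Put together in Python, the same settings can be applied when you build the SparkContext yourself (a sketch; it assumes the pip layout above, where the JAR sits inside the installed hail package directory):

import os
from pyspark import SparkConf, SparkContext
import hail as hl

# locate the JAR bundled with the pip-installed hail package
hail_jar = os.path.join(os.path.dirname(hl.__file__), 'hail-all-spark.jar')

conf = (SparkConf()
        .set('spark.jars', hail_jar)
        .set('spark.driver.extraClassPath', hail_jar)
        # executors see shipped jars in their working directory
        .set('spark.executor.extraClassPath', './hail-all-spark.jar'))

sc = SparkContext(conf=conf)
hl.init(sc=sc)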
tennex-adam commented 5 years ago

Thanks for the quick help! I've just built a new image from master. I'll test and report back.

tennex-adam commented 5 years ago

I've validated that our setup meets those requirements, but we're now hitting a FatalError introduced by a commit a few weeks ago.

https://github.com/hail-is/hail/blob/a0e8eb81e0f4d7ad446723e7cc04d4c6ac4ad066/hail/python/hail/context.py#L59-L67

If I revert this file, we're able to pass in the existing SparkContext with the expected hl.init(sc=sc).

As general feedback, it may be better to warn here than to force an exit. I may be wrong, but I don't see a way around this for people using remote notebooks to talk to Spark via Livy.
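
For concreteness, the suggested change amounts to something like this (a hypothetical sketch; version_check_failed stands in for whatever condition the linked lines in context.py actually test):

import warnings

if version_check_failed:  # placeholder for the existing check in context.py
    # warn instead of exiting, so Livy/sparkmagic users can still proceed
    warnings.warn('pip-installed Hail may not match the JAR on the Spark '
                  'classpath; proceed at your own risk')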

tpoterba commented 5 years ago

oh, whoops, we can definitely make that a warning.

tpoterba commented 5 years ago

#7172

tennex-adam commented 5 years ago

Tested locally via Livy and my notebook is happy. I really appreciate your help here!

tpoterba commented 4 years ago

done