hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

Getting Started Guide for PyHail #1218

Closed jjfarrell closed 7 years ago

jjfarrell commented 7 years ago

I see the docs for the PyHail API but is there a getting started guide available yet? Also are there any plans to make a PyHail package available for installation through PyPI?

tpoterba commented 7 years ago

The getting started, tutorial, and command documentation are all coming to python in the next week or two. We can certainly look into registering Hail on PyPI, but I'm not sure how versioning works there -- with the current pace of development, we may want to hold off on that until a stable release.

jjfarrell commented 7 years ago

Great! Looking forward to testing out PyHail.

cseed commented 7 years ago

We now have a Getting Started guide for the python API:

https://hail.is/pyhail/getting_started.html

Please give it a spin and let us know if you run into any problems. The documentation for the python API is nearly complete, but the Tutorial and General Reference section are still being ported to python and will need another week or so. Thanks for your patience!

jjfarrell commented 7 years ago

Great! I will test it out on our cluster. First, I have a question about the recommended Spark version. At the very top of Getting Started with the Python API, the document indicates the latest version of Spark 2 should be used. But later, under the "Running on a Spark cluster and in the cloud" section, it indicates only Spark 1.5 and 1.6 are supported. Which version would be the best to use? Or does it depend on whether Hail is run locally or on a cluster?


tpoterba commented 7 years ago

That's a docs bug -- it should definitely say Spark 2 there as well!

cseed commented 7 years ago

Just to clarify, Spark 2 is preferred. We'll be dropping Spark 1 support in a few weeks. Spark 2 has a number of performance improvements and features we want to take advantage of in the coming months.

jjfarrell commented 7 years ago

There is one issue with the hail alias. The alias refers to $SPARK_HOME/python/lib/py4j-0.10.3-src.zip, but the py4j zip file varies from Spark version to Spark version. For example, these are the py4j versions bundled with the Spark installs on our system:

```
/share/pkg/spark/1.2.0/install/python/lib/py4j-0.8.2.1-src.zip
/share/pkg/spark/1.3.1/install/python/lib/py4j-0.8.2.1-src.zip
/share/pkg/spark/1.4.0/install/python/lib/py4j-0.8.2.1-src.zip
/share/pkg/spark/1.5.0/install/python/lib/py4j-0.8.2.1-src.zip
/share/pkg/spark/1.6.0/install/python/lib/py4j-0.9-src.zip
/share/pkg/spark/1.6.1/install/python/lib/py4j-0.9-src.zip
/share/pkg/spark/2.0.0/install/python/lib/py4j-0.10.1-src.zip
/share/pkg/spark/2.1.0/install/python/lib/py4j-0.10.4-src.zip
```

So I got the following error, since I was using Spark 2.1.0, which ships py4j-0.10.4-src.zip rather than the py4j-0.10.3-src.zip hard-coded in the alias.

```
>>> import pyhail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/restricted/projectnb/genpro/github/hail/python/pyhail/__init__.py", line 1, in <module>
    from pyhail.context import HailContext
  File "/restricted/projectnb/genpro/github/hail/python/pyhail/context.py", line 1, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/share/pkg/spark/2.1.0/install/python/pyspark/__init__.py", line 44, in <module>
    from pyspark.context import SparkContext
  File "/share/pkg/spark/2.1.0/install/python/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ImportError: No module named py4j.protocol
```

The following fixes the issue. Essentially, it sets PYJ4 to the py4j zip file found under SPARK_HOME, then uses that to build the PYTHONPATH.

```
PYJ4=$(ls $SPARK_HOME/python/lib/py4j*.zip)
alias hail="PYTHONPATH=$SPARK_HOME/python:$PYJ4:$HAIL_HOME/python SPARK_CLASSPATH=$HAIL_HOME/build/libs/hail-all-spark.jar python"
```
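A slightly more defensive variant of the same idea, as a sketch: `resolve_py4j` is a hypothetical helper (not part of Hail or Spark) that globs for the py4j zip bundled with whatever install SPARK_HOME points at, and fails loudly instead of silently building a bad PYTHONPATH if no zip is found.

```shell
#!/usr/bin/env bash
# Hypothetical helper: locate the py4j zip bundled with a Spark install.
# Takes the Spark install root as $1 and prints the first matching zip.
resolve_py4j() {
  local zips=("$1"/python/lib/py4j-*-src.zip)
  # Without nullglob, an unmatched glob stays literal, so test that the
  # first candidate actually exists on disk.
  if [ ! -e "${zips[0]}" ]; then
    echo "resolve_py4j: no py4j zip under $1/python/lib" >&2
    return 1
  fi
  echo "${zips[0]}"
}
```

Usage would mirror the alias above, e.g. `PYJ4=$(resolve_py4j "$SPARK_HOME")` followed by the same `alias hail=...` line, so a Spark upgrade that bumps the py4j version no longer breaks the alias.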
