projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Load glow mave package from Python #521

Closed dridk closed 1 year ago

dridk commented 1 year ago

Hi,

I am new to PySpark and I would like to try Glow by loading a VCF file. I can use PySpark without starting Spark manually from the command line; it seems to start automatically from Python. As far as I know, all the Spark binaries come with the pyspark pip installation.

So, if I run the following code :

from pyspark.sql import SparkSession
import glow 
spark = SparkSession.builder.appName("test").getOrCreate()
spark = glow.register(spark)

It returns the following error :

In [5]: spark = glow.register(spark)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 spark = glow.register(spark)

File ~/.local/lib/python3.10/site-packages/glow/glow.py:80, in register(session, new_session)
     77 assert check_argument_types()
     78 sc = session._sc
     79 return SparkSession(
---> 80     sc, session._jvm.io.projectglow.Glow.register(session._jsparkSession, new_session))

TypeError: 'JavaPackage' object is not callable

I suppose I have to start Spark with the following options, but I have no idea how to add these package dependencies to a SparkSession from Python code:

--packages io.projectglow:glow-spark3_2.12:1.2.1 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec
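(For reference, one way to pass these options without touching a shell is to set the `PYSPARK_SUBMIT_ARGS` environment variable before `pyspark` is imported. This is a minimal sketch; the trailing `pyspark-shell` token is required by PySpark:)

```python
import os

# Must run *before* `import pyspark`: PySpark reads PYSPARK_SUBMIT_ARGS
# when it launches the JVM, so setting it afterwards has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.projectglow:glow-spark3_2.12:1.2.1 "
    "--conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec "
    "pyspark-shell"
)
```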

williambrandler commented 1 year ago

are you running locally or in the cloud?

Try starting the Spark shell with the Glow Maven package:

https://glow.readthedocs.io/en/latest/getting-started.html#running-locally

dridk commented 1 year ago

Locally. I cannot run the following command because I have no access to a shell; I only have a Jupyter notebook.

./bin/pyspark --packages io.projectglow:glow-spark3_2.12:1.2.1 --conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec

An example of Glow running on Google Colab would be helpful.

williambrandler commented 1 year ago

Can you execute terminal commands from your Jupyter notebook? https://stackoverflow.com/questions/38694081/executing-terminal-commands-in-jupyter-notebook
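(If shell access is the only blocker: in a Jupyter cell, prefixing a line with `!` runs it in the shell, and the same effect is available portably via `subprocess`. A minimal sketch, using the Python interpreter's own version flag as a stand-in command:)

```python
import subprocess
import sys

# In a notebook cell, `!python --version` would do the same thing;
# subprocess.run works in any Python environment, notebook or not.
result = subprocess.run(
    [sys.executable, "--version"], capture_output=True, text=True
)
print(result.stdout or result.stderr)
```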

Hoeze commented 1 year ago

@dridk You can use this snippet:

from pyspark.sql import SparkSession
import glow

spark = (
    SparkSession.builder
    .appName('your_app')
    # Fetch the Glow artifact from Maven at session startup instead of
    # passing --packages on the command line
    .config("spark.jars.packages", ",".join([
        "io.projectglow:glow-spark3_2.12:1.2.1",
    ]))
    # .config("spark.local.dir", os.environ.get("TEMPDIR"))
    # .config('spark.sql.caseSensitive', "true")
    .config("spark.hadoop.io.compression.codecs", "io.projectglow.sql.util.BGZFCodec")
    .getOrCreate()
)
spark = glow.register(spark)
spark

Make sure your PySpark version is v3.2.x in this case :)
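(A quick way to confirm the installed PySpark version from inside a notebook, without a shell, is the standard library's `importlib.metadata` — a small sketch, assuming only that pyspark was installed via pip:)

```python
from importlib import metadata

# glow-spark3_2.12:1.2.1 targets Spark 3.2.x, so check the installed
# PySpark version before creating the session.
try:
    version = metadata.version("pyspark")
except metadata.PackageNotFoundError:
    version = None
print(version or "pyspark is not installed")
```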

dridk commented 1 year ago

Thank you ! I will try and let you know if it works !