projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

configure sparkSession with glow #478

Closed project-defiant closed 2 years ago

project-defiant commented 2 years ago

Hey, I am trying to run tests with pytest for functions that use io.projectglow and Delta Lake from Databricks.

I have created a Spark session object with the following code:

import pyspark, delta, glow, os

builder = (
    pyspark.sql.SparkSession.builder.appName("spark-test-session")
    .master("local[*]")
    # Pull in the Glow package from Maven
    .config("spark.jars.packages", "io.projectglow:glow-spark3_1.2:1.1.0")
    # Enable Delta Lake support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

spark_session_builder = delta.configure_spark_with_delta_pip(builder)
spark_session_with_delta = spark_session_builder.getOrCreate()
spark_session = glow.register(spark_session_with_delta, False)

When I try to run the Python script with the code above, I get the following error:

Traceback (most recent call last):
  File "/home/admin01/projects/glake-func/genomelake_functions/test.py", line 17, in <module>
    spark_session = glow.register(spark_session_with_delta, False)
  File "/home/admin01/projects/glake-func/genomelake_functions/.venv/lib/python3.10/site-packages/glow/glow.py", line 80, in register
    sc, session._jvm.io.projectglow.Glow.register(session._jsparkSession, new_session))
TypeError: 'JavaPackage' object is not callable

It turns out that I cannot call the register method on the SparkSession. Any idea what could resolve this issue? Should I provide the projectglow package in a different way?

My default test environment is a venv with the following packages:

attrs==21.4.0
black==21.12b0
click==8.0.3
coverage==6.2
delta-spark==1.0.0
distlib==0.3.4
filelock==3.4.2
flake8==3.9.2
glow.py==1.1.1
importlib-metadata==4.10.1
iniconfig==1.1.1
mccabe==0.6.1
mypy-extensions==0.4.3
nptyping==1.3.0
numpy==1.22.1
opt-einsum==3.3.0
packaging==21.3
pandas==1.4.0
pathspec==0.9.0
patsy==0.5.2
platformdirs==2.4.1
pluggy==1.0.0
py==1.11.0
py4j==0.10.9
pycodestyle==2.7.0
pyflakes==2.3.1
pyparsing==3.0.7
pyspark==3.1.2
pytest==6.2.5
pytest-cov==2.12.1
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.3
six==1.16.0
statsmodels==0.13.1
toml==0.10.2
tomli==1.2.3
tox==3.24.3
typeguard==2.9.1
typing_extensions==4.0.1
typish==1.9.3
virtualenv==20.13.0
zipp==3.7.0
williambrandler commented 2 years ago

Hey @project-defiant, this error

TypeError: 'JavaPackage' object is not callable

means that the Spark driver and executors cannot find the jars on your classpath. My understanding is that you have to explicitly add spark.driver.extraClassPath and spark.executor.extraClassPath to the JAVA_OPTS to resolve the issue.
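
For example, a minimal sketch of setting those options when building a local session (the jar path here is a placeholder for wherever the Glow assembly jar lives in your environment):

import pyspark

# Placeholder path -- point this at your downloaded Glow assembly jar
glow_jar = "/path/to/glow-spark3-assembly.jar"

spark = (
    pyspark.sql.SparkSession.builder.appName("glow-classpath-example")
    .master("local[*]")
    # Put the jar on the classpath of both the driver and the executors
    .config("spark.driver.extraClassPath", glow_jar)
    .config("spark.executor.extraClassPath", glow_jar)
    .getOrCreate()
)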

This is what we did for the Glow Docker container on Databricks:

see https://github.com/projectglow/glow/blob/master/docker/databricks/dbr/dbr9.1/genomics-with-glow/Dockerfile#L36

You can also refer to this thread: https://github.com/JohnSnowLabs/spark-nlp/issues/232#issuecomment-458888900

What environment are you installing Glow in? Please share more details.

project-defiant commented 2 years ago

Hey @williambrandler, many thanks for the feedback. Since I am setting up my environment within a Python module, with a local Spark session created in that module, I was able to solve the above issue using the approach from JohnSnowLabs/spark-nlp#232 (comment). It turned out that I needed to provide the jar file explicitly in the end.
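
For anyone who hits the same error, a sketch of what the working setup amounts to, assuming a locally downloaded Glow assembly jar (the path is a placeholder, and version coordinates will vary with your environment):

import delta
import glow
import pyspark

# Placeholder path to a locally downloaded Glow assembly jar
glow_jar = "/path/to/glow-spark3-assembly.jar"

builder = (
    pyspark.sql.SparkSession.builder.appName("spark-test-session")
    .master("local[*]")
    # Provide the jar file explicitly so the JVM can find the Glow classes
    .config("spark.jars", glow_jar)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip adds the Delta Lake jars to the builder
spark = delta.configure_spark_with_delta_pip(builder).getOrCreate()
spark = glow.register(spark, new_session=False)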

Many thanks yet again for resolving this issue. Keep up the great work!