projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Databricks TypeError: 'JavaPackage' object is not callable #456

Closed helenxl closed 8 months ago

helenxl commented 2 years ago

I am running an example notebook from Databricks. I have installed glow version 1.1.1 for this cluster. I am encountering an error with glow.register(spark).

What am I missing?

import glow

import json
import numpy as np
import pandas as pd
import pyspark.sql.functions as fx

spark = glow.register(spark)
TypeError: 'JavaPackage' object is not callable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-3963649488507776> in <module>
      6 import pyspark.sql.functions as fx
      7 
----> 8 spark = glow.register(spark)

/databricks/python/lib/python3.8/site-packages/glow/glow.py in register(session, new_session)
     78     sc = session._sc
     79     return SparkSession(
---> 80         sc, session._jvm.io.projectglow.Glow.register(session._jsparkSession, new_session))
     81 
     82 

TypeError: 'JavaPackage' object is not callable

williambrandler commented 2 years ago

Hey @helenxl, this error means Python cannot find the Glow JARs.

Did you install just the PyPI package? Glow also requires the JARs, which are installed from Maven coordinates. In this case, that is io.projectglow:glow-spark3_2.12:1.1.1

What environment are you doing this in? Is it Databricks, another Spark service, or your own Spark deployment?
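For context on the error text: py4j hands back a `JavaPackage` placeholder for any dotted name it cannot resolve to a class on the JVM classpath, so `io.projectglow.Glow` only becomes callable once the Glow JAR is attached. Below is a minimal diagnostic sketch, assuming an active `spark` session. The helper name `glow_jar_visible` is made up for illustration and is not part of Glow.

```python
def glow_jar_visible(spark):
    """Return True if the JVM behind `spark` can resolve io.projectglow.Glow.

    py4j yields a JavaPackage placeholder for names it cannot resolve to a
    class, which is what makes glow.register fail with
    "'JavaPackage' object is not callable".
    """
    glow_entry = spark._jvm.io.projectglow.Glow
    # Compare by type name so this helper does not need to import py4j itself.
    return type(glow_entry).__name__ != "JavaPackage"


# Usage in a notebook where `spark` already exists:
# if not glow_jar_visible(spark):
#     print("Glow JAR missing: attach io.projectglow:glow-spark3_2.12:1.1.1")
```

If the check returns False, the Python package is installed but the JAR is not on the cluster's classpath.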

helenxl commented 2 years ago

Thanks! I missed that requirement.

williambrandler commented 2 years ago

No sweat, I also forgot the first time I installed Glow via PyPI and Maven.

williambrandler commented 2 years ago

We also have Docker containers that include all the JARs and the PyPI package.

https://hub.docker.com/u/projectglow

On Databricks, you can install Glow via Databricks Container Services. For Glow v1.1.1, you would point to this Docker image URL:

projectglow/databricks-glow:1.1.1

wjiangal commented 2 years ago

Hi! May I ask what I should do in a Jupyter notebook? I ran into a similar problem:

import findspark
import pyspark
import glow
from pyspark.sql import SparkSession
findspark.init()
spark = SparkSession.builder.getOrCreate()
spark = glow.register(spark)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-079acb31ab5e> in <module>
      1 import glow
----> 2 spark = glow.register(spark)

C:\anaconda\lib\site-packages\glow\glow.py in register(session, new_session)
     78     sc = session._sc
     79     return SparkSession(
---> 80         sc, session._jvm.io.projectglow.Glow.register(session._jsparkSession, new_session))
     81 
     82 

TypeError: 'JavaPackage' object is not callable

helenxl commented 2 years ago

@williambrandler When I specify projectglow/databricks-glow:1.1.1, the Databricks cluster fails to pull the image. I can pull the image with the Docker CLI without problems. Do you know what may be missing? Thank you.

Cluster terminated. Reason: Docker image pull failure

Cannot launch the cluster because pulling the docker image failed. Please double check connectivity from workers to the container registry, as well as the credentials used to pull the image.

Internal error message: Container setup failed due to a docker image pull failure: Image doesn't exist or invalid credential to pull image from projectglow/databricks-glow:1.1.1  .
Stdout: 
Stderr: time="2021-12-02T16:43:57Z" level=fatal msg="Error parsing image name \"docker://projectglow/databricks-glow:1.1.1  \": invalid reference format"
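
One detail in the log above may be relevant: the image name inside the escaped quotes ends with two spaces, and the parser reports `invalid reference format`. The thread does not confirm the cause, but stray whitespace pasted into the Docker Image URL field would be consistent with an image that pulls fine from the Docker CLI. Here is a small sketch of a check. `find_stray_whitespace` is a hypothetical helper, not a Databricks or Docker API.

```python
def find_stray_whitespace(image_ref):
    """Return (cleaned_ref, had_stray_whitespace) for a Docker image reference."""
    cleaned = image_ref.strip()
    return cleaned, cleaned != image_ref


# The reference exactly as it appears in the error message, trailing spaces included.
raw = "projectglow/databricks-glow:1.1.1  "
cleaned, had_whitespace = find_stray_whitespace(raw)
print(repr(cleaned), had_whitespace)
```

If the flag comes back true, re-entering the cleaned value in the cluster configuration would be a cheap first thing to try.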

Tabinda788 commented 2 years ago

@helenxl Did you get the issue resolved?

williambrandler commented 2 years ago

Sorry, I missed this. Please share more information (such as a screenshot of the cluster setup), @helenxl @Tabinda788

helenxl commented 2 years ago

Yes, please go ahead and close this issue. I was able to use Glow in Databricks.

Tabinda788 commented 2 years ago

@helenxl Can we make it work locally?

williambrandler commented 2 years ago

@Tabinda788 would Docker work for you? @edg1983 contributed a Dockerfile for running Glow outside of Databricks. We have put the image on the projectglow Docker Hub, and it can be run locally via Docker.

https://github.com/projectglow/glow/issues/494
https://github.com/projectglow/glow/pull/503
https://hub.docker.com/r/projectglow/open-source-glow

henrydavidge commented 8 months ago

@Tabinda788 The fix is the same locally -- you need to install the maven library.
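
For a local or Jupyter session, one way to install the Maven library is to let Spark resolve it at startup through the `spark.jars.packages` setting. The sketch below assumes Glow 1.1.1 on Spark 3 with Scala 2.12 (the coordinate given earlier in the thread) and that `pyspark` and the Glow PyPI package are already installed. The helper names are illustrative only.

```python
def glow_maven_coordinate(glow_version="1.1.1", scala_version="2.12"):
    """Build the Maven coordinate for the Glow Spark 3 artifact."""
    return f"io.projectglow:glow-spark3_{scala_version}:{glow_version}"


def create_glow_session(glow_version="1.1.1"):
    """Start a Spark session with the Glow JAR attached, then register Glow."""
    # Imported here so the coordinate helper works without Spark installed.
    from pyspark.sql import SparkSession
    import glow

    spark = (
        SparkSession.builder
        # The setting is read when the JVM starts, so it has no effect on a
        # session that is already running; restart the kernel first if needed.
        .config("spark.jars.packages", glow_maven_coordinate(glow_version))
        .getOrCreate()
    )
    return glow.register(spark)


# Usage, from a fresh kernel:
# spark = create_glow_session()
```

Keeping the JAR version the same as the PyPI package version, as in the thread, is probably the safest choice.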