h2oai / sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster
https://docs.h2o.ai/sparkling-water/3.3/latest-stable/doc/index.html
Apache License 2.0

AWS Glue Jobs with: {"error":"TypeError: 'JavaPackage' object is not callable","errorType":"EXECUTION_FAILURE"} #2766

Closed maxreis86 closed 2 years ago

maxreis86 commented 2 years ago

Hello everybody,

I am trying to use pysparkling.ml.H2OMOJOModel (h2o-pysparkling-3.1==3.36.0.4.post1) to score a Spark dataframe with a MOJO model trained with h2o==3.32.0.2 in an AWS Glue job, however I get the error: TypeError: 'JavaPackage' object is not callable.

I opened a ticket with AWS support and they confirmed that the Glue environment is fine and that the problem is probably with sparkling-water (pysparkling). It seems some dependency library is missing, but I have no idea which one. The simple code below works perfectly if I run it on my local computer (I only need to change the MOJO path for GBM_grid__1_AutoML_20220323_233606_model_53.zip). The error occurs with any MOJO .zip file.
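For context on the error message itself: Py4J raises this TypeError whenever the Java class behind a Python wrapper cannot be found on the JVM classpath, because the lookup then resolves to a generic JavaPackage object instead of a class. A minimal pure-Python stand-in (no Spark or JVM involved; the class below only mimics py4j.java_gateway.JavaPackage) shows where the message comes from:

```python
# Stand-in illustration of why this exact TypeError appears when the
# Sparkling Water assembly jar is missing from the JVM classpath.
class JavaPackage:
    """Minimal mimic of py4j.java_gateway.JavaPackage: not callable."""
    def __init__(self, name):
        self.name = name

# Py4J returns an object like this when the class lookup fails.
pkg = JavaPackage("ai.h2o.sparkling.ml.models.H2OMOJOModel")

try:
    pkg()  # what the Python wrapper effectively attempts to do
except TypeError as err:
    print(err)  # 'JavaPackage' object is not callable
```

So the TypeError does not mean the Python package is broken; it means the corresponding Java classes were never loaded into the Spark JVM.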

Has anyone managed to run sparkling-water in Glue jobs successfully?

Job Details:

  - Glue version: 3.0
  - --additional-python-modules: h2o-pysparkling-3.1==3.36.0.4.post1
  - Worker type: G1.X
  - Number of workers: 2
  - Script: "createFromMojo.py"

createFromMojo.py:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd
from pysparkling.ml import H2OMOJOSettings
from pysparkling.ml import H2OMOJOModel

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

#Job setup
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args["JOB_NAME"], args)

caminho_modelo_mojo='s3://prod-lakehouse-stream/modeling/approaches/GBM_grid__1_AutoML_20220323_233606_model_53.zip'
print(caminho_modelo_mojo)
print(dir())

settings = H2OMOJOSettings(convertUnknownCategoricalLevelsToNa = True, convertInvalidNumbersToNa = True)
model = H2OMOJOModel.createFromMojo(caminho_modelo_mojo, settings)

data = {'days_since_last_application': [3, 2, 1, 0], 'job_area': ['a', 'b', 'c', 'd']}

base_escorada = model.transform(spark.createDataFrame(pd.DataFrame.from_dict(data)))

print(base_escorada.printSchema())

print(base_escorada.show())

job.commit()
krasinski commented 2 years ago

Hello @maxreis86,

Thank you for reporting this. I was able to reproduce and analyse your issue. The cause is a somewhat complicated classpath problem which will need more time to resolve.

In the meantime I would suggest a workaround:

  1. Download the sparkling water distribution zip, exactly the same version you are installing with pip.
  2. In the jars folder you will find a jar named (depending on the version) like this: sparkling-water-assembly-scoring_2.12-3.36.0.4-1-3.1-all.jar
  3. Upload it to your S3 bucket
  4. Add the S3 URL of that jar to your Glue job - in the UI this is the Dependent JARs path option
  5. Leave the additional-python-modules parameter as is
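For anyone configuring the job by script rather than through the console, the steps above correspond (as far as I can tell) to the job's default arguments, where the console's "Dependent JARs path" field maps to the --extra-jars parameter. A hedged sketch of the dict you would pass as DefaultArguments to boto3's glue.create_job / update_job; the bucket name and jar path are placeholders:

```python
# Sketch of Glue DefaultArguments for the workaround above.
# Assumptions: "--extra-jars" backs the console's "Dependent JARs path"
# field; "s3://bucket_name/..." is a placeholder, not a real bucket.
default_arguments = {
    # step 5: keep installing pysparkling from pip as before
    "--additional-python-modules": "h2o-pysparkling-3.1==3.36.0.4.post1",
    # steps 1-4: the scoring assembly jar uploaded to your own S3 bucket
    "--extra-jars": (
        "s3://bucket_name/"
        "sparkling-water-assembly-scoring_2.12-3.36.0.4-1-3.1-all.jar"
    ),
}

for key, value in default_arguments.items():
    print(key, "=", value)
```

The important point is that pip only delivers the Python wrapper; the assembly jar must reach the JVM classpath separately, which is what --extra-jars does.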
maxreis86 commented 2 years ago

Hi dear @krasinski,

I was able to run it successfully following your steps!

  1. Downloaded sparkling water distribution zip: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/3.36.1.1-1-3.1/index.html
  2. Dependent JARs path: s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar
  3. --additional-python-modules, h2o-pysparkling-3.1==3.36.1.1-1-3.1

Thank you so much for your help!!