jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

py4j.protocol.Py4JError: org.jpmml.sparkml.PMMLBuilder does not exist in the JVM #38

Closed. CarlaFernandez closed this issue 2 years ago.

CarlaFernandez commented 2 years ago

Hello @vruusmann, first of all I'd like to say that I've checked issue #13, but I don't think it's the same problem.

I've created a virtual environment and installed pyspark and pyspark2pmml using pip. In this virtual environment, I've pasted the JPMML-SparkML JAR (org.jpmml:pmml-sparkml:2.2.0, for Spark version 3.2.2) into Lib/site-packages/pyspark/jars.

When I instantiate a PMMLBuilder object, I get the error in the title. Here is an MWE that throws the error:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark_test")
    .master("local[*]")
    # It doesn't matter if I add this configuration or not; I still get the error
    .config("spark.jars.packages", "org.jpmml:pmml-sparkml:2.2.0")
    .getOrCreate()
)
# The lookup below raises the Py4JError from the title
javaPmmlBuilderClass = spark.sparkContext._jvm.org.jpmml.sparkml.PMMLBuilder

Any idea what I might be missing from my environment to make it work? Thank you.

vruusmann commented 2 years ago

> Any idea what I might be missing from my environment to make it work?

Does it work when you launch PySpark from the command line and specify the --packages command-line option?
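For example, something like this (assuming the same Maven coordinate as in your MWE):

pyspark --packages org.jpmml:pmml-sparkml:2.2.0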

I have zero working experience with virtual environments, and I've never installed any JAR files manually into the site-packages/pyspark/jars/ directory.

If I were facing a similar problem, I'd start by checking the PySpark/Apache Spark log file. There must be some information about which packages are detected, and which of them are successfully "initialized" and which are not (possibly with an error reason).
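You can also probe from the Python side whether the JVM class actually resolved. This is a minimal sketch of the kind of check pyspark2pmml performs internally, assuming spark is an active SparkSession:

from py4j.java_gateway import JavaClass

# py4j returns a JavaClass when org.jpmml.sparkml.PMMLBuilder is on the
# JVM classpath, and a generic JavaPackage placeholder when it is not
javaPmmlBuilderClass = spark.sparkContext._jvm.org.jpmml.sparkml.PMMLBuilder
if not isinstance(javaPmmlBuilderClass, JavaClass):
    raise RuntimeError("JPMML-SparkML not found on classpath")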

CarlaFernandez commented 2 years ago

Thanks for the quick response. Indeed, looking at the detected packages in the log is what helped me.

I started the environment from scratch, removed the jar I had manually installed, and started the session in the MWE without the spark.jars.packages config. It threw a RuntimeError: JPMML-SparkML not found on classpath.

Then, I added the spark.jars.packages line and it worked! So it seems like the problem was caused by adding the jar manually.
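For reference, the working setup looks roughly like this end to end. The dataset path and pipeline stages below are placeholders (modeled on the pyspark2pmml README example), not my actual pipeline:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula
from pyspark2pmml import PMMLBuilder

spark = (
    SparkSession.builder.appName("spark_test")
    .master("local[*]")
    # Lets Spark resolve the JPMML-SparkML JAR from Maven at startup
    .config("spark.jars.packages", "org.jpmml:pmml-sparkml:2.2.0")
    .getOrCreate()
)

df = spark.read.csv("Iris.csv", header=True, inferSchema=True)  # placeholder dataset

pipeline = Pipeline(stages=[
    RFormula(formula="Species ~ ."),
    LogisticRegression()
])
pipelineModel = pipeline.fit(df)

# pyspark2pmml's PMMLBuilder delegates to org.jpmml.sparkml.PMMLBuilder on the JVM side
PMMLBuilder(spark.sparkContext, df, pipelineModel).buildFile("model.pmml")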

I hadn't detected this before because my real configuration was more complex and uses delta-spark. Apparently, when using delta-spark, the packages were not being downloaded from Maven, and that's what caused the original error (see the sketch below).
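In case it helps others: with delta-spark, the usual pattern is to let configure_spark_with_delta_pip() set spark.jars.packages and pass any extra Maven coordinates through its extra_packages argument, since it overwrites that property with the delta-spark coordinate. A sketch, assuming a delta-spark version that supports extra_packages:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("spark_test")
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Extra packages must be passed here rather than set via .config() above,
# because configure_spark_with_delta_pip() rewrites spark.jars.packages
spark = configure_spark_with_delta_pip(
    builder, extra_packages=["org.jpmml:pmml-sparkml:2.2.0"]
).getOrCreate()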

Thanks!

Tangjiandd commented 2 years ago


Hello, has this problem been solved?