Closed: andreapiso closed this issue 2 years ago
Looks like the JAR was not correctly loaded. Did you install RayDP and PySpark through pip? If not, can you please share how you set up the environment?
Thanks for getting back so quickly! Yes, I installed both with pip.
My environment is a cluster with a pre-installed Spark-on-K8s setup. However, that Spark is 2.4, so I am instead using pip to install PySpark 3.2 and the RayDP nightly in my conda environment, and pointing the Spark conf to a different folder so that I can use my local Spark instead.
These are the variables I am setting:
SPARK_HOME: pointing to .local/lib/python3.9/site-packages/pyspark/, which I installed with pip
PYSPARK_DRIVER_PYTHON=python3
PYSPARK_PYTHON=python3  # so that both point to the same Python 3.9 where I installed PySpark and RayDP
SPARK_CONF_DIR=/home/ray_spark/conf/  # my custom Spark conf dir containing spark-defaults.conf and spark-env.sh, where I can add all the Hadoop environment variables and Kerberos authentication settings
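In notebook terms this boils down to something like the sketch below; the paths are the ones mentioned above and would need adjusting for another environment, and the variables have to be set before anything Spark-related is imported:

import os

# Paths taken from the setup described above; adjust for your own environment.
os.environ["SPARK_HOME"] = os.path.expanduser("~/.local/lib/python3.9/site-packages/pyspark")
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["SPARK_CONF_DIR"] = "/home/ray_spark/conf/"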
The connection to Hive works perfectly, and I can extract the data. The issue happens when I try to convert to a Ray dataset.
Do these JARs need to be loaded manually? (If so, which JARs should I load?) I tried to print the CLASSPATH in the notebook after seeing the error and it is empty (not sure whether the Spark process sees something different...).
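If it helps, one way to check what the driver JVM (rather than the notebook shell) actually has on its classpath is through py4j; a quick sketch, assuming a SparkSession named spark already exists:

# Sketch: ask the driver JVM for its classpath via py4j (assumes an existing SparkSession `spark`)
jvm_classpath = spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")
print(jvm_classpath)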
Hi - I tried to load the RayDP JAR manually in spark-defaults.conf:
spark.driver.extraClassPath=/home/.local/lib/python3.9/site-packages/raydp/jars/raydp-0.5.0-SNAPSHOT.jar
spark.executor.extraClassPath=/home/.local/lib/python3.9/site-packages/raydp/jars/raydp-0.5.0-SNAPSHOT.jar
Now the error is different:
Py4JError: org.apache.spark.sql.raydp.ObjectStoreWriter does not exist in the JVM
Is there a list of JARs that need to be included?
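For what it's worth, the jars RayDP ships can be listed from the installed package itself; a sketch that just globs the same jars directory used in the paths above:

import glob
import os
import raydp

# List the jars bundled with the installed raydp package
jars_dir = os.path.join(os.path.dirname(raydp.__file__), "jars")
print(glob.glob(os.path.join(jars_dir, "*.jar")))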
Looks like we solved the problem! The issue was that I was calling SparkContext() once before setting up RayDP, and apparently that messes things up.
Glad you solved it! Do you mean you created a SparkContext before using raydp to create it?
Yes - if you write something like:
from pyspark import SparkContext
SparkContext()  # creating a SparkContext yourself here is what breaks things

import ray
import raydp

ray.init()
spark = raydp.init_spark(...)  # Spark itself still works, but the Ray integration does not
then things work as long as you stay in the Spark world, but they break when you try to interact with Ray.
PS: I had no real need for that SparkContext; I was just using it to make sure Spark was running properly (i.e. my local Spark rather than the Kubernetes one in the cluster).
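For anyone landing here later, a minimal sketch of the working order described above, letting RayDP create the SparkSession instead of creating a SparkContext yourself (the init_spark arguments are placeholder values):

import ray
import raydp

ray.init()
# No manual SparkContext(); RayDP builds the Spark session on top of Ray.
spark = raydp.init_spark(
    app_name="raydp_example",    # placeholder values
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)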
Hi, I am on the RayDP nightly (as I wanted to query Hive).
I am not able to convert RayDP Spark DataFrames to Ray datasets. I get this error even for simple ones.
for example:
Produces:
Do I have something wrong in my Spark configuration? Other conversions, like pandas-on-Spark, work fine.
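For reference, the kind of conversion being attempted here looks roughly like the sketch below (assuming ray.data.from_spark, which relies on RayDP; the DataFrame is just a stand-in):

import ray
import raydp

ray.init()
spark = raydp.init_spark("convert_example", num_executors=1, executor_cores=1, executor_memory="1GB")
df = spark.range(100)           # stand-in Spark DataFrame
ds = ray.data.from_spark(df)    # Spark DataFrame -> Ray Dataset conversion that fails in this issue
print(ds.take(5))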