oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0

enable_hive not working with raydp #347

Closed raviranak closed 10 months ago

raviranak commented 1 year ago

Configuration for running Spark in RayDP:

```python
default_spark_conf = {
    "spark.jars.packages": "mysql:mysql-connector-java:8.0.32",
    "spark.jars": "/home/ray/.ivy2/jars/com.mysql_mysql-connector-j-8.0.32.jar",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "test",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "",
    "spark.sql.catalog.spark_catalog.type": "hive",
    "spark.sql.catalogImplementation": "hive",
}

spark = raydp.init_spark(
    app_name="Darwin_SPARK",
    num_executors=1,
    executor_cores=1,
    executor_memory="4G",
    enable_hive=True,
    configs=default_spark_conf,
)
```

Getting an error when trying to create a table like this:

```python
df = spark.createDataFrame(
    [(1, "Smith"), (2, "Rose"), (3, "Williams")],
    ("id", "name"),
)

df.write.mode("overwrite").saveAsTable("employees12")
```

Stack Trace

```
2023-05-22 10:45:09,515 WARN HiveMetaStore [Thread-5]: Retrying creating default database after error: Unexpected exception caught.
javax.jdo.JDOFatalInternalException: Unexpected exception caught.
	at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1203) ~[javax.jdo-3.2.0-m3.jar:?]
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:814) ~[javax.jdo-3.2.0-m3.jar:?]
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:702) ~[javax.jdo-3.2.0-m3.jar:?]
	at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:521) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:550) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:405) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:79) ~[hadoop-client-api-3.3.2.jar:?]
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:139) ~[hadoop-client-api-3.3.2.jar:?]
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStoreForConf(HiveMetaStore.java:628) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:594) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:588) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:655) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431) ~[hive-metastore-2.3.9.jar:2.3.9]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_362]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_362]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_362]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362]
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6902) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:162) ~[hive-metastore-2.3.9.jar:2.3.9]
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70) ~[hive-exec-2.3.9-core.jar:2.3.9]

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.datanucleus.util.NucleusLogger
	at org.datanucleus.plugin.PluginRegistryFactory.newPluginRegistry(PluginRegistryFactory.java:58)
	at org.datanucleus.plugin.PluginManager.<init>(PluginManager.java:60)
	at org.datanucleus.plugin.PluginManager.createPluginManager(PluginManager.java:430)
	at org.datanucleus.AbstractNucleusContext.<init>(AbstractNucleusContext.java:85)
	at org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:167)
	at org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:156)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:415)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:304)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:213)
	... 80 more
```

raviranak commented 1 year ago

I did the same with a PySpark SparkSession and it works:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Spark Examples")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "test")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "")
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .config("spark.jars", "/home/ray/.ivy2/jars/com.mysql_mysql-connector-j-8.0.32.jar")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.32")
    .enableHiveSupport()
    .getOrCreate()
)
```

kira-lin commented 1 year ago

Hi @raviranak, I think what you set in `spark.jars` and `spark.jars.packages` should be the same package, right? That should be able to work around #247.
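
For illustration, a hypothetical sketch of an aligned pair, reusing the connector coordinates from this issue (the cached jar filename is an assumption about the local Ivy layout, not taken from the report):

```python
# Hypothetical sketch: both settings refer to the same artifact. The original
# report mixed mysql-connector-java (in spark.jars.packages) with a
# mysql-connector-j jar file (in spark.jars), which are different artifacts.
aligned_conf = {
    "spark.jars.packages": "mysql:mysql-connector-java:8.0.32",
    # Assumed Ivy cache path for the same artifact; adjust to your machine.
    "spark.jars": "/home/ray/.ivy2/jars/mysql_mysql-connector-java-8.0.32.jar",
}
```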

It seems like org.datanucleus.util.NucleusLogger is not present on the executors' classpath. Where is this jar? Is it a dependency of Hive?

kira-lin commented 1 year ago

Can you find where the jar is? Maybe you can try setting it in `raydp.executor.extraClassPath`.
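
A minimal sketch of that suggestion, assuming the datanucleus-core jar path reported later in this thread (adjust to wherever the jar actually lives):

```python
# Sketch, not a verified fix: forward the datanucleus-core jar to the
# executor JVMs via RayDP's raydp.executor.extraClassPath config key.
import raydp

spark = raydp.init_spark(
    app_name="Darwin_SPARK",
    num_executors=1,
    executor_cores=1,
    executor_memory="4G",
    enable_hive=True,
    configs={
        # Jar path assumed from the local Ivy cache.
        "raydp.executor.extraClassPath": "/home/ray/.ivy2/jars/org.datanucleus_datanucleus-core-5.0.9.jar",
    },
)
```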

raviranak commented 1 year ago

This is happening only with the Spark 3.3.1 build of RayDP.

raviranak commented 1 year ago

Added this to the configs during raydp.init_spark: `"raydp.executor.extraClassPath": "/home/ray/.ivy2/jars/org.datanucleus_datanucleus-core-5.0.9.jar"`

But in the log I see: `Warning: Ignoring non-Spark config property: raydp.executor.extraClassPath`

kira-lin commented 1 year ago

This is happening only with the Spark 3.3.1 build of RayDP.

Do you mean you only tested this version? Or that it works in other versions of Spark?

kira-lin commented 1 year ago

But in the log I see: `Warning: Ignoring non-Spark config property: raydp.executor.extraClassPath`

Yes, you'll see such a log message, but the config should still be applied to the executors by our code. Did it work?

raviranak commented 1 year ago

This works only on Ray runtime 2.0.0 but not on 2.3.0, as 2.3.0 uses Spark 3.3.1. Could you please check?

raviranak commented 1 year ago

RayDP with Spark 3.3.1 is not working.

raviranak commented 1 year ago

It seems Spark initialization differs between RayDP and the native SparkSession. Could you please update this issue with a resolution? It looks like some jars are missing when using RayDP that are not missing with plain PySpark.

kira-lin commented 1 year ago

When you are testing, are you using the same Spark installation for RayDP? We add Spark's jars directory to RayDP's classpath. What jars did you find missing in RayDP?

This works only on Ray runtime 2.0.0 but not on 2.3.0, as 2.3.0 uses Spark 3.3.1. Could you please check?

You should be able to use other Spark versions as well, regardless of the Ray 2.3.0 runtime.

raviranak commented 1 year ago

Yes, using the same Spark version, i.e. 3.3.1. Could you test RayDP with Spark 3.3.1 and enable_hive?

raviranak commented 1 year ago

`Could not initialize class org.datanucleus.util.NucleusLogger`: could you check whether RayDP bundles any jar that could cause a different datanucleus-core version to be picked up when using RayDP? This seems to be the only probable cause. @kira-lin
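
A rough diagnostic sketch for that: list the datanucleus entries the driver JVM actually loaded (this uses PySpark's internal py4j gateway, so treat it as a debugging aid only; `spark` is the session from the repro):

```python
# Print every datanucleus entry on the driver JVM's classpath so duplicate
# or conflicting versions stand out. _jvm is a PySpark internal.
classpath = spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path")
for entry in classpath.split(":"):
    if "datanucleus" in entry:
        print(entry)
```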

kira-lin commented 1 year ago

Could you please provide instructions to reproduce this issue? How should I configure Hive and MySQL?

pang-wu commented 1 year ago

@raviranak This could be an issue with using spark.jars.packages and spark.jars on RayDP. Did you try placing all the jars and their transitive dependencies under the jars dir in your Spark installation (sketched below)? This thread might help: https://github.com/oap-project/raydp/issues/247

@kira-lin @carsonwang If that is the case, we might want to take a look at fixing spark.jars.packages and spark.jars on RayDP.
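
A minimal sketch of the workaround suggested above, assuming a pip-installed PySpark and jars resolved into the default Ivy cache (both paths are assumptions):

```python
# Copy resolved jars (including transitive dependencies) from the local
# Ivy cache into Spark's own jars directory, so executors pick them up
# without relying on spark.jars.packages at runtime.
import glob
import os
import shutil

import pyspark

spark_jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in glob.glob(os.path.expanduser("~/.ivy2/jars/*.jar")):
    shutil.copy(jar, spark_jars_dir)
```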

raviranak commented 1 year ago

@kira-lin this can be easily reproduced even without a data source for Hive, like this:

```python
import raydp

spark = raydp.init_spark(
    app_name="Darwin_SPARK",
    num_executors=1,
    executor_cores=1,
    executor_memory="4G",
    enable_hive=True,
    configs=None,
)

spark.sql("show tables")
```

This doesn't work with Ray runtime 2.3.0 (it gives the same exception trace), but works on Ray 2.0.0.

@pang-wu I have set the jars appropriately

kira-lin commented 1 year ago

I tried your script with Ray 2.3.0, Spark 3.3.1, and RayDP 1.6.0b20230527.dev0, and got the following:

```
>>> import raydp
>>> spark = raydp.init_spark(
... app_name="Darwin_SPARK",
... num_executors=1,
... executor_cores=1,
... executor_memory='4G',
... enable_hive = True,
... configs=None)
ERROR StatusLogger Reconfiguration failed: No configuration found for '18b4aac2' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for 'Default' at 'null' in 'null'
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/06 11:25:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>
>>> spark.sql("show tables")
2023-06-06 11:25:45,764 Thread-4 ERROR Reconfiguration failed: No configuration found for '1ff4c785' at 'null' in 'null'
DataFrame[namespace: string, tableName: string, isTemporary: boolean]
```

Is this the same behavior as yours? @raviranak

Deegue commented 1 year ago

@kira-lin this can be easily reproduced even without a data source for Hive, like this:

```python
import raydp

spark = raydp.init_spark(
    app_name="Darwin_SPARK",
    num_executors=1,
    executor_cores=1,
    executor_memory="4G",
    enable_hive=True,
    configs=None,
)

spark.sql("show tables")
```

This doesn't work with Ray runtime 2.3.0 (it gives the same exception trace), but works on Ray 2.0.0.

@pang-wu I have set the jars appropriately

I also tried the latest Ray 3.0.0.dev, RayDP 1.6.0.dev and Spark 3.3.2. It worked well.

Can you share your RayDP version?

raviranak commented 1 year ago

RayDP version 1.5.0

kira-lin commented 1 year ago

Can you try the RayDP nightly and see if it fixes your problem? You can install it with `pip install -U --pre raydp`.

raviranak commented 1 year ago

It's working on 1.6.0.dev. Can you please share the release timeline for RayDP 1.6.0?

carsonwang commented 1 year ago

It's working on 1.6.0.dev. Can you please share the release timeline for RayDP 1.6.0?

@raviranak 1.6.0 will be released by the end of this month.

raviranak commented 1 year ago

thanks @carsonwang @kira-lin

raviranak commented 11 months ago

This is still occurring in RayDP 1.6.0.