oap-project / raydp

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Apache License 2.0
308 stars 68 forks

Wanted to configure spark to use mysql as metastore instead of derby hive metastore which is default #346

Closed raviranak closed 1 year ago

raviranak commented 1 year ago

I want to configure Spark to use MySQL as the metastore instead of the default Derby Hive metastore.
I am not able to find a hive-site.xml in the Spark installation that RayDP uses.

kira-lin commented 1 year ago

Hi @raviranak, what would you do if it were vanilla Spark? You can try the same thing here. I don't think a vanilla Spark conf directory contains a hive-site.xml either. Are you using pyspark installed via pip, or a binary Spark install? Have you tried putting a hive-site.xml into the conf directory?
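Following that suggestion, here is a minimal sketch of generating such a hive-site.xml. The conf-directory path, the JDBC URL, and the credentials below are assumptions for illustration; point `CONF_DIR` at the conf directory of your actual Spark install (for a pip-installed pyspark, typically `<site-packages>/pyspark/conf`, which you may need to create):

```python
import os

# Hypothetical target directory; adjust for your Spark installation.
CONF_DIR = os.environ.get("SPARK_CONF_DIR", "/tmp/spark-conf")

# Minimal hive-site.xml pointing the metastore at MySQL instead of Derby.
# The URL, driver, and credentials mirror the configs discussed in this issue.
HIVE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>test</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>
</configuration>
"""

os.makedirs(CONF_DIR, exist_ok=True)
path = os.path.join(CONF_DIR, "hive-site.xml")
with open(path, "w") as f:
    f.write(HIVE_SITE)
print("wrote", path)
```

Spark picks this file up at session startup, so it must be in place before the session (or the RayDP executors) are created.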

raviranak commented 1 year ago

I have figured out a way to configure the Hive metastore with MySQL, but I am getting an error.

```python
import raydp

default_spark_conf = {
    "spark.jars.packages": "mysql:mysql-connector-java:8.0.32",
    "spark.jars": "/home/ray/.ivy2/jars/com.mysql_mysql-connector-j-8.0.32.jar",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "test",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "",
    "spark.sql.catalog.spark_catalog.type": "hive",
    "spark.sql.catalogImplementation": "hive",
}

spark = raydp.init_spark(
    app_name="Darwin_SPARK",
    num_executors=1,
    executor_cores=1,
    executor_memory="4G",
    enable_hive=True,
    configs=default_spark_conf,
)
```

I get the error when trying to create a table:

```python
df = spark.createDataFrame(
    [(1, "Smith"), (2, "Rose"), (3, "Williams")],
    ("id", "name"),
)
df.write.mode("overwrite").saveAsTable("employees12")
```

Stack Trace

```
2023-05-22 10:45:09,515 WARN HiveMetaStore [Thread-5]: Retrying creating default database after error: Unexpected exception caught.
javax.jdo.JDOFatalInternalException: Unexpected exception caught.
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1203) ~[javax.jdo-3.2.0-m3.jar:?]
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:814) ~[javax.jdo-3.2.0-m3.jar:?]
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:702) ~[javax.jdo-3.2.0-m3.jar:?]
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:521) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:550) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:405) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:79) ~[hadoop-client-api-3.3.2.jar:?]
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:139) ~[hadoop-client-api-3.3.2.jar:?]
    at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStoreForConf(HiveMetaStore.java:628) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:594) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:588) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:655) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431) ~[hive-metastore-2.3.9.jar:2.3.9]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_362]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_362]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_362]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362]
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6902) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:162) ~[hive-metastore-2.3.9.jar:2.3.9]
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70) ~[hive-exec-2.3.9-core.jar:2.3.9]
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.datanucleus.util.NucleusLogger
    at org.datanucleus.plugin.PluginRegistryFactory.newPluginRegistry(PluginRegistryFactory.java:58)
    at org.datanucleus.plugin.PluginManager.<init>(PluginManager.java:60)
    at org.datanucleus.plugin.PluginManager.createPluginManager(PluginManager.java:430)
    at org.datanucleus.AbstractNucleusContext.<init>(AbstractNucleusContext.java:85)
    at org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:167)
    at org.datanucleus.PersistenceNucleusContextImpl.<init>(PersistenceNucleusContextImpl.java:156)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:415)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:304)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:213)
    ... 80 more
```
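For reference, a `NoClassDefFoundError: Could not initialize class org.datanucleus.util.NucleusLogger` usually points to a classpath problem in the DataNucleus layer that backs Hive's JDO persistence (a failed static initializer, e.g. a logging conflict, or missing/conflicting DataNucleus jars on the executors). One possible workaround sketch is to pull the DataNucleus artifacts in explicitly alongside the MySQL driver. The version numbers below are an assumption (the releases bundled with Hive 2.3.x); verify them against your install:

```python
# Hedged sketch: add the DataNucleus artifacts explicitly so the Hive
# metastore's JDO layer can initialize on the executors. Versions are
# assumptions based on the DataNucleus releases shipped with Hive 2.3.x.
datanucleus_packages = [
    "org.datanucleus:datanucleus-core:4.1.17",
    "org.datanucleus:datanucleus-api-jdo:4.2.4",
    "org.datanucleus:datanucleus-rdbms:4.1.19",
]

default_spark_conf = {
    "spark.jars.packages": ",".join(
        ["mysql:mysql-connector-java:8.0.32"] + datanucleus_packages
    ),
    # ... plus the javax.jdo.option.* metastore configs from above
}

print(default_spark_conf["spark.jars.packages"])
```

The combined list is then passed to `raydp.init_spark(..., configs=default_spark_conf)` as in the original snippet.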

raviranak commented 1 year ago

Could you please help @kira-lin

raviranak commented 1 year ago

I did the same with a plain SparkSession and it works:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Spark Examples")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "test")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "")
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .config("spark.jars", "/home/ray/.ivy2/jars/com.mysql_mysql-connector-j-8.0.32.jar")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.32")
    .enableHiveSupport()
    .getOrCreate()
)
```

raviranak commented 1 year ago

This seems to be a bug in RayDP, could you please look into it?