verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Create sample command fails in Google Dataproc Spark 2.11.8 #163

Open sanjayio opened 6 years ago

sanjayio commented 6 years ago

When I run the command vc.sql("create sample of default.advertiser_apr_orc").show(false), I get this error:

java.io.IOException: Mkdirs failed to create file:/home/sanjay/spark-warehouse/default_verdict.db/vt23_1/.hive-staging_hive_2018-07-17_03-03-28_842_6156432897141230125-1/-ext-10000/_temporary/0/_temporary/attempt_20180717030333_0002_m_000016_3

I am running on the Dataproc 1.2 image with Spark (Scala 2.11.8) and verdict-spark-lib-0.4.8.jar. I run the command as the root user and have done chmod 755 on the directory /home/sanjay/.

sanjayio commented 6 years ago

I also tried the same configuration given in the documentation, with the Dataproc 1.0 image and verdict-core-0.3.0-jar-with-dependencies.jar. When I run the create sample command, I get this error:

org.apache.hadoop.hive.common.FileUtils: Creating directory if it doesn't exist: hdfs://cluster-16-m/user/hive/warehouse/null_verdict.db/vt66_4/.hive-staging_hive_2018-07-17_07-25-01_583_8460634515176170110-1
java.lang.NullPointerException
  at edu.umich.verdict.util.StringManipulations.quoteString(StringManipulations.java:132)
  at edu.umich.verdict.dbms.DbmsSpark.insertEntry(DbmsSpark.java:138)
  at edu.umich.verdict.dbms.Dbms.insertSampleNameEntryIntoDBMS(Dbms.java:480)
  at edu.umich.verdict.dbms.DbmsSpark.updateSampleNameEntryIntoDBMS(DbmsSpark.java:146)
  at edu.umich.verdict.VerdictMeta.insertSampleInfo(VerdictMeta.java:200)
  at edu.umich.verdict.query.CreateSampleQuery.createUniformRandomSample(CreateSampleQuery.java:120)
  at edu.umich.verdict.query.CreateSampleQuery.buildSamples(CreateSampleQuery.java:57)
  at edu.umich.verdict.query.CreateSampleQuery.buildSamples(CreateSampleQuery.java:81)
  at edu.umich.verdict.query.CreateSampleQuery.compute(CreateSampleQuery.java:39)
  at edu.umich.verdict.query.Query.computeDataFrame(Query.java:107)
  at edu.umich.verdict.VerdictSparkHiveContext.execute(VerdictSparkHiveContext.java:40)
  at edu.umich.verdict.VerdictContext.executeSparkQuery(VerdictContext.java:125)
  at edu.umich.verdict.VerdictContext.sql(VerdictContext.java:131)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
  at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
  at $iwC$$iwC$$iwC.<init>(<console>:44)
  at $iwC$$iwC.<init>(<console>:46)
  at $iwC.<init>(<console>:48)
  at <init>(<console>:50)
  at .<init>(<console>:54)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
  at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
  at org.apache.spark.repl.Main$.main(Main.scala:31)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

sanjayio commented 6 years ago

I tried building the master branch (verdict-spark-lib-0.4.11.jar) and ran it on a fresh Google Dataproc 1.2 instance. Even on that instance, when I run

scala> import edu.umich.verdict.VerdictSpark2Context
scala> val vc = new VerdictSpark2Context(sc)
scala> vc.sql("show databases").show(false)
scala> vc.sql("create sample of default.advertiser_06_01_orc").show(false)

I get the following error:

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to create database path file:/home/sanjay/spark-warehouse/default_verdict.db, failed to create database default_verdict);
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.doCreateDatabase(HiveExternalCatalog.scala:163)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createDatabase(ExternalCatalog.scala:69)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:219)
  at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:66)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
  at edu.umich.verdict.dbms.DbmsSpark2.execute(DbmsSpark2.java:84)
  at edu.umich.verdict.dbms.DbmsSpark2.executeUpdate(DbmsSpark2.java:91)
  at edu.umich.verdict.dbms.Dbms.createCatalog(Dbms.java:192)
  at edu.umich.verdict.dbms.Dbms.createDatabase(Dbms.java:183)
  at edu.umich.verdict.query.CreateSampleQuery.buildSamples(CreateSampleQuery.java:93)
  at edu.umich.verdict.query.CreateSampleQuery.compute(CreateSampleQuery.java:64)
  at edu.umich.verdict.query.Query.computeDataset(Query.java:192)
  at edu.umich.verdict.VerdictSpark2Context.execute(VerdictSpark2Context.java:61)
  at edu.umich.verdict.VerdictContext.executeSpark2Query(VerdictContext.java:160)
  at edu.umich.verdict.VerdictSpark2Context.sql(VerdictSpark2Context.java:81)

What does this error mean?

pyongjoo commented 6 years ago

This seems to be an HDFS (or Hive) permission issue. When I have observed similar errors, they were caused by a lack of write permission on the spark-warehouse directory.
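
A quick way to confirm which warehouse directory the session is actually pointing at (just a diagnostic sketch, assuming the spark and sc variables predefined by spark-shell; these are standard Spark/Hadoop settings, not Verdict-specific):

scala> spark.conf.get("spark.sql.warehouse.dir")   // directory Spark uses when creating new databases such as default_verdict
scala> sc.hadoopConfiguration.get("fs.defaultFS")  // shows whether unqualified paths resolve to local file:/ or to HDFS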

Can you check whether the regular SparkSession.sql("create schema myschema") works? If you are using the Spark interactive shell, the command is spark.sql("create schema myschema"); otherwise, replace the variable spark with your SparkSession instance.
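
In the shell, the check could look roughly like this (a sketch; myschema is just a throwaway name):

scala> spark.sql("create schema myschema")          // try creating a schema directly, bypassing Verdict
scala> spark.sql("show databases").show(false)      // myschema should be listed if it worked
scala> spark.sql("drop schema myschema")            // clean up the test schema

If this fails with the same MetaException, that points at the Spark/Hive warehouse permissions rather than at Verdict itself.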

Depending on the result of the above command, our investigation will take a different direction.

Thanks, Yongjoo

sanjayio commented 6 years ago

@pyongjoo I think you are right. I am not able to create the schema either; I get the same error when I try that. How can I resolve this issue?

pyongjoo commented 6 years ago

In my case, I used the regular hdfs command, for example hdfs dfs -chmod 777 /.../spark-warehouse. This usually works when you have a separate installation of HDFS and Spark is using that HDFS installation. Here is a link with more hdfs commands: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
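
If it is easier to stay inside the Spark shell, the same check and fix can be sketched with the Hadoop FileSystem API (the path below is a placeholder; use the warehouse directory from your error message, and note that setPermission requires being the directory owner or the HDFS superuser):

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> import org.apache.hadoop.fs.permission.FsPermission
scala> val warehouse = new Path("/user/hive/warehouse")      // placeholder path
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> fs.getFileStatus(warehouse).getPermission             // print the current permissions
scala> fs.setPermission(warehouse, new FsPermission(Integer.parseInt("777", 8).toShort))  // same effect as hdfs dfs -chmod 777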

I am pretty sure there is plenty of other documentation on this for Google Dataproc, but I cannot test it right now.

FYI, we plan to update VerdictDB soon. In that version, you will be able to configure the schema used by Verdict directly.
