verdictdb / verdictdb-tutorial


java.lang.OutOfMemoryError #5

Open yunyadbis opened 5 years ago

yunyadbis commented 5 years ago

@pyongjoo Hello. When I tried to use Verdict on a big dataset (500,000 rows), it threw a "java.lang.OutOfMemoryError". The stack trace shows that the error comes from StringBuilder.append.

Exception in thread "main" java.lang.OutOfMemoryError
    at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
    at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at org.verdictdb.core.querying.ExecutableNodeBase.print(ExecutableNodeBase.java:376)
    at org.verdictdb.core.querying.ExecutableNodeBase.getStructure(ExecutableNodeBase.java:363)
    at org.verdictdb.coordinator.SelectQueryCoordinator.process(SelectQueryCoordinator.java:137)
    at org.verdictdb.coordinator.ExecutionContext.streamsql(ExecutionContext.java:170)
    at org.verdictdb.coordinator.ExecutionContext.sql(ExecutionContext.java:120)
    at org.verdictdb.VerdictContext.sql(VerdictContext.java:333)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:133)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:129)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at example.GetTrueRawAnswer$.writeTrueRawResultToTxtFile(GetTrueRawAnswer.scala:129)
    at example.GetTrueRawAnswer$.main(GetTrueRawAnswer.scala:37)
    at example.GetTrueRawAnswer.main(GetTrueRawAnswer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Here is my code; inputFile contains 10,000 random queries.

import java.io.{File, PrintWriter}
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.sql.SparkSession
import org.verdictdb.VerdictContext

def writeTrueRawResultToTxtFile(inputFile: String, outputFile: String, spark: SparkSession, verdict: VerdictContext): Unit = {
  // Load all queries (one per line) onto the driver.
  val lines = spark.read.textFile(inputFile).collect
  println(s"[System Information] Starting to write to a file: $outputFile")
  val writer = new PrintWriter(new File(outputFile))

  for (line <- lines) {
    println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()))
    print(line)
    // Exact answer computed by Spark itself.
    val true_answer = spark.sql(line).rdd.collect()(0)
    println(true_answer)
    // Answer returned by VerdictDB for the same query.
    val raw_answer = verdict.sql(line)
    raw_answer.printCsv()
    writer.write(line + "\t" + true_answer + "\t" + raw_answer.getValue(0))
    //writer.write(line+"\t"+true_answer)
    writer.write("\n")
    writer.flush()
  }
  writer.close()
  println(s"[System Information] Finished writing to a file: $outputFile")
}

Could you please help me with the problem?

pyongjoo commented 5 years ago

Thank you for the question. I think this issue has been solved already.

Can you try with the latest version? https://mvnrepository.com/artifact/org.verdictdb/verdictdb-core/0.5.7

yunyadbis commented 5 years ago

Thank you for your reply 👍 @pyongjoo I've tried the latest version, 0.5.7, and another error occurred:

Exception in thread "main" java.lang.NullPointerException
    at org.verdictdb.coordinator.QueryResultAccuracyEstimatorFromDifference.checkConverge(QueryResultAccuracyEstimatorFromDifference.java:194)
    at org.verdictdb.coordinator.QueryResultAccuracyEstimatorFromDifference.isLastResultAccurate(QueryResultAccuracyEstimatorFromDifference.java:119)
    at org.verdictdb.coordinator.ExecutionContext.sqlSelectQuery(ExecutionContext.java:249)
    at org.verdictdb.coordinator.ExecutionContext.sql(ExecutionContext.java:160)
    at org.verdictdb.VerdictContext.sql(VerdictContext.java:388)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:128)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:123)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at example.GetTrueRawAnswer$.writeTrueRawResultToTxtFile(GetTrueRawAnswer.scala:123)
    at example.GetTrueRawAnswer$.main(GetTrueRawAnswer.scala:43)
    at example.GetTrueRawAnswer.main(GetTrueRawAnswer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

By the way, every time I tried using verdict.sql(), it first showed

ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7

Then, after a long time, it shows either an error or a result. VerdictDB does not show any efficiency advantage. Is there anything wrong with how I am using it? Thank you!!

pyongjoo commented 5 years ago

Thanks for letting me know about the error. I am deploying a new version right now. I'll leave a comment here when it's done.

Regarding the ANTLR message, it is expected behavior if you are using Spark. Both VerdictDB and Spark use ANTLR for parsing SQL queries, but their versions may not match. To avoid the message, I tried to use the same version as Spark, but unfortunately, different versions of Spark include different versions of ANTLR, which makes it hard to match them.

Regarding speedup, Spark is extremely slow for small data. For noticeable speedup, the dataset size should be at least 10 GB. Other databases (Impala, Presto), however, are much faster, so Verdict's performance benefit is much clearer even for smaller datasets, e.g., 1 GB.
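To compare the two engines on the same query, it can help to time each call separately. The following is a minimal sketch, not part of VerdictDB: `timed` and `TimingSketch` are hypothetical names, and the workload in `main` is only a stand-in. In the setting of this thread the block would wrap something like `spark.sql(query).rdd.collect()` (Spark evaluates lazily, so the result has to be collected to measure real work) or `verdict.sql(query)`.

```scala
object TimingSketch {
  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload (assumption): replace with the Spark or VerdictDB call.
    val (sum, ms) = timed { (1 to 1000000).map(_.toLong).sum }
    println(s"result=$sum elapsedMs=$ms")
  }
}
```

Timing the same query several times and ignoring the first run is usually fairer, since the first call also pays for JVM and Spark warm-up.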

yunyadbis commented 5 years ago

Thank you!! 👍 👍 @pyongjoo I look forward to a fix for this problem so that I can use VerdictDB. I am using Spark 2.3.1, the version you recommend. Am I right to understand that the ANTLR version mismatch does not affect the use of Verdict?

I still have a question about where the files are stored. When the location of the spark-warehouse is not set, a spark-warehouse folder is automatically generated in both the Spark home directory and the current directory. When I set the location of the spark-warehouse in the local file system using:

val spark = SparkSession
      .builder()
      .appName("Get true & raw answer. Local mode.")
      .config("spark.sql.warehouse.dir","file:///home/yunya/database/spark-warehouse")
      .enableHiveSupport()
      .getOrCreate()

a spark-warehouse folder is generated in the current directory. When I set the location of the spark-warehouse to HDFS using:

val spark = SparkSession
      .builder()
      .appName("Get true & raw answer. Local mode.")
      .config("spark.sql.warehouse.dir","hdfs://hadoop9:9000/home/yunya/database/spark-warehouse")
      .enableHiveSupport()
      .getOrCreate()

a spark-warehouse folder is generated in HDFS, but the metastore_db folder and derby.log are still generated in the current local directory. So how can I use Verdict with HDFS? Thank you!

pyongjoo commented 5 years ago

VerdictDB version 0.5.8 has been released to Maven Central. Can you try the new version and see whether the same error still occurs?

If you set Spark to use HDFS (as in the second approach), VerdictDB automatically uses HDFS as well, because all VerdictDB does is send different queries to Spark. Note that even with an HDFS configuration, the metadata is still stored in a regular database (e.g., Derby, MySQL, etc.); this is how the Hive metastore used by Spark is designed, which is why metastore_db and derby.log still appear in the local directory.
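If the goal is only to keep metastore_db out of the current directory, one option is to point the embedded Derby metastore at a fixed path. The sketch below is an assumption-laden example rather than a VerdictDB requirement: it relies on Spark forwarding `spark.hadoop.*` settings to the Hive client and on the standard Hive property `javax.jdo.option.ConnectionURL`, and the `metastoreDir` path is hypothetical. It needs Spark with Hive support on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object MetastoreLocationSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local directory for the embedded Derby metastore.
    val metastoreDir = "/home/yunya/database/metastore_db"

    val spark = SparkSession
      .builder()
      .appName("Get true & raw answer. Local mode.")
      // Table data goes to HDFS, as in the second configuration above.
      .config("spark.sql.warehouse.dir", "hdfs://hadoop9:9000/home/yunya/database/spark-warehouse")
      // Hive metastore JDBC URL, passed through Spark's "spark.hadoop." prefix.
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        s"jdbc:derby:;databaseName=$metastoreDir;create=true")
      .enableHiveSupport()
      .getOrCreate()

    // Any catalog access initializes the metastore at the configured path.
    spark.sql("show databases").show()
    spark.stop()
  }
}
```

The location of derby.log is controlled separately by Derby itself (the `derby.stream.error.file` system property), so it would have to be set as a JVM option on the driver rather than through the session builder.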

yunyadbis commented 5 years ago

Thank you! You are so efficient 👍 @pyongjoo There is no error now. However, it takes 7 minutes to get a raw answer from VerdictDB on a 1 GB table, which is even slower than Spark (several seconds). I put the data on all nodes, and the program runs in Standalone mode. Does the location of the spark-warehouse directory (HDFS or local) affect the execution efficiency of VerdictDB?

When I set Spark to use HDFS, it generates a spark-warehouse directory in HDFS and a metastore_db directory in the current directory. I tried reading 10,000 SQL statements from a file and processing them with VerdictDB. It quickly returned a raw answer for the first SQL statement, then showed an error when processing the second one.

2018-12-21 14:51:07 ERROR RetryingHMSHandler:159 - AlreadyExistsException(message:Database verdictdbtemp already exists)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    at com.sun.proxy.$Proxy16.create_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy17.createDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:303)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:303)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:303)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:302)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:164)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateDatabase$1.apply(HiveExternalCatalog.scala:164)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateDatabase$1.apply(HiveExternalCatalog.scala:164)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.doCreateDatabase(HiveExternalCatalog.scala:163)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createDatabase(ExternalCatalog.scala:69)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:207)
    at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
    at org.verdictdb.connection.SparkConnection.executeSingle(SparkConnection.java:159)
    at org.verdictdb.connection.SparkConnection.execute(SparkConnection.java:148)
    at org.verdictdb.connection.CachedDbmsConnection.execute(CachedDbmsConnection.java:49)
    at org.verdictdb.connection.DbmsConnection.execute(DbmsConnection.java:41)
    at org.verdictdb.coordinator.SelectQueryCoordinator.process(SelectQueryCoordinator.java:128)
    at org.verdictdb.coordinator.ExecutionContext.streamSelectQuery(ExecutionContext.java:314)
    at org.verdictdb.coordinator.ExecutionContext.sqlSelectQuery(ExecutionContext.java:237)
    at org.verdictdb.coordinator.ExecutionContext.sql(ExecutionContext.java:160)
    at org.verdictdb.VerdictContext.sql(VerdictContext.java:388)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:121)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:116)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at example.GetTrueRawAnswer$.writeTrueRawResultToTxtFile(GetTrueRawAnswer.scala:116)
    at example.GetTrueRawAnswer$.main(GetTrueRawAnswer.scala:36)
    at example.GetTrueRawAnswer.main(GetTrueRawAnswer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Exception in thread "main" org.verdictdb.exception.VerdictDBDbmsException: Issued the following query: create schema verdictdbtemp
org.apache.hadoop.hive.metastore.api.AlreadyExistsException: Database verdictdbtemp already exists;
    at org.verdictdb.connection.SparkConnection.executeSingle(SparkConnection.java:166)
    at org.verdictdb.connection.SparkConnection.execute(SparkConnection.java:148)
    at org.verdictdb.connection.CachedDbmsConnection.execute(CachedDbmsConnection.java:49)
    at org.verdictdb.connection.DbmsConnection.execute(DbmsConnection.java:41)
    at org.verdictdb.coordinator.SelectQueryCoordinator.process(SelectQueryCoordinator.java:128)
    at org.verdictdb.coordinator.ExecutionContext.streamSelectQuery(ExecutionContext.java:314)
    at org.verdictdb.coordinator.ExecutionContext.sqlSelectQuery(ExecutionContext.java:237)
    at org.verdictdb.coordinator.ExecutionContext.sql(ExecutionContext.java:160)
    at org.verdictdb.VerdictContext.sql(VerdictContext.java:388)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:121)
    at example.GetTrueRawAnswer$$anonfun$writeTrueRawResultToTxtFile$1.apply(GetTrueRawAnswer.scala:116)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at example.GetTrueRawAnswer$.writeTrueRawResultToTxtFile(GetTrueRawAnswer.scala:116)
    at example.GetTrueRawAnswer$.main(GetTrueRawAnswer.scala:36)
    at example.GetTrueRawAnswer.main(GetTrueRawAnswer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I also tried in local mode, and it showed the same error. Does the system create verdictdbtemp whenever a new query comes in? I want to know how to submit multiple queries. Could you help me with this problem? Thanks!
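While the underlying cause is being investigated, a batch run can at least record which queries fail instead of stopping at the first exception. This is a minimal sketch of that pattern and does not fix the `create schema verdictdbtemp` error itself. `BatchQuerySketch`, `runAll`, and `runQuery` are hypothetical names; `runQuery` stands in for whatever call executes one query and renders its answer as a string.

```scala
import scala.util.{Failure, Success, Try}

object BatchQuerySketch {
  // Runs every query through `runQuery`, keeping per-query failures
  // (as Left) instead of letting one exception abort the whole batch.
  def runAll(queries: Seq[String],
             runQuery: String => String): Seq[(String, Either[String, String])] =
    queries.map { q =>
      Try(runQuery(q)) match {
        case Success(answer) => q -> Right(answer)
        case Failure(e)      => q -> Left(s"${e.getClass.getSimpleName}: ${e.getMessage}")
      }
    }

  def main(args: Array[String]): Unit = {
    // Fake runner (assumption) that fails on the second call,
    // mimicking the behavior reported above.
    var calls = 0
    val fake: String => String = { q =>
      calls += 1
      if (calls == 2) throw new IllegalStateException("Database verdictdbtemp already exists")
      s"answer for: $q"
    }
    runAll(Seq("select 1", "select 2", "select 3"), fake).foreach(println)
  }
}
```

Note that `Try` only captures non-fatal exceptions, so an `OutOfMemoryError` like the one at the top of this issue would still terminate the run.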

koksen commented 5 years ago

Hi, my test also showed that Spark SQL is faster than Verdict SQL in the sample app. Did anyone find a solution?

pyongjoo commented 5 years ago

@yunyadbis Sorry for my slow response.

@Beastjoe Can you take a look at this issue regarding Spark? I think we need to do some performance tuning for Spark, since it doesn't support concurrent queries very well.

Beastjoe commented 5 years ago

I will take a look.