samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

Import Error with Databricks #48

Closed mayankshah891 closed 6 years ago

mayankshah891 commented 6 years ago

I am attempting to read a table from BigQuery and write it to another database.

On my first few tries, I was able to read the table from BigQuery, but when I attempted to write said table to the database, I got the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 4 times, most recent failure: Lost task 1.3 in stage 13.0 (TID 41, 10.102.241.112, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!

Then, I checked our server, and it had actually worked! So I figured never mind, let me run the process all the way through again.

However, now I can't even read the initial table from BigQuery, with the exact same code. It returns this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 53, 10.102.249.174, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 1!

Any clues? I'm using Databricks to import a table from BigQuery. The cluster I'm using is Databricks 3.3, which includes Spark 2.2.0 and Scala 2.11.
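For reference, the read side of a job like this can be sketched as follows in PySpark. This is a sketch only: the data source name `com.samelamin.spark.bigquery` and the `tableReferenceSource` option are assumptions about this library's API (they may differ by version), and the project/dataset/table names are placeholders, not values from this thread.

```python
# Sketch of a BigQuery read with this connector (PySpark).
# ASSUMPTIONS: the "com.samelamin.spark.bigquery" data source name and
# the "tableReferenceSource" option; project/dataset/table are placeholders.

def bigquery_read_options(project, dataset, table):
    """Build the option map passed to spark.read for this connector."""
    return {"tableReferenceSource": f"{project}:{dataset}.{table}"}

# In a Databricks notebook the read itself would then be roughly:
# df = (spark.read
#         .format("com.samelamin.spark.bigquery")
#         .options(**bigquery_read_options("my-project", "my_dataset", "events"))
#         .load())
```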

Thank you so much for your time and expertise.

samelamin commented 6 years ago

Hi

Seems to be a problem with the transaction log. It's hard to be sure without looking at the full stack trace.

The library is built against Spark 2.1. I've been meaning to upgrade to 2.2 and will be doing that over the next few days.

In the meantime, I'd suggest aiming it at a 2.1 cluster, or building a custom package, and seeing how it goes.

Regards Sam


mayankshah891 commented 6 years ago

Wow, thank you for the incredibly fast response! I'm launching a 2.1 cluster now and will let you know how it goes.

EDIT: In case this matters, here is the complete error:

at com.google.common.base.Preconditions.checkState(Preconditions.java:177) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:207) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:249) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:243) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1676) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1664) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1663) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1663) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:930) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:930) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:930) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1896) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1847) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1835) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:732) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2076) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2095) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:361) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2925) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2215) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2215) at 
org.apache.spark.sql.Dataset$$anonfun$56.apply(Dataset.scala:2909) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2908) at org.apache.spark.sql.Dataset.head(Dataset.scala:2215) at org.apache.spark.sql.Dataset.take(Dataset.scala:2428) at org.apache.spark.sql.Dataset.showString(Dataset.scala:247) at org.apache.spark.sql.Dataset.show(Dataset.scala:644) at org.apache.spark.sql.Dataset.show(Dataset.scala:603) at org.apache.spark.sql.Dataset.show(Dataset.scala:612) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:2) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:55) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:57) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:59) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:61) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:63) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-203108:65) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw$$iw.(command-203108:67) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw$$iw.(command-203108:69) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw$$iw.(command-203108:71) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw$$iw.(command-203108:73) at line72e45f3548f94f05b4f11df9fae814ff57.$read$$iw.(command-203108:75) at line72e45f3548f94f05b4f11df9fae814ff57.$read.(command-203108:77) at line72e45f3548f94f05b4f11df9fae814ff57.$read$.(command-203108:81) at line72e45f3548f94f05b4f11df9fae814ff57.$read$.(command-203108) at 
line72e45f3548f94f05b4f11df9fae814ff57.$eval$.$print$lzycompute(:7) at line72e45f3548f94f05b4f11df9fae814ff57.$eval$.$print(:6) at line72e45f3548f94f05b4f11df9fae814ff57.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:186) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:182) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:182) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:182) at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:456) at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:410) at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:182) at 
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$3.apply(DriverLocal.scala:234) at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$3.apply(DriverLocal.scala:215) at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:188) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:183) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:39) at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:221) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:39) at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:215) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:601) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:601) at scala.util.Try$.apply(Try.scala:192) at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:596) at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:486) at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:554) at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:391) at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:348) at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 1! 
at com.google.common.base.Preconditions.checkState(Preconditions.java:177) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:207) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at

org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:249) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:243) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

mayankshah891 commented 6 years ago

Update:

So using a v2.1 cluster fixed the initial issue, in that I can read data from BigQuery into Spark. The table does indeed display (!)

Writing to the other database is still producing this weird error, and this time no data came through (besides the schema):

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 10.102.229.60, executor 1): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!

Here is the complete error:

Py4JJavaError Traceback (most recent call last)

in () ----> 1 Events_data.write.format("jdbc").option("url", "jdbc:mapd:e*************st-2.compute.amazonaws.com:9091:mapd").option("driver", "com.mapd.jdbc.MapDDriver").option("dbtable", "randomevents").option("user", "mapd").option("password", "**************").mode("overwrite").save()

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options) 546 self.format(format) 547 if path is None: --> 548 self._jwrite.save() 549 else: 550 self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in call(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError(

Py4JJavaError: An error occurred while calling o186.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 10.102.229.60, executor 1): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0! at com.google.common.base.Preconditions.checkState(Preconditions.java:177) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:578) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1430) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1429) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1429) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1657) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1612) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1937) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1950) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1977) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at 
org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924) at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2323) at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2323) at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2323) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:87) at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:60) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2777) at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2322) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:670) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:488) at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:217) at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply(DataFrameWriter.scala:209) at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply(DataFrameWriter.scala:209) at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:53) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:209) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0! at com.google.common.base.Preconditions.checkState(Preconditions.java:177) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327) at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:578) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more
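For reference, the failing write above boils down to a plain JDBC save. The sketch below just assembles the writer options from the command in this thread; the hostname, table name, and password are placeholders (the real ones are redacted above), and `com.mapd.jdbc.MapDDriver` is taken from that command.

```python
# Sketch of the JDBC write options used in the failing save (PySpark).
# Hostname, table, and password below are PLACEHOLDERS, not real values.

def mapd_jdbc_options(host, table, port=9091, db="mapd",
                      user="mapd", password="REDACTED"):
    """Build the option map for DataFrameWriter.format("jdbc")."""
    return {
        "url": f"jdbc:mapd:{host}:{port}:{db}",
        "driver": "com.mapd.jdbc.MapDDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }

# The write in the notebook is then roughly:
# (Events_data.write.format("jdbc")
#     .options(**mapd_jdbc_options("example-host", "randomevents"))
#     .mode("overwrite")
#     .save())
```

Note that the `Caused by` above shows the failure happens while the BigQuery export is being re-read to feed the JDBC writer, not in the JDBC layer itself.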

samelamin commented 6 years ago

Hey, sorry, I'm at a conference so I won't have a chance to really look at this until after the conference.

So just to clarify: this is a fresh cluster with the Google creds JSON in it? (It must be, since you wouldn't be able to read if it wasn't.)

Does it fail every time, or only after a while? Also, is it a Structured Streaming save or a batch save, i.e. a one-off save to BigQuery?
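For reference, the batch-versus-streaming distinction being asked about looks like this in PySpark. A sketch only: the formats and paths in the comments are placeholders, and the helper just wraps the standard `DataFrame.isStreaming` flag.

```python
# A streaming DataFrame must be saved via writeStream, not write;
# this is the batch-vs-streaming distinction being asked about.

def is_streaming_save(df):
    """True if df comes from readStream and needs writeStream.start()."""
    return df.isStreaming

# Batch (one-off) save:
#   df.write.format("jdbc").options(...).mode("overwrite").save()
#
# Structured Streaming save:
#   (df.writeStream.format(...)
#        .option("checkpointLocation", "/path/to/checkpoints")
#        .start())
```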

org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more


mayankshah891 commented 6 years ago

It's a fresh cluster, and I have the credentials saved via JSON correctly (I'm assuming so, because I was able to read the table at least once).

Now every time I fire up the cluster, it fails on the initial read of the table. It is a batch job that I was going to run every night.

This is the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.102.245.245, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000004.avro' with index 4, which isn't less than or equal to than endFileNumber 2!

samelamin commented 6 years ago

That seems to be a problem with the data being read in; the connector doesn't read or use Avro.

Is the data being read in Avro?


mayankshah891 commented 6 years ago

How do I check if the data is Avro? It's a table in BigQuery (obviously), and the person who put it there says he doesn't know what Avro is, so I'm assuming not. Again, it worked once and just hasn't worked since, but I still have the print output showing the table in Databricks, so I'm 100% sure it worked at least once.
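For anyone else wondering how to answer this question: Avro object container files start with the 4-byte magic `b"Obj\x01"` (per the Avro spec), so peeking at the header of an exported file is enough to tell. A minimal sketch (the dummy file here is purely illustrative):

```python
import tempfile

def looks_like_avro(path):
    """Return True if the file carries the Avro object container magic.

    Avro container files begin with the four bytes b'Obj\\x01', so a
    header peek is a cheap format check."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# Demo against a dummy file carrying the Avro magic.
with tempfile.NamedTemporaryFile(suffix=".avro", delete=False) as tmp:
    tmp.write(b"Obj\x01" + b"\x00" * 16)
    dummy = tmp.name

print(looks_like_avro(dummy))  # True for a file with the Avro magic
```

Note that in this thread the Avro files are an implementation detail of the Hadoop BigQuery connector's export step, not something the table owner chose.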

samelamin commented 6 years ago

Can you make sure you are using my connector and not the Spotify one? I believe they convert to Avro.


samelamin commented 6 years ago

Actually, my mistake: digging into the code, I see the Google Hadoop connector is using Avro:

https://github.com/samelamin/spark-bigquery/blob/ed716c8cc5b2f1fdf0d57c0e8da7d7d6e1bc0925/src/main/scala/com/samelamin/spark/bigquery/BigQuerySQLContext.scala#L89

I am not near my laptop at the moment, but I will have a look tomorrow.

In the meantime, try a restart or a fresh cluster just to make sure it's nothing to do with the state. The connector definitely works on Databricks; I am just not sure why only you are facing the problem. My only guess is that it's something with the cluster setup.


mayankshah891 commented 6 years ago

Okay, thank you so much for your time, and for building this in the first place. I'll try some different clusters and setups and we'll see. Thank you again.

mayankshah891 commented 6 years ago

I've tried a bunch of different clusters, with mixed results. I can read in the data sometimes, but if I then try to call .show(200) on the data, it fails. Again, the error seems to be something to do with partitions: Databricks believes it is done reading the data, but then another file pops up that is past the endFileNumber, and it throws the error.

Someone else was running into an error like this here, but I haven't cracked how to solve it myself yet:

https://stackoverflow.com/questions/46404081/illegalstateexception-on-google-cloud-dataproc

EDIT: To recap, I'm reading from a BigQuery table that is being updated daily. It is a batch read, which I then attempt to view fully and write to another server. So far it fails intermittently at each step of the process.
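The endFileNumber check failing in the stack traces is a simple invariant in the Hadoop connector's dynamic file lister: once an end-marker fixes the number of exported shards, no data file with a higher index may appear, and an export that keeps growing underneath the reader violates exactly that. A rough sketch of the invariant (a hypothetical simplification, not the connector's actual code):

```python
def check_file_index(filename, end_file_number):
    """Sketch of the precondition DynamicFileListRecordReader enforces:
    after the export is marked complete at end_file_number, seeing a
    higher-indexed shard means the file set changed mid-read."""
    # e.g. 'data-000000000002.avro' -> shard index 2
    index = int(filename.split("-")[1].split(".")[0])
    if index > end_file_number:
        raise RuntimeError(
            f"Found known file '{filename}' with index {index}, "
            f"which isn't less than or equal to endFileNumber {end_file_number}!")
    return index

print(check_file_index("data-000000000001.avro", 2))  # 1: within bounds
# check_file_index("data-000000000002.avro", 1)       # would raise RuntimeError
```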

mayankshah891 commented 6 years ago

Sam, I think I figured out the problem. The BigQuery table I'm reading from is streaming in new data every few seconds, so Spark's lazy transformations break the job because the exported file set keeps changing underneath it! I just need to call .persist or .cache to hold the data in memory, and then it works! (Well, sort of works; it's not writing properly, but I think that can be fixed as well.)
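That failure mode can be illustrated outside Spark: a lazy view re-reads its source on every action, so if the source grows between actions, two passes see different snapshots, while a materialized copy does not. A minimal pure-Python sketch (the generator stands in for a lazily evaluated DataFrame, and materializing the list plays the role of .cache()/.persist()):

```python
# A mutable "table" that grows between reads, like a BigQuery table
# receiving streaming inserts.
table = [1, 2, 3]

def lazy_view():
    # Lazy view: re-scans the table every time it is consumed.
    return (row * 10 for row in table)

first_pass = list(lazy_view())
table.append(4)                    # a streamed insert lands mid-job
second_pass = list(lazy_view())
print(first_pass != second_pass)   # True: the two actions disagree

# Materialized view: computed once, then reused regardless of the source.
cached = list(lazy_view())
snapshot = cached[:]
table.append(5)                    # more inserts arrive...
print(cached == snapshot)          # True: the cached copy is stable
```

In Spark terms, each action (show, count, the JDBC write) re-triggers the BigQuery export scan, which is why caching after the first materialization stabilizes the job.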

Tentatively I think this issue can be closed. Thanks again

samelamin commented 6 years ago

Hi, no worries; glad we got there in the end.

Had no idea you were streaming!

Can you close the issue, please?