Sometimes loading Avro files to a BigQuery table fails

yu-iskw commented 7 years ago

Sometimes loading Avro files into a BigQuery table fails because a _temporary directory on GCS doesn't exist.

16/12/02 23:46:47 INFO com.spotify.spark.bigquery.BigQueryClient: Loading gs://spark-helper-us-region/hadoop/tmp/spark-bigquery/spark-bigquery-1480722354001=863039906 into sage-shard-740:analytics_us.activities_20160903
Exception in thread "main" java.io.IOException: Not found: Uri gs://spark-helper-us-region/hadoop/tmp/spark-bigquery/spark-bigquery-1480722354001=863039906/_temporary/0/_temporary/attempt_201612022346_0113_m_000065_0/part-r-00065-0f64344d-0f7e-4677-a28b-56e79a287e41.avro
    at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:95)
    at com.spotify.spark.bigquery.BigQueryClient.com$spotify$spark$bigquery$BigQueryClient$$waitForJob(BigQueryClient.scala:134)
    at com.spotify.spark.bigquery.BigQueryClient.load(BigQueryClient.scala:130)
    at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:150)
    at com.spotify.spark.bigquery.package$BigQueryDataFrame.saveAsBigQueryTable(package.scala:159)
    at com.mercari.spark.sql.SparkBigQueryHelper$.saveBigQueryTableByDataFrame(SparkBigQueryHelper.scala:229)
    at com.mercari.spark.sql.SparkBigQueryHelper.saveBigQueryTableByDataFrame(SparkBigQueryHelper.scala:66)
    at com.mercari.spark.batch.ActivitiesTableCreator$.apply(ActivitiesTableCreator.scala:226)
    at com.mercari.spark.batch.ActivitiesTableCreator$.main(ActivitiesTableCreator.scala:210)
    at com.mercari.spark.batch.ActivitiesTableCreator.main(ActivitiesTableCreator.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

jonas commented 7 years ago

Encountered a lot of these today. Looks like there is a consistency issue between the GCS write and the BQ read.

nevillelyh commented 7 years ago

Is that caused by the "eventual consistency" behavior of the GCS list operation?
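
If the list operation is indeed the cause, one possible workaround is to re-list the output directory until no `_temporary` entries show up, and only then hand the committed part files to the load job. Below is a minimal sketch of that idea, not what spark-bigquery actually does. `TemporaryPathFilter`, its methods, and the `list` callback are hypothetical names; the callback stands in for whatever GCS listing call the job uses. Whether this actually avoids the error above is untested.

```scala
object TemporaryPathFilter {

  // True when any path segment of the URI is the `_temporary` staging directory.
  def isTemporary(uri: String): Boolean =
    uri.split('/').contains("_temporary")

  // Keep only Avro part files that are not under a `_temporary` directory.
  def committedAvroUris(listing: Seq[String]): Seq[String] =
    listing.filter(u => u.endsWith(".avro") && !isTemporary(u))

  // Re-list until the listing shows no `_temporary` entries, or give up.
  // `list` is a hypothetical callback returning the object URIs under the output path.
  @annotation.tailrec
  def awaitCommitted(list: () => Seq[String],
                     attemptsLeft: Int,
                     sleepMillis: Long): Option[Seq[String]] = {
    val listing = list()
    if (!listing.exists(isTemporary)) Some(committedAvroUris(listing))
    else if (attemptsLeft <= 1) None
    else {
      Thread.sleep(sleepMillis)
      awaitCommitted(list, attemptsLeft - 1, sleepMillis)
    }
  }
}
```

`awaitCommitted` returns `None` when the listing never settles, so the caller can fail explicitly instead of starting a load that references a file that has since been moved.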

spotify / spark-bigquery

Sometimes loading Avro files to a BigQuery table fails #29