spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

404 Not found exception when creating gcs directories #56

Open nicolasblaye opened 6 years ago

nicolasblaye commented 6 years ago

Hi everyone,

I encountered a weird issue while trying your library. When saving the temp file to GCS, it called the Storage API with a malformed address: http://google.api.address/null.

I tried debugging through the code to find what was causing the problem and did not find it; however, I ended up solving the issue by accident.

I wanted to test creating a directory with the service account to see whether it was a permission problem, so I added google-cloud-storage to my dependencies (I couldn't import com.google.cloud.storage.StorageOptions without it), and that alone solved the issue... A sketch of the check I had in mind is below.

Is there a way to make this error more explicit? Is it a problem that is global to the Google libraries rather than specific to this one?
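For reference, here is a minimal sketch of the permission check I had in mind (it needs the google-cloud-storage dependency mentioned above; the bucket name and object path are placeholders, and it assumes GOOGLE_APPLICATION_CREDENTIALS points at the service account key):

import com.google.cloud.storage.{BlobInfo, StorageOptions}

object GcsPermissionCheck extends App {
  // Uses Application Default Credentials (GOOGLE_APPLICATION_CREDENTIALS).
  val storage = StorageOptions.getDefaultInstance.getService
  // GCS has no real directories, so creating a zero-byte object ending in "/" stands in for mkdir.
  val marker = BlobInfo.newBuilder("my-temp-bucket", "spark-bigquery-test/").build()
  storage.create(marker)
  println(s"Created gs://${marker.getBucket}/${marker.getName}, so the service account can write to the bucket")
}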

Here is the build.sbt to reproduce the error

 "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2",
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "com.google.cloud" % "google-cloud-bigquery" % "0.32.0-beta",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "1.6.2-hadoop2",
  "com.spotify" % "spark-bigquery_2.11" % "0.2.2",
  "org.apache.parquet" % "parquet-avro" % "1.9.0"

And the code

import com.spotify.spark.bigquery._
import org.apache.spark.sql.SparkSession

object Main extends App {
    implicit val spark = SparkSession
      .builder()
      .appName("Name")
      .master("local[*]")
      .config("google.cloud.auth.service.account.json.keyfile", "/path")
      .config("fs.gs.project.id", "project-id")
      .getOrCreate()

    // Placeholder table reference; the real table name is not relevant to the bug.
    val tableName = "dataset.table"

    // bigQuerySelect comes from the com.spotify.spark.bigquery._ implicits on SQLContext
    val bqSqlContext = spark.sqlContext
    bqSqlContext.bigQuerySelect(s"SELECT * FROM ${tableName} LIMIT 10")
}

And here is the exception:

Exception in thread "main" com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
    at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
    at com.google.cloud.hadoop.gcsio.BatchHelper.flushIfPossible(BatchHelper.java:118)
    at com.google.cloud.hadoop.gcsio.BatchHelper.flush(BatchHelper.java:132)
    at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfos(GoogleCloudStorageImpl.java:1493)
    at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.getItemInfos(ForwardingGoogleCloudStorage.java:221)
    at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfos(GoogleCloudStorageFileSystem.java:1159)
    at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:530)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:1382)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819)
    at com.google.cloud.hadoop.io.bigquery.AbstractExportToCloudStorage.prepare(AbstractExportToCloudStorage.java:59)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:123)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
    at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1368)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1367)
    at com.spotify.spark.bigquery.BigQuerySQLContext.bigQueryTable(BigQuerySQLContext.scala:112)
    at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)
    at com.powerspace.bigquery.BigQueryExporter.read(BigQueryExporter.scala:24)

Cheers

aakoshh commented 6 years ago

@nicolasblaye I have the same issue with Spark 2.2.1 and gcs-connector 1.6.5-hadoop2.

When you said that adding it to the dependencies resolved the issue, what did you mean? The dependency is already listed in the build.sbt that you said reproduces the error.

nicolasblaye commented 6 years ago

@aakoshh I am talking about the google-cloud-storage dependency, which is not in my example: "com.google.cloud" % "google-cloud-storage" % <version>.

Maybe it has been fixed since then; it's been a while.
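In build.sbt form, that is roughly the following (the version here is only an example; match it to the other Google Cloud client versions you use):

  "com.google.cloud" % "google-cloud-storage" % "1.14.0"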

yanasega commented 6 years ago

This solved my issue as well. I used the bigquery4s (https://github.com/seratch/bigquery4s) dependency to use non-legacy SQL and the error appeared; adding google-cloud-storage solved it. (I am writing this down in case anyone else runs into this issue, or knows what the root cause is.)