spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Unable to read table .. can write .. Dataproc scala #43

Closed: darylerwin closed this issue 6 years ago

darylerwin commented 7 years ago

I am new to this architecture and have read many articles about dependencies and similar things not working. Can someone point out where I might have gone wrong? There was a lot of trial and error in this build.sbt. Environment: Spark 2.2.0, Scala 2.11.8.

build.sbt:

version := "2.0.2"
scalaVersion := "2.11.8"
artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "." + artifact.extension
}

resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += "jitpack" at "https://jitpack.io"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11
//TRYlibraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.2.0
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"

// https://spark-packages.org/package/spotify/spark-bigquery
libraryDependencies += "com.spotify" % "spark-bigquery_2.11" % "0.2.1"

// https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11
// https://github.com/databricks/spark-avro
//V1 //libraryDependencies += "com.databricks" % "spark-avro_2.11" % "3.0.0"
//V2 libraryDependencies += "com.github.databricks" % "spark-avro" % "204864b6cf"
libraryDependencies += "com.databricks" %% "spark-avro" % "3.0.0" 
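
For comparison, here is a minimal sketch of the same build with the Spark modules pinned to the Spark 2.2.0 mentioned above and marked "provided", so the jar does not carry its own Spark 2.0.0 classes. This rests on assumptions: that the cluster really runs Spark 2.2.0, and that a version mismatch is related to the failure at all, which the truncated stack trace does not confirm. The name sparkVersion is only an illustrative helper. spark-avro is left out because the spark-shell resolution log shows spark-bigquery 0.2.1 pulling in spark-avro 3.0.0 transitively.

```scala
// build.sbt (sketch, not a confirmed fix)
scalaVersion := "2.11.8"

// Assumption: matches the Spark version installed on the Dataproc cluster.
val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  // "provided": compile against Spark, but use the cluster's own jars at runtime.
  "org.apache.spark" %% "spark-sql"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-core"     % sparkVersion % "provided",
  // %% appends the Scala binary suffix, giving spark-bigquery_2.11.
  "com.spotify"      %% "spark-bigquery" % "0.2.1"
)
```

With "provided" scope the Spark classes come from the cluster at runtime, so the versions compiled against and the versions executed are at least the same release line.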

The Error:

        at com.spotify.spark.bigquery.BigQuerySQLContext.bigQueryTable(BigQuerySQLContext.scala:116)
        at com.spotify.spark.bigquery.BigQuerySQLContext.bigQuerySelect(BigQuerySQLContext.scala:93)
        at Query$.main(myquery.scala:19)
        at Query.main(myquery.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/09/09 03:02:55 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@30c31dd7{HTTP/1.1}{0.0.0.0:4040}
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [4947fcc7-1bb4-4db0-9d33-9b9d67d88db8] entered state [ERROR] while waiting for [DONE].

I am also using the following initialization code, which runs when Dataproc builds the cluster, to replace the Avro jars:

rm -rf /usr/lib/h{adoop,ive}*/{,lib/}*avro*.jar
# Consider staging these jars in GCS to avoid being throttled & be nice to Maven Central.
gsutil cp gs://prod-bigdata/jar/avro-1.7.7.jar /usr/lib/hadoop/lib
gsutil cp gs://prod-bigdata/jar/avro-mapred-1.7.7-hadoop2.jar /usr/lib/hadoop-mapreduce

This is the sample script I am attempting to run (some cutting and pasting here). I have tried both the direct table call and the bigQuerySelect call. The save DOES work.

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

object Query {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Brings bigQuerySelect into scope on SQLContext
    import com.spotify.spark.bigquery._
    // Legacy SQL table reference; keep only the first 100 rows
    val table2 = spark.sqlContext.bigQuerySelect("SELECT * FROM [prod:data_analytics_poc.REGIONS]").limit(100)
    table2.show(20)
  }
}

darylerwin commented 7 years ago

I am able to run this via spark-shell on the master node:

spark-shell --packages com.spotify:spark-bigquery_2.11:0.2.1

Ivy Default Cache set to: /home/derwin/.ivy2/cache
The jars for the packages stored in: /home/derwin/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.spotify#spark-bigquery_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.spotify#spark-bigquery_2.11;0.2.1 in central
        found com.databricks#spark-avro_2.11;3.0.0 in central
        found org.slf4j#slf4j-api;1.7.5 in central
        found org.apache.avro#avro;1.7.6 in central
        found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
        found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
        found com.thoughtworks.paranamer#paranamer;2.3 in central
        found org.xerial.snappy#snappy-java;1.0.5 in central
        found org.apache.commons#commons-compress;1.4.1 in central
        found org.tukaani#xz;1.0 in central
        found com.google.cloud.bigdataoss#bigquery-connector;0.7.5-hadoop2 in central
        found com.google.cloud.bigdataoss#util-hadoop;1.4.5-hadoop2 in central
        found com.google.api-client#google-api-client-java6;1.20.0 in central
        found com.google.api-client#google-api-client;1.20.0 in central
        found com.google.oauth-client#google-oauth-client;1.20.0 in central
        found com.google.http-client#google-http-client;1.20.0 in central
        found com.google.code.findbugs#jsr305;2.0.3 in central
        found org.apache.httpcomponents#httpclient;4.0.1 in central
        found org.apache.httpcomponents#httpcore;4.0.1 in central
        found commons-logging#commons-logging;1.1.1 in central
        found commons-codec#commons-codec;1.6 in central
        found com.google.http-client#google-http-client-jackson2;1.20.0 in central
        found com.fasterxml.jackson.core#jackson-core;2.1.3 in central
        found com.google.oauth-client#google-oauth-client-java6;1.20.0 in central
        found com.google.api-client#google-api-client-jackson2;1.20.0 in central
        found com.google.apis#google-api-services-storage;v1-rev35-1.20.0 in central
        found com.google.guava#guava;18.0 in central
        found com.google.cloud.bigdataoss#util;1.4.5 in central
        found com.google.cloud.bigdataoss#gcs-connector;1.4.5-hadoop2 in central
        found com.google.cloud.bigdataoss#gcsio;1.4.5 in central
        found com.google.apis#google-api-services-bigquery;v2-rev217-1.20.0 in central
        found com.google.code.gson#gson;2.3 in central
        found org.apache.avro#avro;1.7.7 in central
        found org.slf4j#slf4j-simple;1.7.21 in central
        found org.slf4j#slf4j-api;1.7.21 in central
        found joda-time#joda-time;2.9.3 in central

Is there any way for me to see which library versions Dataproc is using, or should I somehow pin them in build.sbt?
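
One way to approach the first half of that question from inside a job is to ask the JVM which jar each class was actually loaded from. The following is a minimal sketch using only standard JVM reflection. ClasspathProbe and jarOf are hypothetical names introduced here, and the probed class names are taken from the stack trace and resolution log above, on the assumption that they are the relevant ones.

```scala
import scala.util.Try

object ClasspathProbe {
  // Returns the jar (or directory) a class is loaded from, or a placeholder.
  // initialize = false avoids running static initializers while probing.
  def jarOf(className: String): String =
    Try(Class.forName(className, false, getClass.getClassLoader)).toOption match {
      case None => "(not on classpath)"
      case Some(cls) =>
        Option(cls.getProtectionDomain.getCodeSource)
          .flatMap(cs => Option(cs.getLocation))
          .map(_.toString)
          .getOrElse("(bootstrap or unknown source)")
    }

  def main(args: Array[String]): Unit = {
    // Assumed to be the classes of interest for this issue.
    val probes = Seq(
      "org.apache.spark.sql.SparkSession",
      "org.apache.avro.Schema",
      "com.databricks.spark.avro.DefaultSource",
      "com.spotify.spark.bigquery.BigQuerySQLContext"
    )
    probes.foreach(name => println(s"$name -> ${jarOf(name)}"))
  }
}
```

Submitted the same way as the failing job, this prints one line per class, which shows whether Avro and Spark resolve to the cluster's jars or to copies bundled in the application jar.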