spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

spark-bigquery jar is not usable #30

Closed ravwojdyla closed 7 years ago

ravwojdyla commented 7 years ago

I'm trying to use spark-bigquery as a dependency by either:

scalaVersion  := "2.11.8"
val sparkVersion = "2.0.0"

resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "spotify" % "spark-bigquery" % "0.1.2-s_2.11"
  )

or

val sparkVersion = "2.0.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.spotify" %% "spark-bigquery" % "0.1.2"
  )

(Both resolved the dependency just fine.)

and when I try to use the classes/methods provided by spark-bigquery, I get an error in both cases. Both errors are manifestations of the same problem: in Spark 2.0, org.apache.spark.sql.DataFrame is not a class but a type alias:

type DataFrame = org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

Unfortunately, the jar provided by maven/spark-packages does not resolve the alias; its compiled code references org.apache.spark.sql.DataFrame directly. My project therefore fails to load org.apache.spark.sql.DataFrame as a class, because no such class exists (and none should).
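The alias-erasure behavior is easy to demonstrate without Spark at all. A minimal pure-Scala sketch (AliasDemo, Table, and bigQueryTable are illustrative stand-ins for the Spark names) showing that a type alias leaves no trace in the compiled bytecode: reflection sees only the underlying class.

```scala
object AliasDemo {
  // A type alias, analogous to `type DataFrame = Dataset[Row]` in Spark 2.0.
  type Table = Vector[String]

  // Declared to return the alias; the compiled signature uses Vector.
  def bigQueryTable(): Table = Vector("row")

  // The reflected return type of the compiled method: the underlying class,
  // not the alias (aliases are erased at compile time).
  def compiledReturnType: Class[_] =
    getClass.getMethod("bigQueryTable").getReturnType
}
```

This is why a jar compiled against Spark 2.0 correctly references Dataset in its bytecode, as the local build above shows.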

There seems to be a problem in the way spark-packages builds/publishes jars. If I package spark-bigquery locally and decompile it (for example BigQuerySQLContext), we can see that the locally compiled code does in fact use org.apache.spark.sql.Dataset:

➜  jar xf spark-bigquery_2.11-0.2.0-SNAPSHOT.jar
➜  javap -v com/spotify/spark/bigquery/package\$BigQuerySQLContext.class | grep bigQueryTable | grep NameAndType
   #87 = NameAndType        #85:#86       // bigQueryTable:(Lcom/google/api/services/bigquery/model/TableReference;)Lorg/apache/spark/sql/Dataset;

while the jar from maven/spark-packages still uses org.apache.spark.sql.DataFrame in the compiled code:

➜  jar -xf spark-bigquery-0.1.1-s_2.11.jar
➜  javap -v com/spotify/spark/bigquery/package\$BigQuerySQLContext.class | grep bigQueryTable | grep NameAndType
   #79 = NameAndType        #77:#78       // bigQueryTable:(Lcom/google/api/services/bigquery/model/TableReference;)Lorg/apache/spark/sql/DataFrame;

At this point org.apache.spark.sql.DataFrame is expected to be a class, so the classloader fails to resolve it and throws an error.
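The classloader failure can also be reproduced without Spark. A minimal sketch (the canLoad helper is mine, not part of any library) showing that a name backed only by a type alias cannot be loaded as a class, since aliases emit no .class file:

```scala
object ClassLoadDemo {
  // Returns true if `name` resolves to a real class on the current classpath.
  def canLoad(name: String): Boolean =
    try { Class.forName(name); true }
    catch { case _: ClassNotFoundException => false }
}
```

On a Spark 2.0 classpath, canLoad("org.apache.spark.sql.Dataset") succeeds while canLoad("org.apache.spark.sql.DataFrame") fails, which is the failure the stale jar triggers.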

One way to solve this might be to use Dataset explicitly in the source, but honestly this seems like something that should be fixed in spark-packages.

ravwojdyla commented 7 years ago

@brkyvz , could you please assist here? What is the process of building/publishing spark packages jars? Is the build process open source somewhere?

brkyvz commented 7 years ago

@ravwojdyla It seems to me that the 0.1.2 version of Spark BigQuery was built and published against Spark 1.6, and is not compatible with Spark 2.0. Either the maintainers should publish a new version of the library compiled against Spark 2.0, or you should use Spark 1.6 in your project. Hence, I don't think this is a Spark Packages specific problem.

If you would like to use the library with Spark 2.0, and the code is source compatible, you may build the library from source and use that build in your project.
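A sketch of that workaround, assuming the repository builds with sbt and the source is compatible with Spark 2.0 (the version numbers below are illustrative, not published artifacts):

```shell
# Build spark-bigquery from source and publish it to the local ivy repository.
git clone https://github.com/spotify/spark-bigquery.git
cd spark-bigquery
sbt ++2.11.8 publishLocal

# Then depend on the locally published snapshot in build.sbt, e.g.:
#   libraryDependencies += "com.spotify" %% "spark-bigquery" % "0.2.0-SNAPSHOT"
```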

ravwojdyla commented 7 years ago

@brkyvz :man_facepalming: you are absolutely right. thanks for pointing that out, and sorry for bothering you. Closing this.