samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

Minimal working example #49

Closed: gbordyugov closed this issue 6 years ago

gbordyugov commented 6 years ago

Hi Sam,

being new to Spark, I'm wondering whether you happen to have a minimal working example, including a complete set of dependencies.

Best, Grisha

samelamin commented 6 years ago

You should find a simple setup for both PySpark and Scala on the README page.
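
For reference, a minimal Scala sketch along the lines of the README (the project id, bucket, and query are placeholders, and the exact method names beyond setGcpJsonKeyFile may vary between connector versions, so check the README for your version):

import com.samelamin.spark.bigquery._

// Assumes a SQLContext named sqlContext is already in scope (e.g. in spark-shell or an sbt console).
sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")
sqlContext.setBigQueryProjectId("<YOUR_PROJECT_ID>")
sqlContext.setBigQueryGcsBucket("<YOUR_GCS_BUCKET>")

// Run a query against BigQuery and get the result back as a DataFrame.
val df = sqlContext.bigQuerySelect(
  "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")
df.show()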

gbordyugov commented 6 years ago

Ok, I'm looking into the SBT-based example. When I'm trying to execute

sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")

with the appropriate file name, I'm getting the following error message:

error: missing or invalid dependency detected while loading class file 'package.class'.
Could not access type DataFrame in package org.apache.spark.sql.package,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with '-Ylog-classpath' to see the problematic classpath.)
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.package.

I guess my dependencies are not complete. Here is my build.sbt:

resolvers += Opts.resolver.sonatypeReleases

scalaVersion := "2.11.8"

libraryDependencies += "com.github.samelamin" %% "spark-bigquery" % "0.2.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"

initialCommands in console := s"""
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
// import com.samelamin.spark.bigquery._

val sc = new SparkContext("local[*]", "shell")
val sqlContext = new SQLContext(sc)
"""
samelamin commented 6 years ago

That's because it can't find the Spark dependencies. Are you sure you have Spark installed locally?

If not, you can just get rid of the "provided" flag.

gbordyugov commented 6 years ago

No, I don't have Spark installed. And the error persists w/o the "provided" flag. My current build.sbt:

resolvers += Opts.resolver.sonatypeReleases

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.github.samelamin" %% "spark-bigquery" % "0.2.2",
  "org.apache.spark"     %% "spark-core"     % "1.6.1",
  "org.apache.spark"     %% "spark-sql"      % "1.6.1")

initialCommands in console := s"""
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
// import com.samelamin.spark.bigquery._

val sc = new SparkContext("local[*]", "shell")
val sqlContext = new SQLContext(sc)
"""
samelamin commented 6 years ago

My best suggestion is to read up online on getting Spark installed locally; you cannot use the connector if you do not have Spark installed.

There are a lot of tutorials online on how to get Spark up and running.

gbordyugov commented 6 years ago

Spark seems to be up and running: I can do something like val a = sc.textFile("build.sbt"); a.count() and it spits out the correct count. I can see the web frontend on localhost:4040 too.

samelamin commented 6 years ago

The error is basically saying it cannot load the Spark DataFrame class, so it's a problem with the environment.

Perhaps you are using a different version of Spark?

Can you post the full stack trace?

gbordyugov commented 6 years ago

That's all I get:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/22 15:08:26 INFO SparkContext: Running Spark version 1.6.1
17/11/22 15:08:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/22 15:08:27 INFO SecurityManager: Changing view acls to: gbordyugov
17/11/22 15:08:27 INFO SecurityManager: Changing modify acls to: gbordyugov
17/11/22 15:08:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(gbordyugov); users with modify permissions: Set(gbordyugov)
17/11/22 15:08:27 INFO Utils: Successfully started service 'sparkDriver' on port 51363.
17/11/22 15:08:28 INFO Slf4jLogger: Slf4jLogger started
17/11/22 15:08:28 INFO Remoting: Starting remoting
17/11/22 15:08:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.166.106.5:51364]
17/11/22 15:08:28 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 51364.
17/11/22 15:08:28 INFO SparkEnv: Registering MapOutputTracker
17/11/22 15:08:28 INFO SparkEnv: Registering BlockManagerMaster
17/11/22 15:08:28 INFO DiskBlockManager: Created local directory at /private/var/folders/hn/ps7r027n7h59fktcxdlhp53c2pv4z0/T/blockmgr-871cdd8d-c0d2-4492-b050-f0959ca9e629
17/11/22 15:08:28 INFO MemoryStore: MemoryStore started with capacity 491.6 MB
17/11/22 15:08:28 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/22 15:08:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/22 15:08:28 INFO SparkUI: Started SparkUI at http://10.166.106.5:4040
17/11/22 15:08:28 INFO Executor: Starting executor ID driver on host localhost
17/11/22 15:08:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51365.
17/11/22 15:08:28 INFO NettyBlockTransferService: Server created on 51365
17/11/22 15:08:28 INFO BlockManagerMaster: Trying to register BlockManager
17/11/22 15:08:28 INFO BlockManagerMasterEndpoint: Registering block manager localhost:51365 with 491.6 MB RAM, BlockManagerId(driver, localhost, 51365)
17/11/22 15:08:28 INFO BlockManagerMaster: Registered BlockManager
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@249e3e3e
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@40c6f806
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151).
Type in expressions for evaluation. Or try :help.

scala> import com.samelamin.spark.bigquery._
import com.samelamin.spark.bigquery._

scala> sqlContext.setGcpJsonKeyFile("bq_credentials.json")
error: missing or invalid dependency detected while loading class file 'package.class'.
Could not access type DataFrame in package org.apache.spark.sql.package,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'package.class' was compiled against an incompatible version of org.apache.spark.sql.package.

scala>
samelamin commented 6 years ago

Are you using Spark 2.1? I suggest building an uber jar first using 'sbt clean assembly'.
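
'sbt clean assembly' needs the sbt-assembly plugin; a minimal sketch, assuming a recent plugin version, is to add this to project/plugins.sbt:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

Then 'sbt clean assembly' produces a single jar under target/scala-2.11/ that bundles the connector with your code.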

gbordyugov commented 6 years ago

I'm using an older version (1.6.1) of Spark, since my company cluster runs that version.

As a matter of fact, import org.apache.spark.sql.DataFrame succeeds, but the type doesn't appear to be visible when I'm running sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>").

samelamin commented 6 years ago

This package is targeted at Spark 2.1; it will not work with 1.6, sorry.
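
For reference, a build.sbt sketch for a Spark 2.x setup that should work with the connector (the exact Spark version here is an assumption; check the README for the versions the connector is built against):

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark 2.x on the classpath (add % "provided" if Spark is supplied by the cluster)
  "org.apache.spark"     %% "spark-core"     % "2.1.0",
  "org.apache.spark"     %% "spark-sql"      % "2.1.0",
  "com.github.samelamin" %% "spark-bigquery" % "0.2.2"
)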

gbordyugov commented 6 years ago

Oh, I see. I thought you had implicitly accepted 1.6.1 as a valid version from the pasted build.sbt.

Thanks for clearing this up!