samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

File not found + HiveContext · Issue #8

Closed: brencklebox closed this issue 7 years ago

brencklebox commented 7 years ago

Hi,

I'm trying to use the connector you wrote to read/write from BigQuery on Databricks. I very much appreciate the effort to make Databricks and BigQuery play nicely together. I'm trying to get this to work on a minimal Databricks cluster (just one driver and one worker), using Spark 2.1 and Scala 2.11 (not sure if those are requirements, but they seemed like safe bets).

I've followed the instructions in Databricks.md to install the library successfully, and modified my init script to copy the credentials file from S3 to /home/ubuntu/databricks/filename.json. After setup, I am able to import the module correctly by running the code provided in the markdown file.
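For completeness, my setup cell (copied from Databricks.md, with placeholders for my own values and assuming the setter names shown there) looks roughly like this:

import com.samelamin.spark.bigquery._

// Point the connector at the service account key copied in by the init script
spark.sqlContext.setGcpJsonKeyFile("/home/ubuntu/databricks/filename.json")

// Billing project and a GCS bucket for staging data (placeholder values)
spark.sqlContext.setBigQueryProjectId("my-project")
spark.sqlContext.setBigQueryGcsBucket("my-staging-bucket")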

The issue comes when I try to read a table. Based on the suggestion in readme.md I'm trying:

val table = spark.sqlContext.bigQueryTable("project:dataset.table")

and I get the following response:

<console>:37: error: value bigQueryTable is not a member of org.apache.spark.sql.SQLContext
       val table = spark.sqlContext.bigQueryTable("project:dataset.test")

The same thing happens if I use a public dataset, or if I copy a public table into my own dataset.

I'm pretty new to Scala (I mostly use PySpark), so I could be missing something obvious here, but shouldn't the module be adding bigQueryTable to the Hive context?

samelamin commented 7 years ago

Thanks for the kind words!

That sounds to me like the jars aren't attached to the cluster; Spark can't find a reference to the BigQuery classes.

Are you importing the libraries in your notebook?

Does the import command run successfully?
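For context, methods like bigQueryTable are bolted onto SQLContext through a Scala implicit class, so they only become visible once the connector's import is in scope. A rough sketch of the pattern (illustrative names only, not the actual source):

import org.apache.spark.sql.{DataFrame, SQLContext}

object BigQueryImplicits {
  // Hypothetical extension methods; the real ones live in the connector's package
  implicit class BigQuerySQLContext(sqlContext: SQLContext) {
    def bigQueryTable(tableRef: String): DataFrame =
      ??? // the connector would build a DataFrame from the BigQuery table here
  }
}

// With `import BigQueryImplicits._` in scope the call compiles; without it
// the compiler reports exactly the error you're seeing:
// "value bigQueryTable is not a member of org.apache.spark.sql.SQLContext"

That's why a missing import (or a missing jar) produces that "is not a member" error.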

I'd try it out with the Scala code and see if it works. I'm away from my laptop now but can have a look tonight at the PySpark part.

brencklebox commented 7 years ago

Sorry for the confusion: I'm doing all of this in Scala, just copying the code from the READMEs. I haven't tried to introduce any Python yet.

I do have an update: writing to BigQuery seems to work with:

val numDS = spark.range(5, 100, 5)
val df = numDS.describe()
df.saveAsBigQueryTable("project:test.random")

which does in fact create the correct "random" table in the test dataset of my project. So the issue seems to be limited to the bigQueryTable read function.

samelamin commented 7 years ago

Ah, OK, that clears things up. I'll have a look at bigQueryTable when I get home.

In the meantime, you can probably use the SQL select function to run a SQL statement over the entire table.

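Something along these lines, assuming the bigQuerySelect method from the README (the table reference is a placeholder, in BigQuery's legacy SQL bracket syntax):

// Run a SQL statement over the whole table instead of calling bigQueryTable
val table = spark.sqlContext.bigQuerySelect("SELECT * FROM [project:dataset.table]")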

samelamin commented 7 years ago

@brencklebox I was able to reproduce the issue. I have just released a fix, v0.1.4, to Spark Packages; you should see it shortly.
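If you attached the library from Spark Packages, updating the cluster's library coordinate should pull in the new version. I believe the coordinate is along these lines, but double-check the exact string on the package page:

samelamin:spark-bigquery:0.1.4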

Give it a go and give me a shout if you have any more issues.

Thanks for raising this!

I'd really appreciate it if you rated the package when you get a chance.

samelamin commented 7 years ago

@brencklebox are you still facing any issues?

brencklebox commented 7 years ago

Nope!

Everything is working perfectly, thanks!

samelamin commented 7 years ago

Perfect! If you get a chance, please do rate the package!