samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

Import Error with Databricks from a table with streaming #53

Closed: yohannnyc closed this issue 6 years ago

yohannnyc commented 6 years ago

Hi,

In November of last year, my colleague mayankshah891 raised an issue (#48). We are trying to import data from a BigQuery table while data is streaming into it, and we randomly get errors like those described in that issue.

I have noticed that printing the (almost) full table after caching it helps:

```scala
val table = sqlContext.bigQueryTable("bigqueryprojectid:blabla.name_table").cache()
table.show(100000)
```

This presumably forces Spark to persist the table, so no further connection to BigQuery is required.
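A less brute-force way to materialize the cached table might be to run a cheap action over every partition instead of printing 100,000 rows. This is only a sketch, reusing the `sqlContext.bigQueryTable` reader from above; `count()` is a standard Spark action, not anything specific to this library:

```scala
// Sketch: force full materialization of the cached table without printing rows.
// count() scans all partitions, so every row is pulled from BigQuery once
// and subsequent reads should hit Spark's cache instead of BigQuery.
val table = sqlContext
  .bigQueryTable("bigqueryprojectid:blabla.name_table")
  .cache()

table.count() // triggers the full scan that populates the cache
```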

We closed issue #48 after noting that data was streaming into our BigQuery table, which seemed to explain the problem. New data arrives quite frequently (at least every 5 minutes).

Could you confirm that this is the cause of our problem, and do you have a more scientific way of getting around it?

Thank you so much for your help. Truly appreciated.

samelamin commented 6 years ago

Yup, it's because the table is changing mid-process. By caching, you are freezing (snapshotting) the data, I believe.
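One caveat with relying on `cache()` alone: cached partitions can be evicted under memory pressure and recomputed later, and recomputation would re-read the live BigQuery table, reintroducing the inconsistency. A sketch of a more durable snapshot, using plain Spark APIs rather than anything from this library; the output path is hypothetical:

```scala
// Sketch: write the table to durable storage once, then read the frozen copy.
// Unlike an in-memory cache, this snapshot cannot be silently recomputed
// against the live (and changing) BigQuery table.
val snapshotPath = "/tmp/bigquery_snapshot" // hypothetical location

sqlContext
  .bigQueryTable("bigqueryprojectid:blabla.name_table")
  .write
  .mode("overwrite")
  .parquet(snapshotPath)

// All downstream processing reads the snapshot, never BigQuery directly.
val table = sqlContext.read.parquet(snapshotPath)
```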