spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Impossible to use spark-bigquery through Datalab (Python notebook) deployed on a Dataproc #39

Open oliviernguyenquoc opened 7 years ago

oliviernguyenquoc commented 7 years ago

Hi everyone,

If I understand correctly, I can use this package with PySpark. Is that correct?

Moreover, I am not able to import the package in my Python notebook (Datalab): I get a "No module found" error.

I have tried to set up an initialization action in Dataproc (https://cloud.google.com/dataproc/docs/concepts/init-actions), without success.

Any help?

richwhitjr commented 7 years ago

It is possible to use this library from PySpark, but it is not straightforward right now. You have to go through the Py4J gateway to set up the classes on the JVM side and then wrap the returned DataFrame manually.

Could you give an example of how you are trying to use it? A good place to start would be to check that you can call the Java classes from Python and that they are loaded correctly:

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

# Reach through the Py4J gateway to the Scala BigQuerySQLContext,
# passing in the JVM SQLContext that backs this session.
bq = session._sc._jvm.com.spotify.spark.bigquery.BigQuerySQLContext(session._wrapped._jsqlContext)
bq.setBigQueryDatasetLocation(...)   # e.g. "US" or "EU"
bq.setBigQueryProjectId(...)         # GCP project to bill queries to
bq.setBigQueryGcsBucket(...)         # GCS bucket used for staging data
bq.setGcpJsonKeyFile(...)            # path to a service-account JSON key

If that works, you should be able to query by calling into the Scala function and wrapping the returned JVM DataFrame:

from pyspark.sql import DataFrame

# `sql` is a BigQuery SQL query string; the JVM DataFrame returned by
# bigQuerySelect is wrapped into a Python DataFrame bound to this session.
df = DataFrame(bq.bigQuerySelect(sql), session._wrapped)
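
For reference, here is how those pieces might fit together end to end. This is only a sketch, not part of the original answer: the project ID, dataset location, staging bucket, key file path, and the sample query are placeholder values, and the legacy-SQL table syntax in the query is an assumption about what this connector expects.

from pyspark.sql import SparkSession, DataFrame

session = SparkSession.builder.getOrCreate()

# Instantiate the Scala-side BigQuerySQLContext through the Py4J gateway.
bq = session._sc._jvm.com.spotify.spark.bigquery.BigQuerySQLContext(session._wrapped._jsqlContext)

# Placeholder configuration values; replace with your own project settings.
bq.setBigQueryProjectId("my-gcp-project")
bq.setBigQueryDatasetLocation("US")
bq.setBigQueryGcsBucket("my-staging-bucket")
bq.setGcpJsonKeyFile("/path/to/service-account-key.json")

# Run the query on the JVM side and wrap the returned JVM DataFrame
# in a Python DataFrame bound to this session's SQLContext.
sql = "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare] LIMIT 10"
df = DataFrame(bq.bigQuerySelect(sql), session._wrapped)
df.show()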