spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Impossible to use spark-bigquery through Datalab (Python notebook) deployed on a Dataproc #39

Open oliviernguyenquoc opened 7 years ago

oliviernguyenquoc commented 7 years ago

Hi everyone,

If I understand correctly, I can use this package with PySpark. Is that correct?

Moreover, I am not able to import the package in my Python notebook (Datalab): I get a "No module found" error.

I have tried to set up an initialization action in Dataproc (https://cloud.google.com/dataproc/docs/concepts/init-actions), without success.

Any help?

richwhitjr commented 7 years ago

It is possible to use this library from PySpark, but it is not straightforward right now. You have to go through the Py4J gateway to set up the classes on the JVM side and then wrap the returned DataFrame manually.

Could you give an example of how you are trying to use it? A good place to start would be to check that you can call the Java classes from Python and that they are loaded correctly:

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

# Reach through the Py4J gateway to the Scala BigQuerySQLContext,
# passing in the JVM SQLContext that backs this session.
bq = session._sc._jvm.com.spotify.spark.bigquery.BigQuerySQLContext(session._wrapped._jsqlContext)
bq.setBigQueryDatasetLocation(...)   # e.g. "US" or "EU"
bq.setBigQueryProjectId(...)         # GCP project to bill queries to
bq.setBigQueryGcsBucket(...)         # GCS bucket used for staging data
bq.setGcpJsonKeyFile(...)            # path to a service-account JSON key

If that works, you should be able to query by calling into the Scala function and wrapping the returned JVM DataFrame:

from pyspark.sql import DataFrame

# `sql` is a BigQuery SQL query string; the JVM DataFrame returned by
# bigQuerySelect is wrapped into a Python DataFrame bound to this session.
df = DataFrame(bq.bigQuerySelect(sql), session._wrapped)
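
For reference, here is how those pieces might fit together end to end. This is only a sketch, not part of the original answer: the project ID, dataset location, staging bucket, key file path, and the sample query are placeholder values, and the legacy-SQL table syntax in the query is an assumption about what this connector expects.

from pyspark.sql import SparkSession, DataFrame

session = SparkSession.builder.getOrCreate()

# Instantiate the Scala-side BigQuerySQLContext through the Py4J gateway.
bq = session._sc._jvm.com.spotify.spark.bigquery.BigQuerySQLContext(session._wrapped._jsqlContext)

# Placeholder configuration values; replace with your own project settings.
bq.setBigQueryProjectId("my-gcp-project")
bq.setBigQueryDatasetLocation("US")
bq.setBigQueryGcsBucket("my-staging-bucket")
bq.setGcpJsonKeyFile("/path/to/service-account-key.json")

# Run the query on the JVM side and wrap the returned JVM DataFrame
# in a Python DataFrame bound to this session's SQLContext.
sql = "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare] LIMIT 10"
df = DataFrame(bq.bigQuerySelect(sql), session._wrapped)
df.show()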