samelamin / spark-bigquery

Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Apache License 2.0

Databricks - GCP Key Missing in Multi-Node Cluster #18

Closed: kurtmaile closed this 7 years ago

kurtmaile commented 7 years ago

Hi Sam,

I ran into an interesting issue today. I'm not sure of the cause but suspect it's me - it only just happened, so I need to do some deeper digging, but I thought I'd reach out in case you can quickly spot a rookie mistake and help.

My notebook in the community edition works great and writes my dataframes out to the designated BigQuery tables as expected. All good.

I've recently subscribed to a paid Databricks account, and when I run the exact same code on a simple 2-node cluster (1 driver + 1 worker) I get an exception that the Google key file cannot be located.

https://www.dropbox.com/s/6kpsqsgqkb2tp2a/Screenshot%202017-05-09%2019.28.39.png?dl=0

but the key is there, as can be seen here:

https://www.dropbox.com/s/yzviehngyhlz5e6/Screenshot%202017-05-09%2019.29.43.png?dl=0

As mentioned, it's exactly the same code in both editions. For prototyping, I set up my key in the notebook by saving the JSON key string to the local file system using:

dbutils.fs.put("file:/gcpkey.json",gcpKey)

and referenced it in the BQ library as "/gcpkey.json" - which, as mentioned, works fine in the community edition.
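
For reference, the full setup in the notebook is roughly the following (a sketch only - the setGcpJsonKeyFile / setBigQueryProjectId helpers follow the library README, and the project value is a placeholder, not something from this issue):

import com.samelamin.spark.bigquery._

// Write the JSON key string to the local file system of the node running the notebook
dbutils.fs.put("file:/gcpkey.json", gcpKey)

// Point the connector at the key file (helper names assumed from the README)
spark.sqlContext.setGcpJsonKeyFile("/gcpkey.json")
spark.sqlContext.setBigQueryProjectId("my-billing-project")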

The difference between community and the paid subscription, as you know, is that on community the driver and worker run on a single physical node. My guess at the moment (as this only just happened) is that the key file isn't on the worker in a multi-node environment - does this dbutils command only save to the driver node, so I need to distribute the file to the worker myself? The documentation on it isn't so clear.

I assume you've had this working on a multi-node cluster on Databricks, so no doubt it's a rookie mistake on my part! Is the issue that it only writes to the driver node, or something else? What are your best practices for distributing a sensitive key like this securely?

I'm assuming there is a better mechanism for production, using the Databricks REST API to bootstrap a cluster with the key (to be investigated), but for now I was just using this simple approach.

Thoughts?

Thanks heaps for your help! Cheers

vijaykramesh commented 7 years ago

Not sure how to accomplish this in a notebook, but I've found I have to include my JSON key with --files /path/to/key.json when running spark-submit. That should get it onto all the worker nodes, where it is readable.
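
When submitting outside a notebook, that looks something like this (a sketch - the class and jar names are placeholders; --files ships each listed file to the executors' working directories, where it is available under its base name):

# key.json ends up in each executor's working directory
spark-submit \
  --class com.example.MyBigQueryJob \
  --files /path/to/key.json \
  my-job-assembly.jar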

samelamin commented 7 years ago

Hi guys

@kurtmaile you are absolutely right, the worker node cannot find the GCP key file.

What you can do is create an init script that writes the credentials file onto every node when the cluster is created.

Here is an example script:

dbutils.fs.rm("dbfs:/databricks/init/copy-google-key-json.sh",true)
dbutils.fs.put("/databricks/init/copy-google-key-json.sh","""
#!/bin/bash 
cat <<EOF > /databricks/key.json
{paste json here}
EOF
""")

I hope this helps!

kurtmaile commented 7 years ago

Hi guys,

Awesome, thanks for this info, it helps greatly! I will give the init script feature a go today. Much appreciated once again for the speedy response.

Cheers