spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Is the environment variable required? #41

Open fmsbeekmans opened 7 years ago

fmsbeekmans commented 7 years ago

Locally sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>") works, but on a spark-yarn-2.0.1 cluster I'm getting Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

Is the env variable required for use in the cluster or is it not finding the file and using some kind of fallback mechanism?
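
For what it's worth, a quick way to check where the variable is (or isn't) visible is something like the following (a minimal diagnostic sketch, assuming an existing SparkContext sc; not part of spark-bigquery):

    // Minimal diagnostic sketch (assumes an existing SparkContext `sc`):
    // check whether GOOGLE_APPLICATION_CREDENTIALS is visible on the driver
    // and on the executor JVMs.
    val onDriver = sys.env.getOrElse("GOOGLE_APPLICATION_CREDENTIALS", "<unset>")
    println(s"driver: $onDriver")

    val onExecutors = sc
      .parallelize(1 to 100)
      .map(_ => sys.env.getOrElse("GOOGLE_APPLICATION_CREDENTIALS", "<unset>"))
      .distinct()
      .collect()
    println(s"executors: ${onExecutors.mkString(", ")}")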

ravwojdyla commented 7 years ago

Hi @fydio. Take a look at this https://github.com/spotify/spark-bigquery/issues/12

fmsbeekmans commented 7 years ago

Hey, I'm not using Databricks. I've read through the solution but I can't get it to work just yet. I've copied the key file to the Spark machines over SSH, and the Spark users can read the file. I've tried using the core-site.xml file and supplying the --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS argument to spark-submit, but I keep getting

java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
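
For completeness, the core-site.xml route from #12 is equivalent to setting the same keys on the Hadoop configuration programmatically, for example (a minimal sketch; the property names are assumptions based on the gcs-connector settings that also appear later in this thread, and may vary by connector version):

    // Sketch: point the connector at the service-account key file via the Hadoop
    // configuration on the driver instead of an environment variable.
    // Property names are assumed from the gcs-connector documentation.
    sc.hadoopConfiguration.set("google.cloud.auth.service.account.enable", "true")
    sc.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "/home/spark/account.json")
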
ravwojdyla commented 7 years ago

@fydio it doesn't matter much if you use Databricks or not. Could you show a minimal example of how you use the client? Is the error coming from the driver or the executors?

fmsbeekmans commented 7 years ago

A very stripped-down piece of the code:

    import com.spotify.spark.bigquery._
    import sqlContext.implicits._

    // Point the connector at the service-account key and BigQuery settings (driver side)
    sqlContext.setGcpJsonKeyFile(accountJsonPath)
    sqlContext.setBigQueryGcsBucket(bigQueryTemporaryBucket)
    sqlContext.setBigQueryDatasetLocation(bigQueryDatasetLocation)
    sqlContext.setBigQueryProjectId(gcpProjectId)

    // Encode, convert to a DataFrame, and write to BigQuery
    rdd
      .map(encodeWithSqlTimestamp)
      .toDF
      .saveAsBigQueryTable(
        "project:dataset.table"
      )

Submitted with

spark-submit \
  --master spark://master \
  --class JobRunner \
  --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS="/home/spark/account.json" \
  --driver-memory 6G \
  --executor-memory 6G \
  --total-executor-cores 12 \
  --executor-cores 4 \
  --deploy-mode cluster \
  --driver-java-options "-Daws.access_key=*** -Daws.secret_key=*** -Dspark.worker.cleanup.enabled=false" \
  s3a://bucket/job.jar

And it looks like it's coming from the driver. Here's how I got to the logs: Running Applications > Executor Summary > Finished Drivers > driver-20170901162346-0073 (the first one on the list) > stderr

holamap commented 6 years ago

I am facing the same issue. I tried setting spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/jsonkeypath as well as --conf spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS=/jsonkeypath. In both cases the same error fails the Spark job. What is the standard way to set this if I am using spark.master yarn?

KumariTu270 commented 4 years ago

I'm also facing the same issue while running locally. Can you please help me, @fmsbeekmans?

fmsbeekmans commented 4 years ago

It's been a while so I'm not a hundred percent sure. I think we ended up using Google's managed solution instead.

gbhattachan commented 3 years ago

I have got a working Spark job in Hadoop with the following JSON key file path settings. In my case, though, the JSON key file is in the Hadoop cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "BigQueryDataLoader" \
  --class com.xxx.xx \
  --driver-memory 4G \
  --jars ${HDFS_JAR_PATH}/gcs-connector-1.9.4-hadoop2.jar,${HDFS_JAR_PATH}/spark-bigquery-with-dependencies_2.11-0.19.1.jar \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --files ${FILE_PATH_TO_JSON_KEY} \
  --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=${JSON_KEY_FILE_NAME} \
  --conf spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS=${JSON_KEY_FILE_NAME} \
  /export/home/XX/MyJarFile.jar
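
The --files flag makes YARN copy the key file into each container's working directory, which is presumably why the bare ${JSON_KEY_FILE_NAME} (rather than a full path) works for both the application-master and executor environment variables. A quick sanity check from inside the job could look like this (an illustrative sketch, not from the original command):

    // Illustrative sketch: confirm the localized key file is readable in the
    // container's working directory (where YARN places files shipped via --files).
    import java.io.File
    val keyFile = new File(sys.env.getOrElse("GOOGLE_APPLICATION_CREDENTIALS", "<unset>"))
    require(keyFile.canRead, s"cannot read key file at ${keyFile.getAbsolutePath}")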