spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

RedshiftException: Something unusual has occurred to cause the driver to fail when writing to Redshift via PySpark #125

Open RusabKhan opened 1 year ago

RusabKhan commented 1 year ago

Below is my code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('redshift')\
    .config("spark.jars.packages","com.google.guava:guava:31.1-jre,com.eclipsesource.minimal-json:minimal-json:0.9.5,com.amazon.redshift:redshift-jdbc42:2.1.0.12,com.google.guava:guava:31.1-jre,com.amazonaws:aws-java-sdk:1.12.439,org.apache.spark:spark-avro_2.12:3.3.2,io.github.spark-redshift-community:spark-redshift_2.12:5.1.0,org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", "etc") \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')\
    .config("spark.hadoop.fs.s3a.secret.key", "etc") \
    .getOrCreate()

import os 

os.environ["AWS_ACCESS_KEY_ID"] = "etc"
os.environ["AWS_SECRET_ACCESS_KEY"] = "etc"

df.write \
  .format("io.github.spark_redshift_community.spark.redshift") \
  .option("url", "jdbc:redshift:iam://host:5439/dev?user=user&password=pass") \
  .option("dbtable", "demo") \
  .option("aws_iam_role", "arn:aws:iam::etc") \
  .option("tempdir", "s3a://etc") \
  .mode("append") \
  .save()

This throws the following exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o62.save.
: com.amazon.redshift.util.RedshiftException: Something unusual has occurred to cause the driver to fail. Please report this exception:Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), WebIdentityTokenCredentialsProvider: You must specify a value for roleArn and roleSessionName, com.amazonaws.auth.profile.ProfileCredentialsProvider@2c690608: profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@708c38d5: Failed to connect to service endpoint: ]

I have also tried setting forward_spark_s3_credentials to true. The role has full permissions on Redshift and S3. I have written multiple files and run the COPY command manually via the psycopg2 library; however, with Spark it doesn't work.
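For what it's worth, the stack trace above comes from the Redshift JDBC driver itself: with a jdbc:redshift:iam:// URL the driver has to resolve AWS credentials on its own (through the default provider chain) before it can request temporary cluster credentials, independently of the fs.s3a settings. Below is a minimal sketch of the same write using a plain (non-IAM) JDBC URL together with the connector's documented forward_spark_s3_credentials option; the host, table, bucket and credential values are placeholders carried over from the snippet above, not verified.

# Sketch only: plain user/password JDBC auth (so the driver itself needs no AWS
# credentials) and forward_spark_s3_credentials so the keys already configured
# on the Spark side are reused for the COPY from tempdir.
df.write \
  .format("io.github.spark_redshift_community.spark.redshift") \
  .option("url", "jdbc:redshift://host:5439/dev?user=user&password=pass") \
  .option("dbtable", "demo") \
  .option("forward_spark_s3_credentials", "true") \
  .option("tempdir", "s3a://etc") \
  .mode("append") \
  .save()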

Environment: Spark 3.2.3, Scala 2.12, PySpark 3.2.3, Java 11, Ubuntu

88manpreet commented 1 year ago

What happens when you try com.amazonaws.auth.EnvironmentVariableCredentialsProvider instead?

RusabKhan commented 1 year ago

.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')

With the above setting, at first it wasn't picking up the environment variables set through code:

os.environ["AWS_ACCESS_KEY_ID"] = 'etc'
os.environ["AWS_SECRET_KEY"] = 'etc'

Then I set the variables with the export command in the terminal and retried, which resulted in the error below.

23/04/03 16:44:35 WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
23/04/03 16:45:39 ERROR RedshiftWriter: Exception thrown during Redshift load; will roll back transaction
java.sql.SQLException: Exception thrown in awaitResult: 
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:133)
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:109)
.......
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: Query cancelled.
        at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2601)
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2269)

The database is accessible: I can write to it via the psycopg2 library, and I have even run some COPY commands after writing to S3 via boto3.
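(If it helps to dig into the "ERROR: Query cancelled" during the load, one diagnostic sketch, assuming the same psycopg2 connection details already in use, is to check Redshift's stl_load_errors system table right after a failed write; the connection values below are placeholders:)

import psycopg2

# Placeholder connection parameters; reuse whatever already works with psycopg2.
conn = psycopg2.connect(host="host", port=5439, dbname="dev",
                        user="user", password="pass")
with conn, conn.cursor() as cur:
    # Most recent COPY load errors, newest first.
    cur.execute("""
        SELECT starttime, filename, line_number, colname, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)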

smoy commented 1 year ago

@RusabKhan I just turned on secret scanning in the repository, and GitHub reported a possible secret leak from when you initially opened this issue. Please make sure you rotate your tokens if they are still active. Even though you edited the comment, GitHub still keeps a record of the original content.