spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

Exception Status Code: 404, AWS Service: Amazon S3, when filtering result of dataframe is zero #43

Open jimmymaise opened 5 years ago

jimmymaise commented 5 years ago

After querying Redshift, I have a dataframe with only one record:

+------------------+--------------------+-------
|         accountid|         accountname|rangeid
+------------------+--------------------+-------
|             00139|Arizona Public Se...|   null
+------------------+--------------------+-------

If I filter to count the non-empty, non-null values of accountid, it works: I get the expected result, 1.

If I apply the same filter to rangeid, it fails. Instead of a count I get this exception:

: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS Service: Amazon S3, AWS Request ID: CFAC029B1646FCAF, AWS Error Code: NoSuchKey, AWS Error Message: The specified key does not exist., S3 Extended Request ID: jhq1m9AxK2iE1ICqTaqheieXOT3xbJb9j8nB2FDVo3gAN+CegbOdKcL+sBYjp73XQd2rS6wOOTY=

The same happens whenever the filter returns zero rows. Code:

df.filter(df[column].isNotNull() & (df[column] != "")).count()

Even if I use show() instead of count(), I still get the same error.

Five days ago my code ran fine, but now it throws this exception.

If I save the dataframe to a local file and read it back, it works well. But I need to compare this metric between the dataframe read from Redshift and the dataframe read from a saved file, so I still need to count directly on the dataframe that comes from Redshift.

For now I'm using try/catch to assign the value 0 when the exception is raised, but that is not a good solution.
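A minimal sketch of that try/catch workaround, narrowed so that only the S3 "NoSuchKey" failure is treated as zero rows and every other error still propagates. The helper name `count_or_zero` and its `marker` parameter are hypothetical, and the sketch assumes the exception text contains the S3 error code, as it does in the trace above.

```python
def count_or_zero(count_fn, marker="NoSuchKey"):
    """Call count_fn(); return 0 only if the failure text mentions `marker`.

    `count_fn` is any zero-argument callable, for example
    lambda: df.filter(...).count(). Hypothetical helper, not part of
    spark-redshift.
    """
    try:
        return count_fn()
    except Exception as exc:
        # Assumption: the S3 error code appears in the exception message,
        # as in the trace quoted in this issue.
        if marker in str(exc):
            return 0
        # Anything else is a genuine failure and should not be hidden.
        raise
```

With the filter from this issue, the call would look like `count_or_zero(lambda: df.filter(df[column].isNotNull() & (df[column] != "")).count())`. This only masks the symptom; it does not fix the empty-result problem itself.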

lucagiovagnoli commented 5 years ago

Please check #40, this is a duplicate

smoy commented 5 years ago

I want to circle back to give some feedback to this issue.

One of our teams at Yelp was affected by this problem (empty result set) as well. After they upgraded their Redshift cluster to version 1.0.10936, the problem went away. The Redshift version available varies by region; I think you need at least version 1.0.10880.
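A rough way to check a cluster against that threshold is sketched below. It assumes the string returned by `SELECT version()` ends with something like `Redshift 1.0.10936`. The function names are hypothetical, and the minimum of 1.0.10880 is only the estimate given above, not a confirmed cutoff.

```python
import re

# Estimated minimum version from this thread; not an officially documented cutoff.
MIN_VERSION = (1, 0, 10880)


def redshift_version(version_string):
    """Extract the Redshift version as a tuple of ints, e.g. (1, 0, 10936)."""
    match = re.search(r"Redshift (\d+(?:\.\d+)*)", version_string)
    if match is None:
        raise ValueError("no Redshift version found in: %r" % version_string)
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_string, minimum=MIN_VERSION):
    """Return True if the parsed version is at least `minimum`."""
    # Tuples compare element by element, so (1, 0, 10936) >= (1, 0, 10880).
    return redshift_version(version_string) >= minimum
```

For example, `meets_minimum("PostgreSQL 8.0.2 on ..., Redshift 1.0.10936")` compares `(1, 0, 10936)` against the minimum and returns `True`.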