spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0

How does unloadData() guard against eventually consistent S3 lists? #74

Closed · nchammas closed this issue 3 years ago

nchammas commented 4 years ago

I'm looking at this code: https://github.com/spark-redshift-community/spark-redshift/blob/84ebe9d5186370794c1c1dc82db9dba15679f9f9/src/main/scala/io/github/spark_redshift_community/spark/redshift/RedshiftWriter.scala#L317-L319

How do we know that fs.listStatus() is listing all the files written to S3? There are no read-after-write consistency guarantees for S3 list operations:

A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.

It seems possible, then, for this code to list only a subset of the partition files. If so, the manifest would omit the missed non-empty partitions, and Redshift would not load them.
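
For context, here is a simplified sketch of the listing pattern in question; the helper name is illustrative and this is not the project's exact code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative sketch: list the S3 staging directory and keep the
// non-empty part files. Under an eventually consistent LIST, recently
// written files may not yet appear in listStatus() results.
def listStagedPartFiles(tempDir: String, conf: Configuration): Seq[Path] = {
  val dir = new Path(tempDir)
  val fs = dir.getFileSystem(conf)
  fs.listStatus(dir)
    .filter(s => s.getLen > 0 && s.getPath.getName.startsWith("part-"))
    .map(_.getPath)
    .toSeq
}
```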

In other words, the scenario I'm describing is:

  1. Spark stages the data on S3.
  2. The above quoted block of code attempts to list the non-empty partitions by querying S3.
  3. Because of S3's consistency model, some of the files generated in step 1 may not show up in the list operation from step 2.
  4. unloadData() generates a manifest that misses some of these files, so they don't get loaded into Redshift (see the sketch after this list).
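
To make step 4 concrete, the manifest handed to Redshift's COPY is just a JSON document naming the files to load; anything the LIST missed is silently absent from it. A hedged sketch with a hypothetical helper (the "entries"/"mandatory" layout is Redshift's documented manifest format):

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helper: build a COPY manifest from whatever the LIST
// returned. A part file missing from `listedFiles` is missing from
// "entries" too, and COPY loads only what the manifest names.
def buildManifest(listedFiles: Seq[Path]): String = {
  val entries = listedFiles
    .map(p => s"""{"url": "$p", "mandatory": true}""")
    .mkString(",\n    ")
  s"""{
  "entries": [
    $entries
  ]
}"""
}
```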

Relevant background reading: https://github.com/databricks/spark-redshift/issues/136#issuecomment-165236191

cherrera20 commented 4 years ago

Good news: https://aws.amazon.com/es/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

nchammas commented 3 years ago

That is great news. Now that S3 LIST operations are strongly consistent, a list issued after all writes have completed reflects every written object, so the failure mode described above can no longer occur. This eliminates the need for an output manifest from Spark for any downstream processes, including a load into Redshift.