From the Amazon S3 documentation on its consistency model:

> A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
I'm looking at this code: https://github.com/spark-redshift-community/spark-redshift/blob/84ebe9d5186370794c1c1dc82db9dba15679f9f9/src/main/scala/io/github/spark_redshift_community/spark/redshift/RedshiftWriter.scala#L317-L319
How do we know that `fs.listStatus()` is listing all of the files that were written to S3? There are no read-after-write consistency guarantees for S3 list operations (see the documentation excerpt quoted above). It seems to me that it's possible that this code could list only some of the partition files. If so, the manifest would not include the missed non-empty partitions, and Redshift would not load them.
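To make the failure mode concrete, here is a toy, self-contained model of an eventually consistent listing (my own illustration in plain Scala; the class and method names are hypothetical, and this is not the AWS SDK or the Hadoop `FileSystem` API): a key written just before a LIST may not yet be visible, so the listing silently returns a subset.

```scala
import scala.collection.mutable

// Toy model of a bucket whose LIST results lag writes by `propagationDelay`
// "ticks". Purely illustrative; real S3 propagation delay is nondeterministic.
final class EventuallyConsistentBucket(propagationDelay: Int) {
  private val writtenAt = mutable.Map.empty[String, Int] // key -> tick of write
  private var clock = 0

  def put(key: String): Unit = writtenAt(key) = clock
  def tick(): Unit = clock += 1

  // LIST only returns keys whose write has fully propagated.
  def list(): Set[String] =
    writtenAt.collect { case (k, t) if clock - t >= propagationDelay => k }.toSet
}

object Demo {
  def main(args: Array[String]): Unit = {
    val bucket = new EventuallyConsistentBucket(propagationDelay = 1)
    bucket.put("part-00000.avro")
    bucket.tick()
    bucket.put("part-00001.avro")

    // A listing taken immediately after the last write misses part-00001.avro,
    // so a manifest built from this listing would silently drop that partition.
    println(bucket.list()) // Set(part-00000.avro)

    bucket.tick()
    println(bucket.list()) // now both part files are visible
  }
}
```

The point of the sketch is that nothing in the partial listing looks like an error: the caller gets a perfectly valid (but incomplete) set of keys.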
In other words, the scenario I'm describing is: `unloadData()` generates a manifest that misses some of the partition files, so those partitions never get loaded into Redshift.

Relevant background reading: https://github.com/databricks/spark-redshift/issues/136#issuecomment-165236191
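For what it's worth, one way a writer could defend against this — a hedged sketch of my own, not something spark-redshift implements, and `awaitConsistentListing` and all of its parameters are hypothetical — is to retry the listing until every file the write tasks reported creating is visible, rather than trusting a single `fs.listStatus()` call:

```scala
import scala.annotation.tailrec

object ConsistentListing {
  // Retry `list` until it contains every expected key, or give up loudly.
  // `expected` would come from the write tasks themselves (they know which
  // files they created); `list` would wrap something like fs.listStatus().
  def awaitConsistentListing(
      expected: Set[String],
      list: () => Set[String],
      maxRetries: Int = 5
  ): Set[String] = {
    @tailrec
    def loop(attempt: Int): Set[String] = {
      val seen = list()
      if (expected.subsetOf(seen)) seen
      else if (attempt >= maxRetries)
        throw new IllegalStateException(
          s"Listing still missing files after $maxRetries retries: ${expected diff seen}")
      else loop(attempt + 1)
    }
    loop(0)
  }
}
```

The design point is that bounded retries convert a silent data-loss bug (an incomplete manifest) into either a complete manifest or a loud failure that the job can surface.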