scottyhq opened this issue 3 years ago
Also an interesting discussion about how S3 throttling works for single files versus separate files.
Minor clarification: S3 throttles per key prefix, which means it's less about single vs. separate files and more about the structure of the keys within the bucket. Consider landsat-pds, where each Landsat scene sits under a unique key prefix built from the WRS grid path/row and the scene ID.
```
$ aws s3 ls s3://landsat-pds/L8/220/244/LC82202442014222LGN00/
2016-11-22 23:28:19   61369561 LC82202442014222LGN00_B1.TIF
2016-11-22 23:28:27    8306483 LC82202442014222LGN00_B1.TIF.ovr
2016-11-22 23:28:20   47711056 LC82202442014222LGN00_B10.TIF
2016-11-22 23:28:33    7513726 LC82202442014222LGN00_B10.TIF.ovr
...
```
In this example, the key prefix is `L8/220/244/LC82202442014222LGN00`. I'm a big fan of organizing S3 datasets this way because you can stay within the per-prefix rate limits by distributing requests geographically across a grid (in this case WRS). As long as clients spread their requests across this grid, they can achieve very high aggregate throughput.
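As an illustrative sketch (the helper functions and the scene IDs other than the first are hypothetical, patterned after the landsat-pds layout), the same band read from scenes in different grid cells touches distinct key prefixes, so S3's per-prefix request limits (roughly 5,500 GETs per second per prefix, per AWS documentation) apply to each cell independently:

```python
# Sketch: requests for scenes in different WRS grid cells fall under
# distinct S3 key prefixes, so per-prefix rate limits apply to each
# grid cell independently. Scene IDs below (after the first) are
# illustrative, patterned after landsat-pds.

def scene_key(path: int, row: int, scene_id: str, band: str) -> str:
    """Build a key like L8/220/244/<scene>/<scene>_B1.TIF."""
    return f"L8/{path:03d}/{row:03d}/{scene_id}/{scene_id}_{band}.TIF"

def prefix_of(key: str) -> str:
    """The 'directory' portion of the key, which S3 partitions by."""
    return key.rsplit("/", 1)[0]

scenes = [
    (220, 244, "LC82202442014222LGN00"),
    (221, 244, "LC82212442014229LGN00"),  # hypothetical neighbor
    (220, 245, "LC82202452014222LGN00"),  # hypothetical neighbor
]

keys = [scene_key(p, r, s, "B1") for p, r, s in scenes]
prefixes = {prefix_of(k) for k in keys}

# Three grid cells -> three independent prefixes (and rate limits).
print(len(prefixes))  # 3
```

A client fanning reads out across these prefixes gets roughly N cells times the single-prefix limit, which is exactly why the grid-based layout scales so well.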
Another interesting property of S3 is that it performs much better with fewer, larger files than with many small ones. There is a lot of overhead in accessing a new file for the first time: you have to establish a connection, negotiate TLS, and a handful of additional operations happen within the data center. All of this adds up to time-to-first-byte (TTFB), which is the most expensive part of performing remote reads against S3.
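A back-of-the-envelope model makes this concrete. With hypothetical numbers (these are illustrative assumptions, not measured S3 figures), reading the same total volume as many small objects pays the TTFB overhead once per object, while a single large object read over a reused connection pays it roughly once:

```python
# Back-of-the-envelope cost model. The constants are assumptions for
# illustration, not measured S3 performance figures.

TTFB_S = 0.080          # assumed time-to-first-byte per new object, seconds
THROUGHPUT_BPS = 50e6   # assumed sustained throughput, bytes/second
TOTAL_BYTES = 1e9       # ~1 GB of imagery either way

def read_time(n_requests: int, total_bytes: float) -> float:
    """Seconds to fetch total_bytes split across n_requests GETs."""
    return n_requests * TTFB_S + total_bytes / THROUGHPUT_BPS

many_small = read_time(1000, TOTAL_BYTES)  # 1,000 separate ~1 MB objects
one_large = read_time(1, TOTAL_BYTES)      # one object, read sequentially

print(f"1000 small files: {many_small:.1f} s")  # 100.0 s
print(f"1 large file:     {one_large:.1f} s")   # 20.1 s
```

In practice, range reads against one large file still incur some per-request latency, but far less than a fresh connection and TLS handshake per object, so the qualitative gap holds.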
Of course all of this changes with different filesystems, so when thinking about lots of little COGs vs fewer larger COGs, it is critical to consider the properties of the filesystem being used and optimize around that.
Most COGs in the wild tend to have a ~100 km footprint (corresponding to a single satellite acquisition), so total file size is something less than 1 GB. But you can also generate huge COGs that cover a large spatial area or have very fine resolution. Here is a discussion about considerations for that: https://twitter.com/howardbutler/status/1379053172497375232