pangeo-data / cog-best-practices

Best practices with cloud-optimized-geotiffs (COGs)

how large should cogs be? #9

Open scottyhq opened 3 years ago

scottyhq commented 3 years ago

Most COGs in the wild tend to have a ~100 km footprint (corresponding to a single satellite acquisition), so the total file size is typically less than 1 GB. But you can also generate huge COGs that cover a large spatial area or have very fine resolution. Here is a discussion about considerations for that: https://twitter.com/howardbutler/status/1379053172497375232

a relevant point:

Depends on how smart the client is. A 1 megapixel x 1 megapixel raster tiled as 512x512 has a ~58 MB tile index ((1e6/512)^2 * 2 * 8 bytes). GDAL with recent libtiff will only read a few KB of it when extracting a given tile. Less smart clients will ingest the whole 58 MB at file opening.
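To make that arithmetic concrete, here is a quick back-of-the-envelope calculation in Python (a sketch, assuming a BigTIFF where each tile costs one 8-byte TileOffsets entry plus one 8-byte TileByteCounts entry):

import math

width = height = 1_000_000   # 1 megapixel x 1 megapixel raster
tile = 512                   # internal tiling of 512x512

n_tiles = math.ceil(width / tile) * math.ceil(height / tile)   # ~3.8 million tiles
index_bytes = n_tiles * (8 + 8)   # 8-byte offset + 8-byte byte count per tile

print(f"{n_tiles:,} tiles -> ~{index_bytes / 2**20:.0f} MiB of tile index")
# 3,818,116 tiles -> ~58 MiB of tile index

A smart reader fetches only the handful of index entries it needs for a given tile; a naive one downloads the whole index up front.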

There is also an interesting discussion there about how S3 throttling works for single files versus separate files.

geospatial-jeff commented 3 years ago

There is also an interesting discussion there about how S3 throttling works for single files versus separate files.

Minor clarification: S3 throttles per key prefix, which means it's less about single versus separate files and more about how keys are structured within the bucket. Consider landsat-pds, where each Landsat scene is under a unique key prefix built from the WRS path/row and scene ID.

aws s3 ls s3://landsat-pds/L8/220/244/LC82202442014222LGN00/

2016-11-22 23:28:19   61369561 LC82202442014222LGN00_B1.TIF
2016-11-22 23:28:27    8306483 LC82202442014222LGN00_B1.TIF.ovr
2016-11-22 23:28:20   47711056 LC82202442014222LGN00_B10.TIF
2016-11-22 23:28:33    7513726 LC82202442014222LGN00_B10.TIF.ovr
...

In this example, the key prefix is L8/220/244/LC82202442014222LGN00. I'm a big fan of organizing S3 datasets this way because the per-prefix rate limit is effectively distributed geographically across a grid (in this case WRS). As long as clients spread their requests across this grid, they can achieve very high throughput.
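To illustrate what "spreading requests across the grid" can look like from the client side, here is a minimal sketch (not code from this thread) assuming boto3 and anonymous access to landsat-pds, which may no longer be open. The keys follow the L8/<path>/<row>/<sceneID>/ pattern above; in practice you would mix reads over many different path/row prefixes so that each prefix is throttled independently.

from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
BUCKET = "landsat-pds"

def read_header(key, nbytes=16_384):
    # Range-read just the start of the object, roughly where a COG header lives
    resp = s3.get_object(Bucket=BUCKET, Key=key, Range=f"bytes=0-{nbytes - 1}")
    return resp["Body"].read()

# Keys under a single WRS path/row prefix (taken from the listing above);
# a real client would mix prefixes, e.g. L8/220/244/..., L8/221/244/..., and so on
keys = [
    "L8/220/244/LC82202442014222LGN00/LC82202442014222LGN00_B1.TIF",
    "L8/220/244/LC82202442014222LGN00/LC82202442014222LGN00_B10.TIF",
]

with ThreadPoolExecutor(max_workers=8) as pool:
    headers = list(pool.map(read_header, keys))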

Another interesting property of S3 is that it performs much better with fewer, larger files than with many smaller ones. There is a lot of overhead in accessing a new file for the first time: you have to establish a connection, negotiate TLS, and a handful of additional operations happen within the data center. All of this adds up to time-to-first-byte (TTFB), which is the most expensive part of performing remote reads against S3.
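A rough way to see that overhead (again a sketch, assuming anonymous boto3 access to the same scene) is to compare the first request on a fresh client, which pays for connection setup, TLS, and TTFB, against a second range read that reuses the pooled connection:

import time

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "landsat-pds"
KEY = "L8/220/244/LC82202442014222LGN00/LC82202442014222LGN00_B1.TIF"

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def timed_read(start, length=16_384):
    # Time a single ranged GET of `length` bytes starting at `start`
    t0 = time.perf_counter()
    rng = f"bytes={start}-{start + length - 1}"
    s3.get_object(Bucket=BUCKET, Key=KEY, Range=rng)["Body"].read()
    return time.perf_counter() - t0

cold = timed_read(0)          # new connection: TCP + TLS + TTFB
warm = timed_read(1_000_000)  # reuses the pooled connection
print(f"cold read: {cold:.3f}s, warm read: {warm:.3f}s")

The difference is mostly connection setup: once the connection is warm, many small range reads against one large COG are much cheaper than opening many small files.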

Of course all of this changes with different filesystems, so when thinking about lots of little COGs vs fewer larger COGs, it is critical to consider the properties of the filesystem being used and optimize around that.