njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/

Explore timing/speed issues when using tiles that match block size vs our own spec #82

Open njtierney opened 1 month ago

njtierney commented 1 month ago

I'm curious to see what the difference in speed is when breaking a raster into different tile sizes.

For example, if we have a big raster and break it into 10 tiles, but the blocksize corresponds more closely to 4 tiles, is there any appreciable difference between using 10 tiles and 4?
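
Something like this rough sketch is what I have in mind, assuming {terra} and {bench}; the file path and tile counts are just placeholders, not a real benchmark:

```r
# Rough sketch, assuming {terra} and {bench}; "big_raster.tif" and the
# tile counts are placeholders.
library(terra)
library(bench)

r <- rast("big_raster.tif")

# helper: split the raster into an n_x by n_y grid of tiles,
# then compute a per-tile sum to force each tile to be read
tile_and_sum <- function(r, n_x, n_y) {
  template <- rast(ext(r), nrows = n_y, ncols = n_x, crs = crs(r))
  tile_files <- makeTiles(r, template, filename = tempfile(fileext = "_.tif"))
  vapply(
    tile_files,
    function(f) global(rast(f), "sum", na.rm = TRUE)[[1]],
    numeric(1)
  )
}

bench::mark(
  ten_tiles  = tile_and_sum(r, n_x = 5, n_y = 2),
  four_tiles = tile_and_sum(r, n_x = 2, n_y = 2),
  check = FALSE
)
```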

Aariq commented 1 month ago

I'm unfamiliar with the term "blocksize"—can you elaborate? I assume there is a "sweet spot" because there is going to be overhead for making the tiles and for doing computations on them (marshaling/unmarshaling, etc.). I also assume it will depend on number of workers and RAM. Providing some kind of rough recommendation to users would be great though. Doing some benchmarking in a vignette or pkgdown article might be a good way to do this.

njtierney commented 1 month ago

My understanding is that blocksize is something like the file's default tile size. When you start to read a raster from file into memory, if your query falls within a single block, only that block is read. But if, for example, you have a really wide raster and want to read just the first 2 rows across all columns, then (depending on how the blocksize is specified) you could end up reading in all blocks just to get a few rows out.
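
For example, you can see the block size a file was written with via `terra::fileBlocksize()` (a minimal sketch; the file name is a placeholder):

```r
# A minimal sketch, assuming {terra}; "big_raster.tif" is a placeholder.
library(terra)

r <- rast("big_raster.tif")

# block size (rows x cols) the file was written with, per data source;
# e.g. a striped GeoTIFF might report a few rows by the full width,
# while a tiled GeoTIFF might report something like 256 x 256
fileBlocksize(r)

# reading just the first 2 rows still forces GDAL to read every block
# that those rows intersect
first_rows <- r[1:2, , drop = FALSE]
```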

Like you said, it's a bit tricky to find the sweet spot, since it depends on RAM, workers, read/write speed, and probably other things. I believe this is what the zarr project https://zarr.dev/ is about - handling that kind of metadata properly.

I like your idea of providing some recommendations to users - one thing that could be handy would be producing a schema/plot from https://github.com/hypertidy/grout that shows your raster and the blocksize that is currently specified. This could give you a general sense of how many blocks might be needed, or whether you might want to use some integer multiple of the blocksize. Or maybe there are just two - it's not always easy to know!
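
Not grout itself, but a rough base-R sketch of the kind of schema plot I mean, assuming {terra}; the file name is a placeholder:

```r
# Rough sketch of a block-layout plot, assuming {terra}; this is not
# grout's API, just base graphics over the file's block size.
library(terra)

r  <- rast("big_raster.tif")
bs <- fileBlocksize(r)          # rows x cols per block, per source
e  <- ext(r)

# block edges in raster coordinates (first source only)
row_breaks <- seq(0, nrow(r), by = bs[1, 1])
col_breaks <- seq(0, ncol(r), by = bs[1, 2])

plot(r, main = "raster with file block boundaries")
abline(h = ymax(e) - row_breaks * yres(r), col = "grey40", lty = 2)
abline(v = xmin(e) + col_breaks * xres(r), col = "grey40", lty = 2)
```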

Totally agree that some benchmarking in a vignette or article would be a nice way to demonstrate this to users. My inkling is that it might only really matter for large rasters, but I think we'll only really know once we write it down.

It's also slightly complicated by the fact that GDAL will cache your read when you initially read in a set of blocks. So if a first long, wide pass reading just a few rows from all blocks takes something like 20 seconds, subsequent reads in the same session can be 10x (or more) faster because those blocks are already cached in memory.
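
One rough way around that for benchmarking (just a sketch, assuming {callr} and {terra}; the path is a placeholder) would be to run each cold read in a fresh R session so the GDAL block cache starts empty every time:

```r
# Sketch: time a cold read in a fresh R session so GDAL's block cache
# from earlier reads doesn't carry over; assumes {callr} and {terra}.
library(callr)

cold_read_time <- function(path) {
  callr::r(function(path) {
    library(terra)
    r <- rast(path)
    # reading the first 2 rows forces the relevant blocks to be read
    system.time(r[1:2, ])[["elapsed"]]
  }, args = list(path = path))
}

cold_read_time("big_raster.tif")
```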