microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

Benchmarking of GeoDataset for a paper result #81

Closed · calebrob6 closed this issue 2 years ago

calebrob6 commented 3 years ago

Datasets

We want to test several popular image sources, as well as both raster and vector labels.

There is also a question of which file formats to test. For example, sampling from GeoJSON can take 3 min per __getitem__ call, whereas ESRI Shapefile only takes 1 sec per __getitem__ call (https://github.com/microsoft/torchgeo/pull/69#issuecomment-892816306).
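A quick way to sanity-check this kind of gap (a rough sketch only; the file names are placeholders) is to time how long fiona takes to read all features from each format:

```python
import time

import fiona

# Hypothetical label files in the two formats under comparison
for path in ("labels.geojson", "labels.shp"):
    start = time.time()
    with fiona.open(path) as src:
        features = list(src)  # force a full read of every feature
    print(path, len(features), f"{time.time() - start:.1f}s")
```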

Experiments

For the warping strategy, we should test the following possibilities:

What is the upfront cost of these pre-processing steps?

Example notebook: https://gist.github.com/calebrob6/d9bc5609ff638d601e2c35a1ab0a2dec

adamjstewart commented 3 years ago

I think this will require a significant rework of our __getitem__ implementation. Right now, we warp and then merge/sample from a tile at the same time. If we want to benefit from the 2-step random tile/chip sampling strategy, we'll have to use an LRU cache on the entire tile after warping.
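Roughly what that could look like (a minimal sketch; load_warped_tile, chip_size, and keying the cache by EPSG code are all placeholders, not a final design):

```python
import functools
import random

import numpy as np
import rasterio
from rasterio.crs import CRS
from rasterio.vrt import WarpedVRT


@functools.lru_cache(maxsize=16)  # cache whole warped tiles, not chips
def load_warped_tile(path: str, epsg: int) -> np.ndarray:
    # Opening the file and the VRT is lazy; the warp happens when we read()
    with rasterio.open(path) as src, WarpedVRT(src, crs=CRS.from_epsg(epsg)) as vrt:
        return vrt.read()


def random_chip(paths, epsg: int, chip_size: int = 256) -> np.ndarray:
    # Step 1: pick a random tile (cached after the first read)
    tile = load_warped_tile(random.choice(paths), epsg)
    # Step 2: pick a random chip out of the cached array
    _, height, width = tile.shape
    row = random.randint(0, height - chip_size)
    col = random.randint(0, width - chip_size)
    return tile[:, row : row + chip_size, col : col + chip_size]
```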

adamjstewart commented 3 years ago

I think we can also consider the following I/O strategies:

Merging should happen after the fact so that (tile 1, tile 2, tile 1 + 2) don't end up being 3 different entries in the cache.

I don't think we need to consider situations in which we:

These strategies make sense for tile-based raster images, but are slightly more complicated for vector geometries or static regional maps. We may need to change the default behavior based on the dataset.

adamjstewart commented 3 years ago

For timing, we should choose some arbitrary epoch size, then experiment with various batch sizes and see how long it takes to load an entire epoch.
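Something like this could serve as the timing harness (a sketch, assuming torchgeo's RandomGeoSampler and stack_samples collate function; the epoch size and chip size are arbitrary):

```python
import time

from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler

EPOCH_SIZE = 4096  # arbitrary, as discussed above


def time_epoch(dataset, batch_size: int, num_workers: int = 0) -> float:
    """Time how long it takes to load one full epoch from a GeoDataset."""
    sampler = RandomGeoSampler(dataset, size=256, length=EPOCH_SIZE)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        collate_fn=stack_samples,
    )
    start = time.time()
    for _ in loader:  # just load the data, no model involved
        pass
    return time.time() - start


# e.g. for some GeoDataset instance `ds`:
# for bs in (16, 32, 64, 128):
#     print(bs, time_epoch(ds, bs))
```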

adamjstewart commented 3 years ago

Here's where I'm currently stuck, to remind myself when I next pick this up:

Our process right now is:

  1. Open filehandles for raw data (rasterio.open)
  2. Open filehandles for warped VRTs (rasterio.vrt.WarpedVRT)
  3. Merge VRTs to get an array (rasterio.merge.merge)
  4. Return array as tensor

Steps 1 and 2 don't actually do anything and are almost instantaneous. It isn't until you actually try to read() the data that warping occurs, and read() is called inside rasterio.merge.merge.

If we want to cache this reading of warped data, we'll have to call vrt.read() ourselves. Since rasterio.merge.merge only accepts filenames or filehandles as input, we'll basically need to implement our own merge algorithm that takes 1+ cached numpy arrays, creates a new array with the correct dimensions, and indexes the old arrays to copy the data.

The hard part here will be keeping track of coordinates, nodata values, and merging correctly. See https://github.com/mapbox/rasterio/blob/master/rasterio/merge.py for the source code, most of which we'd need to reimplement ourselves.
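For reference, a stripped-down version of that merge might look like this (a sketch only, assuming every cached entry is an (array, transform) pair from a WarpedVRT already in a shared CRS and resolution; real code would also need the nodata and rounding handling that rasterio.merge.merge does):

```python
import math

import numpy as np
from rasterio.transform import Affine


def merge_cached(entries, bounds, res, nodata=0):
    """entries: list of (array, transform); bounds: (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bounds
    width = math.ceil((maxx - minx) / res)
    height = math.ceil((maxy - miny) / res)
    bands, dtype = entries[0][0].shape[0], entries[0][0].dtype
    out = np.full((bands, height, width), nodata, dtype=dtype)
    out_transform = Affine(res, 0.0, minx, 0.0, -res, maxy)

    for array, transform in entries:
        # offset of this tile's upper-left corner within the output grid
        col_off = round((transform.c - minx) / res)
        row_off = round((maxy - transform.f) / res)
        # clip the tile to the requested bounds and copy it into place
        rows = slice(max(row_off, 0), min(row_off + array.shape[1], height))
        cols = slice(max(col_off, 0), min(col_off + array.shape[2], width))
        src_rows = slice(rows.start - row_off, rows.stop - row_off)
        src_cols = slice(cols.start - col_off, cols.stop - col_off)
        out[:, rows, cols] = array[:, src_rows, src_cols]

    return out, out_transform
```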

adamjstewart commented 3 years ago

Another hurdle: the size of each array depends greatly on the dataset, but most are around 0.5 GB per file. We can't really assume users have >8 GB of RAM, which greatly limits our LRU cache size. We could use something like psutil to query the system memory, and hard-code the average file size for each dataset if we want to make things more flexible.
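One option (a sketch; the 0.5 GB figure is just the average quoted above, not a per-dataset measurement) would be to derive the LRU cache size from whatever RAM psutil reports as available:

```python
import psutil

AVG_FILE_SIZE = 0.5 * 2**30  # ~0.5 GB per warped tile, per the estimate above


def suggested_cache_size(fraction: float = 0.5) -> int:
    """Number of tiles that fit in `fraction` of the currently available RAM."""
    available = psutil.virtual_memory().available
    return max(1, int(available * fraction // AVG_FILE_SIZE))
```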

adamjstewart commented 3 years ago

For now, I think we can rely on GDAL's internal caching behavior. When I read a VRT the second time around, it seems to be significantly faster. Still not as fast as reading the raw data or as indexing from a loaded array, but good enough for a first round of benchmarking. GDAL also lets you configure the cache size.
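For example, GDAL's block cache is controlled by the GDAL_CACHEMAX config option, which can be set through rasterio (the file name below is a placeholder):

```python
import rasterio

# GDAL_CACHEMAX is interpreted as megabytes for small values like this
with rasterio.Env(GDAL_CACHEMAX=512):
    with rasterio.open("tile.tif") as src:
        data = src.read()
```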

adamjstewart commented 3 years ago

Preliminary results look very promising! (benchmark plot attached)

calebrob6 commented 2 years ago

@adamjstewart, sketch of the full experiment:

adamjstewart commented 2 years ago

@calebrob6 the above proposal covers the matrix of:

There are a lot of additional constraints that we're currently skipping:

Do you think it's fine to skip these for the sake of time? I doubt reviewers would reject us outright for not including one of these permutations, and they can always ask us to perform additional experiments if they want.

Also, we should definitely benchmark not only RasterDataset but also VectorDataset (maybe Sentinel + Canadian Building Footprints?). Should I purposefully change the resolution of one of these datasets? Should I purposefully switch to a CRS different from all of the files, or keep the CRS of one of the files?
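A minimal sketch of that pairing, assuming the current torchgeo API where the & operator builds an IntersectionDataset from a raster and a vector dataset (the data directories are placeholders):

```python
from torchgeo.datasets import CanadianBuildingFootprints, Sentinel2

imagery = Sentinel2("data/sentinel2")
labels = CanadianBuildingFootprints("data/cbf")

# Samples drawn from the intersection contain both imagery and rasterized
# building footprints for the same bounding box.
dataset = imagery & labels
```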

adamjstewart commented 2 years ago

Also, do we want to compare with different batch_sizes or different num_workers?

calebrob6 commented 2 years ago

I'd do the first matrix as quickly as possible because the results of that are going to be very informative. If that all works out, then you can repeat the same with a VectorDataset.

File format: GeoTIFF vs. HDF5, Shapefile vs. GeoJSON

I don't think this is important right now; i.e., we can just assume the data is in a good format (COG and Shapefile/GeoPackage).

Warping strategy

In the above sketch you can repeat the experiments with the manually aligned versions of the dataset to test the "already in correct CRS/res" case; the first set of experiments covers the "change CRS and res" case. It might be interesting to see whether warping or resampling is more expensive, but I don't think that's interesting for the paper.

Also, do we want to compare with different batch_sizes or different num_workers?

Sure! These experiments should be very quick to run once you have a script for them.

calebrob6 commented 2 years ago

Some things to discuss soon:

adamjstewart commented 5 months ago

We're following up on this discussion in https://github.com/microsoft/torchgeo/issues/1330#issuecomment-1962896565