Visualizing cloud-optimized geotiffs

rsignell-usgs commented 5 years ago

Is your feature request related to a problem? Please describe. Cloud-Optimized Geotiffs are quickly becoming a community standard for storing geospatial raster data on the Cloud. These COGs store the data in small tiles, and have built-in overviews, which makes extracting data fast (especially when using parallel processing). And the cogeo tools make it easy to create and validate COGs.

There is a AWS-lambda service that can create a WMTS tile service on-the-fly from any COG on object storage, and this works great in geoviews: https://nbviewer.jupyter.org/gist/jsignell/9980e640536180b44fb59e7812208e27/cog_tiling_small_panel.ipynb

But it would be way cooler for geoviews to be able to extract the chunks a render them in parallel using dask, so that on-the-fly service wouldn't even be needed.

Describe the solution you'd like I'd like some tools to extract and visualize the correct chunks from the most appropriate overviews in a COG, using Dask to parallelize the effort. This would allow us to use geoviews to effectively and interactively browse any COG to full resolution in the browser without additional services.

Describe alternatives you've considered We can use the "on-the-fly" WMTS service, but someone needs to run this and it also costs $$.

It would be way cooler to have this functionality native to geoviews!

rsignell-usgs commented 4 years ago

@jbednar, what would it take to get this on the docket?

jbednar commented 4 years ago

I'm not sure. From https://github.com/pangeo-data/pangeo-example-notebooks/issues/21 it sounds like there has already been some effort at improving access to COGs for xarray usage? Is any of that simply usable with geoviews already, i.e. to use rasterio/gdal/etc. to make a lazy xarray that GeoViews can use without further work? I haven't looked closely at what's been done and what remains...

scottyhq commented 4 years ago

@jbednar and @rsignell-usgs the linked issue above is one of many where we've speculated on the ability to make xarray+dask+holoviews more efficient working with COGs . there hasn't been a lot of concentrated effort. I think a on-the-fly WMTS layer is a special case, and there are some lower-hanging fruit improvements that could be made that i'll try to clarify below.

Consider a single COG (maybe we open a separate issue in holoviews/hvplot):

import os, rioxarray, hvplot.xarray
os.environ['GDAL_DISABLE_READDIR_ON_OPEN']='EMPTY_DIR'
os.environ['AWS_NO_SIGN_REQUEST']='YES'
url = 'https://sentinel-s1-rtc-indigo.s3.us-west-2.amazonaws.com/tiles/RTC/1/IW/10/T/ET/2020/S1B_20200106_10TET_ASC/Gamma0_VV.tif'
da = rioxarray.open_rasterio(url)
da.hvplot.image(rasterize=True, cmap='gray', clim=(0,0.4))

This is amazing functionality because as we zoom into the plot we get updating resolution on-the-fly. I could be wrong, but I believe the entire .tif is read into memory and then rasterized according to zoom level? So effectively running something like da.coarsen(x=16,y=16, boundary='pad').mean(). However, a COG has precomputed 'overviews' so you don't actually have to do this computation yourself. For the initial plot for example the code below is extremely fast because instead of pulling 5490x5490 pixels and resampling, just pull the lowest resolution overview (344x344 pixels):

da = rioxarray.open_rasterio(url, overview_level=3)
da.hvplot.image(rasterize=True, cmap='gray', clim=(0,0.4))

It's a complicated stack of pieces coming together here. Ultimately the TIF is read with rasterio.open() and GDAL/libtiff which is clever about only reading from overviews depending on the read request. This code block is equivalent to the above, returning a numpy array.

import rasterio
with rasterio.open(url) as src:
    numpy_array = src.read(out_shape=(src.height//16, src.width//16))

Information on this particular test file:

with rasterio.open(url) as src:
    print(src.profile)  
    overview_factors = [src.overviews(i) for i in src.indexes][0]
    overview_levels = list(range(len(overview_factors)))
    print('Overview levels: ', overview_levels)
    print('Overview factors: ',  overview_factors)

{'driver': 'GTiff', 'dtype': 'float32', 'nodata': 0.0, 'width': 5490, 'height': 5490, 'count': 1, 'crs': CRS.from_epsg(32610), 'transform': Affine(20.0, 0.0, 499980.0,
       0.0, -20.0, 5300040.0), 'blockxsize': 512, 'blockysize': 512, 'tiled': True, 'compress': 'deflate', 'interleave': 'band'}
Overview levels:  [0, 1, 2, 3]
Overview factors:  [2, 4, 8, 16]

In summary, it seems like there is room to improve the rasterize=True efficiency when overviews are present in the underlaying data. I can think of a few approaches:

implement hvplot.cog(rasterize=True) that reads from a given overview depending on zoom level. Or maybe have has_overviews attribute in the DataArray that holoviews can work with.
figure out how to pass out_shape= to the xarray rasterio backend to offload efficient reading of the COG overviews to the underlaying GDAL driver. https://github.com/pydata/xarray/issues/3269

holoviz / geoviews

Visualizing cloud-optimized geotiffs #370