gjoseph92 / stackstac

Turn a STAC catalog into a dask-based xarray
https://stackstac.readthedocs.io
MIT License
250 stars 49 forks source link

Support multi-band COGs #62

Open TomAugspurger opened 3 years ago

TomAugspurger commented 3 years ago

This is mentioned in the README, but I thought I'd open an issue that I can link to.

Currently, multi-band cogs are not supported by stackstac.

>>> import pystac
>>> import stackstac

>>> item = pystac.read_file(
...     "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip/items/fl_m_2608004_nw_17_060_20191215_20200113"
... )
>>> ds = stackstac.stack([item.to_dict()])
>>> ds.shape
(1, 1, 12240, 11040)

That COG has 4 bands, so the shape should be (1, 4, 12240, 11040).

I started a branch, but haven't had a chance to finish it off. I'll post here if / when I pick it up again.

kylebarron commented 3 years ago

You could probably parse both eo:bands and raster:bands if it exists. Seems you could optimize the Dask arrays in some ways if you already know e.g. what the data type of the file is.

TomAugspurger commented 3 years ago

@gjoseph92 implementation-wise, do you have a sense for which of these would be preferable?

  1. Rewrite the incoming items to take a multiband asset like image and replace it with one asset per band. Each new asset would have the same href, so we also need to include the band index number somewhere in the item. The Reader would read a single band.
  2. Rewrite the asset detection logic to look for eo:bands in each asset. Then (somehow) update the Dataset construction logic to know about these bands. The Readers would maybe need to be updated to know that they're supposed to read all the bands.

If I had to guess, 1 sounds a bit easier to shoehorn into the current setup, but I'm not confident in that.

gjoseph92 commented 3 years ago

Thanks for opening this @TomAugspurger. It's true that 1 might be easier, but I think we should go with some form of 2 for a few reasons. I had always intended Readers to support multi-band; I just cut that corner to get the first release out.

Depending on the file format, it may not be possible to read the individual bands within an asset separately. For example by default, GeoTIFFs are pixel-interleaved. Therefore, we really don't want to have a separate "virtual" asset per band, because then fetching any of R, G, or B would require reading RGB—and trying to fetch R, G, and B would triple-read the data (GDAL block-caching aside; we don't want to rely on that too much) and open triple datasets, which will then be further copied per thread!

So basically, we need dask's chunking along the bands to match up with how the data is actually physically chunked. In the case of Sentinel-2, if we asked stackstac for assets=["B08", "visual-10m", "B11"], I'd want the chunks along the bands axis to be (1, 3, 1). That way, whether you access stack[:, 1] or stack[:, 1:3], dask would fetch just one asset, which would read all three RGB bands.

Of course non-GeoTIFF formats (or band-interleaved GeoTIFFs) might support efficient sub-selection by band, but since stackstac primarily cares about STAC best practices right now, and GeoTIFF is STAC best practice, I think it'd be reasonable to start by assuming assets don't support efficient band sub-selection.

So we'll say we always have one chunk per asset, and the length of that chunk equals the number of bands in that asset.

Then, I think we'd do something like:

  1. In prepare_items, figure out which bands are in the asset.

    IMO this is actually the hardest part and the reason I skipped it, because I didn't want to deal with item_assets vs eo:bands on the Item vs eo:bands on the Asset, and item_assets won't even work currently because prepare_items is designed around taking a sequence of STAC items, not a Collection. Now that ItemCollection exists, I'm not sure if this is less of a pain—can an ItemCollection have an item_assets field? In practice, will any STAC API actually fill that in? Curious if @matthewhanson has any recommendations here. Once you figure this hard part out, then please do #3 as well!

  2. Add the expected number of bands to the asset table.
  3. Make Readers check that the asset, when opened, actually has that number of bands, then read all of them (read() vs read(1)). I want to guard against incorrect STACs breaking the dask array downstream.
  4. Set the chunks of the output array along the bands dimension as described above (probably with an adjust_chunks={"b": bands_chunks} or something in the blockwise here).
scotjohn commented 2 years ago

Are there any new developments concerning this? I wonder if there is a workaround that allows us to work with multiband data (apart from creating single band copies).

Or can we simply input single band VRTs pointing to the multiband data, without loosing to much performance?

gjoseph92 commented 2 years ago

Unless @TomAugspurger has been working on it, there are no updates. Updating the dask and rasterio side of things to handle multi-band assets is comparatively straightforward. The main thing I haven't wanted to deal with is parsing the various places in STAC where the number of bands per asset can be defined.

There isn't a workaround for this that I know of. If you want to take a stab at it yourself, contributions are always welcome!

scotjohn commented 2 years ago

@gjoseph92 thank you for your answer. I tried it out: Working with single-band VRTs pointing to multiband COGs worked well for me, allowing to use stackstac in the intended way, i.e. single-band raster inputs. Also it seems to have no impact on processing speed (concluding from a small, limited comparative experiement).

gjoseph92 commented 2 years ago

@scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?

TomAugspurger commented 2 years ago

I haven’t done any work on this. I’d also be curious to see an example using VRTs.

On Mar 16, 2022 at 12:31:44 PM, Gabe Joseph @.***> wrote:

@scotjohn https://github.com/scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?

— Reply to this email directly, view it on GitHub https://github.com/gjoseph92/stackstac/issues/62#issuecomment-1069384513, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIUSX36WZWHBTN2ROFDVAILIBANCNFSM47GBPUDQ . You are receiving this because you were mentioned.Message ID: @.***>

scotjohn commented 2 years ago

@scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?

Yes, it is as you described.

I work with 4-band PlanetFusion data that are available as multiband COGs. For each COG I create four VRTs, i.e. one per band, which become the assets. I have written the STAC from scratch. So I am not sure how this fits into the problem of existing STACs and public data catalogs.

scotjohn commented 2 years ago

To generate VRTs from COGs:

def create_bandwise_vrt(in_file: Path, out_dir: Path, n_bands: int):
    for i in range(1, n_bands + 1):
        out_vrt = Path(out_dir, in_file.stem + f"_B{i:02}.vrt")
        in_files = [str(in_file)]
        vrt_options = gdal.BuildVRTOptions(
            bandList=[i]
        )
        gdal.BuildVRT(str(out_vrt), in_files, options=vrt_options)
julianblue commented 1 year ago

Hi @gjoseph92 , is this feature planned to be integrated some point soon or is there a blocker that needs support ?

gjoseph92 commented 1 year ago

@julianblue my current job doesn't allocate time for me to work on stackstac, so there's no timeline for this happening. I'm opening to reviewing PRs though.