Open TomAugspurger opened 3 years ago
You could probably parse both eo:bands
and raster:bands
if it exists. Seems you could optimize the Dask arrays in some ways if you already know e.g. what the data type of the file is.
@gjoseph92 implementation-wise, do you have a sense for which of these would be preferable?
image
and replace it with one asset per band. Each new asset would have the same href
, so we also need to include the band index number somewhere in the item. The Reader would read a single band.eo:bands
in each asset. Then (somehow) update the Dataset construction logic to know about these bands. The Readers would maybe need to be updated to know that they're supposed to read all the bands.If I had to guess, 1 sounds a bit easier to shoehorn into the current setup, but I'm not confident in that.
Thanks for opening this @TomAugspurger. It's true that 1 might be easier, but I think we should go with some form of 2 for a few reasons. I had always intended Readers to support multi-band; I just cut that corner to get the first release out.
Depending on the file format, it may not be possible to read the individual bands within an asset separately. For example by default, GeoTIFFs are pixel-interleaved. Therefore, we really don't want to have a separate "virtual" asset per band, because then fetching any of R, G, or B would require reading RGB—and trying to fetch R, G, and B would triple-read the data (GDAL block-caching aside; we don't want to rely on that too much) and open triple datasets, which will then be further copied per thread!
So basically, we need dask's chunking along the bands to match up with how the data is actually physically chunked. In the case of Sentinel-2, if we asked stackstac for assets=["B08", "visual-10m", "B11"]
, I'd want the chunks
along the bands axis to be (1, 3, 1)
. That way, whether you access stack[:, 1]
or stack[:, 1:3]
, dask would fetch just one asset, which would read all three RGB bands.
Of course non-GeoTIFF formats (or band-interleaved GeoTIFFs) might support efficient sub-selection by band, but since stackstac primarily cares about STAC best practices right now, and GeoTIFF is STAC best practice, I think it'd be reasonable to start by assuming assets don't support efficient band sub-selection.
So we'll say we always have one chunk per asset, and the length of that chunk equals the number of bands in that asset.
Then, I think we'd do something like:
In prepare_items
, figure out which bands are in the asset.
IMO this is actually the hardest part and the reason I skipped it, because I didn't want to deal with item_assets vs eo:bands
on the Item vs eo:bands
on the Asset, and item_assets
won't even work currently because prepare_items
is designed around taking a sequence of STAC items, not a Collection. Now that ItemCollection exists, I'm not sure if this is less of a pain—can an ItemCollection have an item_assets
field? In practice, will any STAC API actually fill that in? Curious if @matthewhanson has any recommendations here. Once you figure this hard part out, then please do #3 as well!
read()
vs read(1)
). I want to guard against incorrect STACs breaking the dask array downstream.adjust_chunks={"b": bands_chunks}
or something in the blockwise here).Are there any new developments concerning this? I wonder if there is a workaround that allows us to work with multiband data (apart from creating single band copies).
Or can we simply input single band VRTs pointing to the multiband data, without loosing to much performance?
Unless @TomAugspurger has been working on it, there are no updates. Updating the dask and rasterio side of things to handle multi-band assets is comparatively straightforward. The main thing I haven't wanted to deal with is parsing the various places in STAC where the number of bands per asset can be defined.
There isn't a workaround for this that I know of. If you want to take a stab at it yourself, contributions are always welcome!
@gjoseph92 thank you for your answer. I tried it out: Working with single-band VRTs pointing to multiband COGs worked well for me, allowing to use stackstac in the intended way, i.e. single-band raster inputs. Also it seems to have no impact on processing speed (concluding from a small, limited comparative experiement).
@scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?
I haven’t done any work on this. I’d also be curious to see an example using VRTs.
On Mar 16, 2022 at 12:31:44 PM, Gabe Joseph @.***> wrote:
@scotjohn https://github.com/scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?
— Reply to this email directly, view it on GitHub https://github.com/gjoseph92/stackstac/issues/62#issuecomment-1069384513, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIUSX36WZWHBTN2ROFDVAILIBANCNFSM47GBPUDQ . You are receiving this because you were mentioned.Message ID: @.***>
@scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?
Yes, it is as you described.
I work with 4-band PlanetFusion data that are available as multiband COGs. For each COG I create four VRTs, i.e. one per band, which become the assets. I have written the STAC from scratch. So I am not sure how this fits into the problem of existing STACs and public data catalogs.
To generate VRTs from COGs:
def create_bandwise_vrt(in_file: Path, out_dir: Path, n_bands: int):
for i in range(1, n_bands + 1):
out_vrt = Path(out_dir, in_file.stem + f"_B{i:02}.vrt")
in_files = [str(in_file)]
vrt_options = gdal.BuildVRTOptions(
bandList=[i]
)
gdal.BuildVRT(str(out_vrt), in_files, options=vrt_options)
Hi @gjoseph92 , is this feature planned to be integrated some point soon or is there a blocker that needs support ?
@julianblue my current job doesn't allocate time for me to work on stackstac, so there's no timeline for this happening. I'm opening to reviewing PRs though.
This is mentioned in the README, but I thought I'd open an issue that I can link to.
Currently, multi-band cogs are not supported by
stackstac
.That COG has 4 bands, so the shape should be
(1, 4, 12240, 11040)
.I started a branch, but haven't had a chance to finish it off. I'll post here if / when I pick it up again.