Experiment with stacking for kerchunk

jsignell commented 6 months ago

This idea came out of a comment here: https://github.com/stac-utils/xpystac/issues/34#issuecomment-1988612112

Conceptually, it seems like it should be possible to read and stack kerchunk and zarr data contained in a single item's assets or in the assets of a list of items. Not sure if this is the most elegant way :shrug:


import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

# These items don't specify the media_type and role that xpystac uses to decide
# that an asset refers to a kerchunk reference file, so tidy that up first.
for item in items:
    for asset in item.assets.values():
        if asset.href.endswith(".json"):
            asset.media_type = "application/json"
            asset.roles = ["index"]

data = xr.open_dataset(items, engine="stac", stacking_library="xpystac", chunks={})
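For anyone who wants to try the tidy-up step without network access, here is a minimal sketch of the same idea on plain STAC item dictionaries instead of pystac objects. The helper name `mark_kerchunk_assets`, the two constants, and the example item are all hypothetical and only for illustration. It assumes, as in the loop above, that any asset whose href ends in `.json` is a kerchunk reference file, and it writes the media type to the `type` field used in STAC item JSON.

```python
import copy

# Assumed convention, taken from the snippet above: kerchunk reference
# assets carry the JSON media type and the "index" role.
KERCHUNK_MEDIA_TYPE = "application/json"
KERCHUNK_ROLE = "index"


def mark_kerchunk_assets(item_dict):
    """Return a copy of a STAC item dict with its .json assets tagged as kerchunk references."""
    tidied = copy.deepcopy(item_dict)  # leave the caller's dict untouched
    for asset in tidied.get("assets", {}).values():
        if asset.get("href", "").endswith(".json"):
            asset["type"] = KERCHUNK_MEDIA_TYPE
            asset["roles"] = [KERCHUNK_ROLE]
    return tidied


# Hypothetical, heavily trimmed item used only to exercise the helper.
example_item = {
    "id": "example-item",
    "assets": {
        "reference": {"href": "https://example.com/refs/2001.json"},
        "thumbnail": {"href": "https://example.com/thumb.png", "type": "image/png"},
    },
}

tidied_item = mark_kerchunk_assets(example_item)
```

Only the `.json` asset is modified; the thumbnail keeps its original media type and gets no roles. A complete item dictionary tidied this way could then be loaded with `pystac.Item.from_dict` and passed to `xr.open_dataset` as in the snippet above.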

clausmichele commented 6 months ago

@jsignell you can use these new versions of the Items, which have the correct media type set and the roles set to index:


url_1 = "https://gist.githubusercontent.com/clausmichele/b101fcf12f17c746b2c5db57ef43a650/raw/bd7c2c2d25a328d01b316ec9bbab2c7503c0e343/CERRA-20010101000000_20011231000000_2.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/b101fcf12f17c746b2c5db57ef43a650/raw/bd7c2c2d25a328d01b316ec9bbab2c7503c0e343/CERRA-20020101000000_20021231000000_2.json"

jsignell commented 6 months ago

Nice! Yeah it works well with those versions:

import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/b101fcf12f17c746b2c5db57ef43a650/raw/bd7c2c2d25a328d01b316ec9bbab2c7503c0e343/CERRA-20010101000000_20011231000000_2.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/b101fcf12f17c746b2c5db57ef43a650/raw/bd7c2c2d25a328d01b316ec9bbab2c7503c0e343/CERRA-20020101000000_20021231000000_2.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

data = xr.open_dataset(items, engine="stac", stacking_library="xpystac", chunks={})
data

Since it's purely additive, I don't see the harm in merging this once I write up some tests.

Experiment with stacking for kerchunk #38