stac-utils / xpystac

For extending xarray.open_dataset to accept pystac objects
MIT License
30 stars · 1 fork

opening STAC elements with assets pointing to `kerchunk` files #34

Open keewis opened 8 months ago

keewis commented 8 months ago

kerchunk somewhat recently added an `open_dataset` engine:

xr.open_dataset(url, engine="kerchunk", storage_options={...})

This allows getting rid of the somewhat confusing and certainly hard to remember "create a fsspec mapper and feed it to zarr" pattern that was necessary before.

I believe we can make use of this to get xpystac to transparently open reference files from STAC items / collections. Note that this would be different from the proposal in #32, which aims to store the references directly in STAC (as far as I can tell, #32 would avoid the additional download step, but wouldn't be able to support enormous reference data or file formats other than JSON).

To make this happen, I think we would need to:

  1. define a kerchunk mime type (media type?) that would work regardless of whether the references are stored in a JSON file, a compressed JSON file, a parquet store or a (hypothetical) zarr store
  2. make xpystac detect the mime type and open the references using the pattern above
  3. decide whether or not it makes sense to have xpystac deal with collection level assets

I can help with 2, but I don't really have an idea of how to construct (and register?) a proper mime type. 3 would need some discussion, but should be easy to implement.
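To make step 2 a bit more concrete, here is a minimal sketch of what the detection logic could look like. The helper name and the accepted role names are assumptions for illustration, not existing xpystac API:

```python
# Hypothetical helper for step 2: decide from STAC asset metadata whether
# the asset points at a kerchunk reference file. Role names are assumed.
KERCHUNK_ROLES = {"references", "index"}

def is_kerchunk_asset(media_type, roles):
    """Return True if the asset metadata marks it as kerchunk references."""
    if roles is not None and KERCHUNK_ROLES.intersection(roles):
        return True
    # plain-JSON reference files may only carry a generic media type
    return media_type == "application/json"

# a matching asset would then be opened with the pattern above:
#     xr.open_dataset(asset.href, engine="kerchunk", storage_options={...})
```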

cc @dwest77a, @TomAugspurger

TomAugspurger commented 8 months ago

FWIW, I'm having some reservations about embedding Kerchunk metadata in STAC items. I'll post over in https://github.com/stac-utils/xpystac/issues/32.

For the media types stuff, AFAIK it's very challenging to officially register one. We might be able to use asset roles to indicate that the asset at that link (which might be JSON, parquet, ...) is intended to be consumed by Kerchunk.

jsignell commented 8 months ago

Yeah I saw that when I was working on #32! That's great to see!!

  1. define a kerchunk mime type (media type?) that would work regardless of whether the references are stored in a JSON file, a compressed JSON file, a parquet store or a (hypothetical) zarr store

So far xpystac has been using a combination of media_type and role to tell if something is a kerchunk file or not. Here is the code for that: https://github.com/stac-utils/xpystac/blob/342c1973157903b03eb8887954abf049a559b858/xpystac/core.py#L74-L76

Obviously the media_type solution isn't great for potentially supporting kerchunk metadata stored in parquet or something, but I think role is exactly right for this. So it's just a matter of agreeing on what the correct role is for kerchunk files. It looks like xpystac is expecting index or references, so maybe we just pick one of those?

  2. make xpystac detect the mime type and open the references using the pattern above

Yeah I think the work would be that when passed an item, we should look for an asset that has the specific kerchunk role. I am wondering if it's necessary to actually depend on the kerchunk library to achieve this functionality or whether the existing implementation is good enough:

https://github.com/stac-utils/xpystac/blob/342c1973157903b03eb8887954abf049a559b858/xpystac/core.py#L88-L96

The main issue I have with relying on the kerchunk library directly is the dependencies which I don't think you need if you are just reading.
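For reference, the kerchunk-free read path boils down to handing the reference file to the zarr engine through fsspec's `reference://` protocol. A sketch of that pattern (the helper is hypothetical, not the exact xpystac code):

```python
def zarr_reference_kwargs(ref_href):
    """Build xr.open_dataset arguments that read a kerchunk reference
    file through the zarr engine, with no kerchunk import needed."""
    return dict(
        filename_or_obj="reference://",
        engine="zarr",
        backend_kwargs={
            "storage_options": {"fo": ref_href},  # "fo" = the reference file
            "consolidated": False,
        },
        chunks={},
    )

# usage (requires network access; the URL is a placeholder):
#     import xarray as xr
#     ds = xr.open_dataset(**zarr_reference_kwargs("https://example.com/refs.json"))
```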

  3. decide whether or not it makes sense to have xpystac deal with collection level assets

I don't see why not. xpystac doesn't currently try to do anything with collections, so it should be fine to just look for a kerchunk asset in there.

jsignell commented 8 months ago

oops Tom beat me to it :smile:

keewis commented 7 months ago

wow, it looks like I didn't read the readme or the code closely enough! Looks like xpystac already does what I was suggesting (although documentation on this feature is a bit sparse right now).

Obviously the media_type solution isn't great for potentially supporting kerchunk metadata stored in parquet or something, but I think role is exactly right for this.

That makes sense to me. The only question I'd have is: are those sufficient? In other words, do we think that someone might want to use "index" or "references" as a role in a different context? If so, we might either have to use an even more specific role (kerchunk?) or maintain a set of accepted media types (or get kerchunk to maintain one for us).

(maybe I'm overthinking this, and we can just fix if anything like this occurs)

The main issue I have with relying on the kerchunk library directly is the dependencies which I don't think you need if you are just reading.

For reference, which dependency of kerchunk is it that you don't think is necessary for reading? Both the package metadata and the conda-forge feedstock list fsspec, numcodecs, ujson and zarr (the package metadata also has numpy) as dependencies. Of these, I believe only ujson is not strictly necessary for reading (I might be wrong, though), but it usually is faster than the builtin json so it doesn't hurt either.

That said, as it is, the mapper creation is hidden (which was the main point of this issue), so the only argument I have for switching to engine="kerchunk" is that it would make the implementation in xpystac a bit more symmetric.

I don't see why not. xpystac doesn't currently try to do anything with collections, so it should be fine to just look for a kerchunk asset in there.

I added that because I've seen @TomAugspurger has some zarr stores of model data as collection-level assets, which I think makes sense (and I'd treat reference files the same as zarr stores).

jsignell commented 7 months ago

documentation on this feature is a bit sparse right now

lol what an understatement :)

maybe I'm overthinking this, and we can just fix if anything like this occurs

Yeah I suspect that we can just try to treat "index" or "references" assets as kerchunk and do some nice error handling for the case where an asset with that role exists and is not kerchunk. On the other hand, it does seem like adding new roles is encouraged (https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#list-of-asset-roles), so if we feel that those are not explicit enough then we can create a "kerchunk" role.
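One cheap version of that error handling: before treating an "index"/"references" asset as kerchunk, sanity-check that the decoded JSON actually looks like a reference set. A heuristic sketch (the version-1 layout with "version"/"refs" keys and the flat version-0 layout follow the kerchunk reference format; the function itself is hypothetical):

```python
def looks_like_kerchunk(obj):
    """Heuristically check that decoded JSON is a kerchunk reference set."""
    if not isinstance(obj, dict):
        return False
    # version-1 files wrap the references in {"version": 1, "refs": {...}};
    # version-0 files are a flat mapping whose keys include zarr metadata
    return ("refs" in obj and obj.get("version") == 1) or bool(
        {".zgroup", ".zarray"} & obj.keys()
    )
```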

which dependency of kerchunk is it that you don't think is necessary for reading?

I was thinking numcodecs, but on second thought, maybe I was misremembering a different dependency issue that I was running into on #33. I do agree that even if it's just 6 lines it's a lot cleaner to not duplicate the read functionality.

I added that because I've seen @TomAugspurger has some zarr stores of model data as collection-level assets, which I think makes sense (and I'd treat reference files the same as zarr stores).

xpystac is always happy to operate on an asset (even a collection asset), but if you are suggesting that we could accept a collection and just figure out if there is a zarr asset, I guess we'd need to know exactly which roles we are expecting the asset to have.

keewis commented 7 months ago

On the other hand it does seem like it's encouraged to add new roles (...) so if we feel that those are not explicit enough then we can create a "kerchunk" role.

Sounds good to me, but let's check. @m-mohr, what do you think? Does that sound like something we should be doing? Or do you know a better way (or someone who does)?

I guess we'd need to know exactly which roles we are expecting the asset to have.

I guess so. I had imagined we could just filter collection-level assets by media type / role as we do with items, but that was before realizing that xpystac simply forwards to either stackstac or odc.stac, so I'm not sure.

jsignell commented 7 months ago

I had imagined we could just filter collection-level assets by media type / role as we do with items, but that was before realizing that xpystac simply forwards to either stackstac or odc.stac so not sure

For items yeah, but for collections there isn't really a behavior defined yet. So I think we can just look for the zarr media_type and read it if found.
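A sketch of that proposed collection behavior, assuming assets are plain dicts with a "type" field (the media type string matches the convention Planetary Computer uses for zarr assets; the helper itself is hypothetical):

```python
ZARR_MEDIA_TYPE = "application/vnd+zarr"

def find_zarr_asset(assets):
    """Return the key of the first collection-level asset carrying the
    zarr media type, or None if the collection has no zarr asset."""
    for key, asset in assets.items():
        if asset.get("type") == ZARR_MEDIA_TYPE:
            return key
    return None
```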

m-mohr commented 7 months ago

Ideally, bring this up in the next community call :-) I don't have the time right now to read through the long issue to get the context, sorry. @keewis

keewis commented 7 months ago

of course! Where can I find more about that? Naively searching for "stac community call" does not yield any information on when and where to join.

m-mohr commented 7 months ago

@keewis Good question, I thought there would be information available, but seems like it's not yet. If you join https://groups.google.com/g/stac-community you should get an invite for the bi-weekly calls afaik.

jsignell commented 7 months ago

I started looking into opening the zarr asset if the user passes a collection. I was using planetary computer as an example and I think the trouble is that on any given collection there might be several different collection-level zarr or kerchunk assets. For instance: https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6 has 17 kerchunk assets. So I am kind of thinking that the current implementation of requiring the user to supply an asset directly to xpystac might be the best we can do.
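With that many candidates, the most xpystac could do automatically is enumerate them and make the caller choose. A hypothetical sketch, with assets modeled as plain dicts carrying a "roles" list:

```python
def kerchunk_asset_keys(assets, roles=("index", "references")):
    """List the keys of assets whose roles mark them as kerchunk files."""
    wanted = set(roles)
    return sorted(
        key for key, asset in assets.items()
        if wanted.intersection(asset.get("roles", []))
    )
```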

TomAugspurger commented 7 months ago

FWIW, don't place too much weight on how I did those. That was my first time using Kerchunk.

I was writing up a post explaining my reasoning behind the collection-level assets, but found I couldn't make sense of it :) I suspect that if I were redoing that dataset today, I would move, for example, the collection-level asset at CESM2.historical to the corresponding item for that (modeling center, time range) pair.

I think the best collection-level asset, if feasible, would be a set of Kerchunk references that concatenates all of these item-level references into a single Dataset (or DataTree? I'm not sure.)

jsignell commented 7 months ago

Ah ok that is helpful context. I think I agree that it feels conceptually like an item should be readable without the user needing to pick and choose assets.

clausmichele commented 3 months ago

Hi everyone, I'm jumping into the discussion trying to figure out how to create and read meaningful STAC Collections based on kerchunk'ed netCDFs.

Here is some sample code I'm using. The url refers to the sample Item I created. Unfortunately xpystac doesn't seem to handle it properly:

  1. The xarray object built from STAC contains only one date instead of 365 (I use start_date and end_date properties)
  2. I get this error from odc-stac: `RasterioIOError: '/vsicurl/https://gist.githubusercontent.com/clausmichele/725f41a4058e3c5159876d5b8bc3ad42/raw/62af0f40eeaee26e10689bffd8f645d5f76105f8/t2m_2001_crs.nc.json' not recognized as a supported file format.`

import pystac_client
import odc.stac
import json
import xpystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"

stac_api = pystac_client.stac_api_io.StacApiIO()
stac_dict_1 = json.loads(stac_api.read_text(url_1))
item_1 = stac_api.stac_object_from_dict(stac_dict_1)

ds1 = xr.open_dataset(item_1, engine="stac")
ds1

Do you have suggestions on how to proceed?

This code would work with my current data structure:

import pystac_client
import odc.stac
import json
import xpystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

stac_api = pystac_client.stac_api_io.StacApiIO()
stac_dict_1 = json.loads(stac_api.read_text(url_1))
item_1 = stac_api.stac_object_from_dict(stac_dict_1)
stac_dict_2 = json.loads(stac_api.read_text(url_2))
item_2 = stac_api.stac_object_from_dict(stac_dict_2)
items = [item_1,item_2]

datasets_list = []
for item in items:
    for asset in item.assets:
        print(item.assets[asset])
        if item.assets[asset].href.endswith(".json"):
            data = xr.open_dataset(
                "reference://",
                engine="zarr",
                decode_coords="all",
                backend_kwargs={
                    "storage_options": {
                        "fo":item.assets[asset].href,
                    },
                    "consolidated": False
                },chunks={}
            ).to_dataarray(dim="bands")
            datasets_list.append(data)
            # Need to create one Item per time/netCDF
data = xr.combine_by_coords(datasets_list,combine_attrs="drop_conflicts")
data


jsignell commented 3 months ago

Thanks for taking the time to write that up @clausmichele that is an interesting example!

For kerchunk, xpystac expects one asset pointing to a kerchunk file. There is no logic for one data variable in each asset of an item. For this reason, you might be better off keeping the combination logic in your code and using the "kerchunk" engine directly. Here is what that would look like:

import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            data = xr.open_dataset(asset.href, engine="kerchunk", chunks={})
            datasets_list.append(data)
data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data

Of course you could do the same thing with xpystac but you need to make sure that roles and media_type are set on the asset so that xpystac knows what it is looking at:

import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            asset.media_type = "application/json"
            asset.roles = ["index"]
            data = xr.open_dataset(asset, engine="stac", chunks={})
            datasets_list.append(data)
data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data

Possible future work

I can imagine a world in which xpystac builds up a little bit of its own stacking logic so that you could stack zarr or kerchunk assets -- which I don't think are supported by stackstac or odc-stac.

That would make it so (once you have fixed up your media_type and roles) you could just do:

data = xr.open_dataset(items, engine="stac", stacking_library="xpystac")

I wrote up a little proof of concept for this, but I'm not 100% sure it's a good idea to put very much logic into xpystac. Maybe the thing to do would be to get stackstac and odc-stac to support kerchunk themselves.

clausmichele commented 3 months ago
import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            data = xr.open_dataset(asset.href, engine="kerchunk", chunks={})
            datasets_list.append(data)
data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data

This returned an error for me: `ValueError: unrecognized engine kerchunk must be one of: ['netcdf4', 'h5netcdf', 'scipy', 'rasterio', 'stac', 'store', 'zarr']`. Solved with:

git clone https://github.com/fsspec/kerchunk
cd kerchunk
pip install .

Probably it's unreleased functionality?

Anyway, thanks for the suggestions!

jsignell commented 3 months ago

hmm yeah might not be released, but glad it worked!