keewis opened this issue 1 year ago
FWIW, I'm having some reservations about embedding Kerchunk metadata in STAC items. I'll post over in https://github.com/stac-utils/xpystac/issues/32.
For the media types stuff, AFAIK it's very challenging to officially register one. We might be able to use asset roles to indicate that the asset at that link (which might be JSON, parquet, ...) is intended to be consumed by Kerchunk.
Yeah I saw that when I was working on #32! That's great to see!!
> - define a kerchunk mime type (media type?) that would work regardless of whether the references are stored in a JSON file, a compressed JSON file, a parquet store or a (hypothetical) zarr store
So far xpystac has been using a combination of `media_type` and `role` to tell if something is a kerchunk file or not. Here is the code for that: https://github.com/stac-utils/xpystac/blob/342c1973157903b03eb8887954abf049a559b858/xpystac/core.py#L74-L76

Obviously the `media_type` solution isn't great for potentially supporting kerchunk metadata stored in parquet or something, but I think `role` is exactly right for this. So it's just a matter of agreeing on what the correct role is for kerchunk files. It looks like xpystac is expecting `index` or `references`, so maybe we just pick one of those?
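For illustration, the check described above could be sketched as a tiny predicate. This is a hypothetical helper, not xpystac's actual implementation; the role names are just the `index` / `references` candidates mentioned here:

```python
# Hypothetical helper: decide whether a STAC asset looks like kerchunk
# references based on its media type and roles.
def is_kerchunk_asset(media_type, roles):
    # A parquet-backed reference store would need more media types here.
    kerchunk_roles = {"index", "references"}
    return media_type == "application/json" and bool(
        kerchunk_roles & set(roles or [])
    )
```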
> - make `xpystac` detect the mime type and open the references using the pattern above
Yeah I think the work would be that when passed an item, we should look for an asset that has the specific `kerchunk` role. I am wondering if it's necessary to actually depend on the kerchunk library to achieve this functionality or whether the existing implementation is good enough:
The main issue I have with relying on the kerchunk library directly is the dependencies which I don't think you need if you are just reading.
> - decide whether or not it makes sense to have `xpystac` deal with collection level assets
I don't see why not. xpystac doesn't currently try to do anything with collections, so it should be fine to just look for a kerchunk asset in there.
oops Tom beat me to it :smile:
wow, it looks like I didn't read the readme or the code closely enough! Looks like `xpystac` already does what I was suggesting (although documentation on this feature is a bit sparse right now).
> Obviously the `media_type` solution isn't great for potentially supporting kerchunk metadata stored in parquet or something, but I think `role` is exactly right for this.
That makes sense to me. The only question I'd have is: are those sufficient? In other words, do we think that someone might want to use `"index"` or `"references"` as `role` in a different context? If yes, we might either have to use an even more specific role (`kerchunk`?) or we'd have to maintain a set of accepted media types (or get `kerchunk` to maintain it for us).

(maybe I'm overthinking this, and we can just fix it if anything like this occurs)
> The main issue I have with relying on the kerchunk library directly is the dependencies which I don't think you need if you are just reading.
For reference, which dependency of `kerchunk` is it that you don't think is necessary for reading? Both the package metadata and the conda-forge feedstock list `fsspec`, `numcodecs`, `ujson` and `zarr` (the package metadata also has `numpy`) as dependencies. Of these, I believe only `ujson` is not strictly necessary for reading (I might be wrong, though), but it is usually faster than the builtin `json` so it doesn't hurt either.
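The `ujson`-as-optional-speedup point can be illustrated with the usual fallback pattern (a generic sketch, not kerchunk's actual code):

```python
import json

# Prefer ujson when it is installed; fall back to the stdlib json otherwise.
# Both expose compatible loads()/dumps() for this use case.
try:
    import ujson as fast_json
except ImportError:
    fast_json = json

refs = fast_json.loads('{"version": 1, "refs": {}}')
```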
That said, as it is, the mapper creation is hidden, which was the main point of this issue, so the only argument for switching to `engine="kerchunk"` that I have is that it would make the implementation in `xpystac` a bit more symmetric.
> I don't see why not. xpystac doesn't currently try to do anything with collections, so it should be fine to just look for a kerchunk asset in there.
I added that because I've seen @TomAugspurger has some `zarr` stores of model data as collection-level assets, which I think makes sense (and I'd treat reference files the same as `zarr` stores).
> documentation on this feature is a bit sparse right now
lol what an understatement :)
> maybe I'm overthinking this, and we can just fix it if anything like this occurs
Yeah I suspect that we can just try to treat the `"index"` or `"reference"` role as kerchunk and do some nice error handling for the case where an asset with that role exists and is not kerchunk. On the other hand, it does seem like it's encouraged to add new roles (https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#list-of-asset-roles) so if we feel that those are not explicit enough then we can create a `"kerchunk"` role.
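The "treat it as kerchunk and fail nicely" idea could look something like this. All names here are hypothetical, and the validity check is deliberately loose:

```python
import json

# Hypothetical: given the text of an asset with an "index"/"references" role,
# either return the parsed kerchunk references or raise a clear error that
# names the offending asset.
def load_references(text, asset_key):
    try:
        refs = json.loads(text)
        if "refs" not in refs and "version" not in refs:
            raise ValueError("no kerchunk-style keys found")
        return refs
    except (json.JSONDecodeError, ValueError, TypeError) as err:
        raise ValueError(
            f"asset {asset_key!r} has an index-like role but does not "
            f"look like kerchunk references: {err}"
        ) from err
```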
> which dependency of kerchunk is it that you don't think is necessary for reading?
I was thinking `numcodecs` but on second thought, maybe I was misremembering a different dependency issue that I was running into on #33. I do agree that even if it's just 6 lines it's a lot cleaner to not duplicate the read functionality.
> I added that because I've seen @TomAugspurger has some zarr stores of model data as collection-level assets, which I think makes sense (and I'd treat reference files the same as zarr stores).
xpystac is always happy to operate on an asset (even a collection asset) but if you are suggesting that we could accept a collection and just figure out if there is a zarr asset I guess we'd need to know exactly which roles we are expecting the asset to have.
> On the other hand it does seem like it's encouraged to add new roles (...) so if we feel that those are not explicit enough then we can create a `"kerchunk"` role.
Sounds good to me, but let's check. @m-mohr, what do you think? Does that sound like something we should be doing? Or do you know a better way (or someone who does)?
> I guess we'd need to know exactly which roles we are expecting the asset to have.
I guess so. I had imagined we could just filter collection-level assets by media type / role as we do with items, but that was before realizing that `xpystac` simply forwards to either `stackstac` or `odc.stac`, so I'm not sure.
> I had imagined we could just filter collection-level assets by media type / role as we do with items, but that was before realizing that xpystac simply forwards to either stackstac or odc.stac so not sure
For items yeah, but for collections there isn't really a behavior defined yet. So I think we can just look for the zarr media_type and read it if found.
Ideally, bring this up in the next community call :-) I don't have the time right now to read through the long issue to get the context, sorry. @keewis
of course! Where can I find more about that? Naively searching for "stac community call" does not yield any information on when and where to join.
@keewis Good question, I thought there would be information available, but seems like it's not yet. If you join https://groups.google.com/g/stac-community you should get an invite for the bi-weekly calls afaik.
I started looking into opening the zarr asset if the user passes a collection. I was using planetary computer as an example and I think the trouble is that on any given collection there might be several different collection-level zarr or kerchunk assets. For instance: https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6 has 17 kerchunk assets. So I am kind of thinking that the current implementation of requiring the user to supply an asset directly to xpystac might be the best we can do.
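One way to surface that ambiguity would be to filter the collection-level assets by role and refuse to guess when several match. A sketch using plain dicts instead of pystac objects to stay self-contained; the asset names are invented:

```python
# Hypothetical: pick out the assets that look like kerchunk references.
def candidate_kerchunk_assets(assets):
    return {
        key: asset
        for key, asset in assets.items()
        if {"index", "references"} & set(asset.get("roles") or [])
    }

assets = {
    "tasmax": {"href": "tasmax.json", "roles": ["index"]},
    "pr": {"href": "pr.json", "roles": ["references"]},
    "thumbnail": {"href": "thumb.png", "roles": ["thumbnail"]},
}
matches = candidate_kerchunk_assets(assets)
# With more than one match there is no unambiguous choice, which is why
# requiring the user to pass an asset directly may be the best option.
```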
FWIW, don't place too much weight on how I did those. That was my first time using Kerchunk.
I was writing up a post explaining my reasoning behind the collection-level assets, but found I couldn't make sense of it :) I suspect that if I were redoing that dataset today, I would move, for example, the collection-level asset at `CESM2.historical` to the corresponding item for that (modeling center, time range) pair.
I think the best collection-level asset, if feasible, would be a set of Kerchunk references that concatenates all of these item-level references into a single Dataset (or DataTree? I'm not sure.)
Ah ok that is helpful context. I think I agree that it feels conceptually like an item should be readable without the user needing to pick and choose assets.
Hi everyone, I'm jumping into the discussion trying to figure out how to create and read meaningful STAC Collections based on Kerchunk'ed netCDFs.
Here is a sample code I'm using. The url refers to the sample Item I created. Unfortunately it doesn't seem to be able to handle it properly:

```
RasterioIOError: '/vsicurl/https://gist.githubusercontent.com/clausmichele/725f41a4058e3c5159876d5b8bc3ad42/raw/62af0f40eeaee26e10689bffd8f645d5f76105f8/t2m_2001_crs.nc.json' not recognized as a supported file format.
```
```python
import pystac_client
import odc.stac
import json
import xpystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"

stac_api = pystac_client.stac_api_io.StacApiIO()
stac_dict_1 = json.loads(stac_api.read_text(url_1))
item_1 = stac_api.stac_object_from_dict(stac_dict_1)

ds1 = xr.open_dataset(item_1, engine="stac")
ds1
```
Do you have suggestions on how to proceed?
This code would work with my current data structure:
```python
import pystac_client
import odc.stac
import json
import xpystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

stac_api = pystac_client.stac_api_io.StacApiIO()
stac_dict_1 = json.loads(stac_api.read_text(url_1))
item_1 = stac_api.stac_object_from_dict(stac_dict_1)
stac_dict_2 = json.loads(stac_api.read_text(url_2))
item_2 = stac_api.stac_object_from_dict(stac_dict_2)

items = [item_1, item_2]
datasets_list = []
for item in items:
    for asset in item.assets:
        print(item.assets[asset])
        if item.assets[asset].href.endswith(".json"):
            data = xr.open_dataset(
                "reference://",
                engine="zarr",
                decode_coords="all",
                backend_kwargs={
                    "storage_options": {
                        "fo": item.assets[asset].href,
                    },
                    "consolidated": False,
                },
                chunks={},
            ).to_dataarray(dim="bands")
            datasets_list.append(data)

# Need to create one Item per time/netCDF
data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```
Thanks for taking the time to write that up @clausmichele that is an interesting example!
For kerchunk, xpystac expects one asset pointing to a kerchunk file. There is no logic for one data variable in each asset of an item. For this reason, you might be better off keeping the combination logic in your code and using the `"kerchunk"` engine directly. Here is what that would look like:
```python
import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            data = xr.open_dataset(asset.href, engine="kerchunk", chunks={})
            datasets_list.append(data)

data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```
Of course you could do the same thing with xpystac, but you need to make sure that `roles` and `media_type` are set on the asset so that xpystac knows what it is looking at:
```python
import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            asset.media_type = "application/json"
            asset.roles = ["index"]
            data = xr.open_dataset(asset, engine="stac", chunks={})
            datasets_list.append(data)

data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```
I can imagine a world in which xpystac builds up a little bit of its own stacking logic so that you could stack zarr or kerchunk assets -- which I don't think are supported by `stackstac` or `odc-stac`.
That would make it so (once you have fixed up your `media_type` and `roles`) you could just do:

```python
data = xr.open_dataset(items, engine="stac", stacking_library="xpystac")
```
I wrote up a little proof of concept for this, but I'm not 100% sure it's a good idea to put very much logic into xpystac. Maybe the thing to do would be to get stackstac and odc-stac to support kerchunk themselves.
```python
import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            data = xr.open_dataset(asset.href, engine="kerchunk", chunks={})
            datasets_list.append(data)

data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```
This returned an error for me:

```
ValueError: unrecognized engine kerchunk must be one of: ['netcdf4', 'h5netcdf', 'scipy', 'rasterio', 'stac', 'store', 'zarr']
```
Solved with:

```shell
git clone https://github.com/fsspec/kerchunk
cd kerchunk
pip install .
```

probably it's an unreleased functionality?
Anyway, thanks for the suggestions!
hmm yeah might not be released, but glad it worked!
`kerchunk` somewhat recently included an `open_dataset` engine. This allows getting rid of the somewhat confusing and certainly hard to remember "create a fsspec mapper and feed it to zarr" pattern that was necessary before.
I believe we can make use of this to get `xpystac` to transparently open reference files from STAC items / collections. Note that this would be different from the proposal in #32, which aims to store the references directly in STAC (as far as I can tell, #32 would avoid the additional download step, but wouldn't be able to support enormous reference data or file formats other than JSON).

To make this happen, I think we would need to:

1. define a kerchunk mime type (media type?) that would work regardless of whether the references are stored in a JSON file, a compressed JSON file, a parquet store or a (hypothetical) zarr store
2. make `xpystac` detect the mime type and open the references using the pattern above
3. decide whether or not it makes sense to have `xpystac` deal with collection level assets

I can help with 2, but I don't really have an idea of how to construct (and register?) a proper mime type. 3 would need some discussion, but should be easy to implement.
cc @dwest77a, @TomAugspurger