intake / intake-stac

Intake interface to STAC data catalogs
https://intake-stac.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License
110 stars 25 forks

How to pass `xarray_kwargs` from STAC catalog through to `intake` #110

Closed kthyng closed 3 years ago

kthyng commented 3 years ago

Hi! I am setting up a STAC catalog which is to be searched using pystac-client and the result then read into Python with intake, intake-stac, and intake-xarray. I would like to be able to pass through xarray_kwargs from my STAC catalog all the way through to the resulting intake catalog so that I can read in the datasets to xarray directly using information stored in the STAC catalog entries. I can pass xarray_kwargs into the "metadata" section of an intake entry, but not into a known xarray_kwargs attribute that is actually used when the dataset is read in. Is there a way to encode this properly in the STAC catalog so it passes all the way through? Thank you for any help.
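To make the distinction concrete, here is a hypothetical sketch of an intake source description written as a plain dict (the entry names and paths are invented): values under "metadata" are informational only, while arguments at the driver level (`xarray_kwargs`) are actually used when the source is read.

```python
# Hypothetical intake source description, shown as a plain dict.
entry = {
    "driver": "netcdf",
    "args": {
        "urlpath": "file:///data/test.nc",
        # reaches xr.open_dataset when the source is opened
        "xarray_kwargs": {"drop_variables": ["obs"]},
    },
    "metadata": {
        # visible in the intake GUI, but never used at read time
        "plots": {"quickview": {"kind": "line"}},
    },
}
print(entry["args"]["xarray_kwargs"])  # {'drop_variables': ['obs']}
```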

TomAugspurger commented 3 years ago

Just to clarify, in your STAC catalog are you using the xarray-assets extension to store those arguments? If so, then I don't believe intake-stac currently uses xarray-assets.

There's a bit of discussion of this in https://github.com/intake/intake-stac/pull/90, but I haven't finished that up to actually use xarray-assets.

kthyng commented 3 years ago

Thanks for your response @TomAugspurger! No I am not using xarray-assets. Should I be? I have seen lots of useful-sounding packages to support this effort but it's been hard to tell what fits my needs and what doesn't. I am just storing information in the "extra_fields" section of the asset when I add an asset to an item that will go into the STAC catalog. With this approach, I am able to access things like pre-configured plots which I can view using the intake GUI — they go by default under the "metadata" heading in the intake catalog entry. But it looks like the xarray_kwargs need to be their own heading, not under metadata. Does this make sense? Or if not what would be a useful example to share?

TomAugspurger commented 3 years ago

I think we're still figuring out best practices :)

But yes, I think xarray-assets is appropriate to use in this case. As you say, you can store whatever you want on the object and it'll be available in extra_fields. But I think that a bit of structure is helpful here, so that tools like intake-stac have a standardized set of keys to check. I'm happy to hear what you / others think about this though.
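As a rough sketch of that structure (the key names follow the xarray-assets extension; the helper and the surrounding values are invented for illustration):

```python
# Hypothetical sketch: free-form metadata can coexist with the standardized
# xarray-assets keys ("xarray:storage_options", "xarray:open_kwargs") that
# tools like intake-stac know to look for.
extra_fields = {
    "plots": {"quickview": {"kind": "line", "x": "time"}},  # ad-hoc metadata
    "xarray:storage_options": {"anon": True},
    "xarray:open_kwargs": {"chunks": {}},
}

def get_open_kwargs(asset_extra_fields):
    """Return the kwargs a reader would pass to xarray, defaulting to {}."""
    return asset_extra_fields.get("xarray:open_kwargs", {})

print(get_open_kwargs(extra_fields))  # {'chunks': {}}
```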

I think examples would be helpful here. https://github.com/TomAugspurger/xstac/blob/main/examples/daymet/daily/hi/collection.json has an example collection, and https://planetarycomputer.microsoft.com/docs/quickstarts/reading-zarr-data/ has an example using it to load the data. IMO, intake-stac should be updated to do something similar to

import fsspec
import xarray as xr

store = fsspec.get_mapper(asset.href, **asset.extra_fields["xarray:storage_options"])
ds = xr.open_zarr(store, **asset.extra_fields["xarray:open_kwargs"])
kthyng commented 3 years ago

Ok I think that makes sense. I can incorporate xarray-assets, understanding that what I need will eventually be possible.

In case it matters, I currently just need to access netCDF files, but I assume the same workflow will also work for non-zarr files. At this point I'm more likely to use dask than zarr, though maybe zarr too in the future.
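For what it's worth, a reader could dispatch on the asset's media type to pick the matching xarray opener. A minimal sketch (the function and the exact media-type strings are assumptions, not part of any package):

```python
# Hypothetical dispatch from a STAC asset media type to an xarray opener name.
def choose_opener(media_type):
    if "zarr" in media_type:
        return "open_zarr"      # zarr stores: xr.open_zarr(store, **open_kwargs)
    if "netcdf" in media_type:
        return "open_dataset"   # netCDF files: xr.open_dataset(path, **open_kwargs)
    raise ValueError(f"no xarray opener known for {media_type!r}")

print(choose_opener("application/netcdf"))  # open_dataset
```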

I should have said before, when I mentioned seeing tools available but being unsure which to use, that I've primarily gotten my information from following your Pangeo Discourse thread, which has been helpful for knowing what people are working on (though a bit confusing at times). Thanks so much for your work on these packages!

kthyng commented 3 years ago

@TomAugspurger On what sort of timeline do you plan to have xarray-assets support working? I'm thinking about when to integrate it into my catalog generation.

TomAugspurger commented 3 years ago

Sorry for the delay. Let me get https://github.com/intake/intake-stac/pull/90 updated now, it shouldn't take too long.

That's also adding a new API for loading collection-level assets that I'd appreciate some input on. See https://github.com/intake/intake-stac/pull/90#issuecomment-868484712 (cc @jhamman and @scottyhq if either of you have thoughts).

And if that API discussion is a sticking point, I'll see if I can extract the xarray-assets section.

TomAugspurger commented 3 years ago

@kthyng we can continue debugging the issue from https://github.com/intake/intake-stac/pull/90#issuecomment-947137580 here.

I noticed the snippet I shared was incorrect; I had the wrong key. Instead of properties['xarray_kwargs'] = {'drop_variables': 'obs'} it should be

properties['xarray:open_kwargs'] = {'drop_variables': 'obs'}

I wonder if that would fix it for you?

If you're able to share the STAC item (and maybe NetCDF file too) publicly I'd be happy to take a closer look.

kthyng commented 3 years ago

Thanks @TomAugspurger! Right, I should have updated that, but I wasn't sure whether you intended it that way, since that is the syntax I've seen in ... intake-xarray, I think? (Which hasn't worked for me.)

I tried with properties['xarray:open_kwargs'] = {'drop_variables': 'obs'} (or actually two variables to drop), and when I then opened the dataset with .to_dask() from intake the variables were still present, so it doesn't seem to be working.

These should work for you. I updated the file name in the JSON file to match test.nc; hopefully I'm not missing anything that would make them inconsistent.

test-item.json: https://drive.google.com/file/d/1-6V7F5Fvk9RZLLvBwcjnIDh7FuruvrcF/view?usp=sharing
test.nc: https://drive.google.com/file/d/1-3E-Q3sLYhjvun7QWmxC1fFFCDXH4d0Z/view?usp=sharing

kthyng commented 3 years ago

@TomAugspurger I thought of a few more things. I have been using sat-search instead of pystac-client to query the STAC catalog from the server I have because it works better with the intake GUI which I'm using in a Jupyter notebook. However, I saw that you used pystac-client in your Pangeo presentation today and that you were able to "query" for specific items that were probably defined in an extension. I think I am finally understanding extensions as defined places to stick specific metadata, which pystac-client can then look for if desired. I don't know why it took me so long to figure this out, but going through your specific example helped. So, now I really want to get extensions working! I will want to use datacube also to help define my datasets in time, space and with variables (though they are time series instead of cubes).

kthyng commented 3 years ago

@TomAugspurger Sorry for all the notes, but I wanted to come back to one difference between what we have each been saying: "properties" vs. "extra_fields" (it is "properties" for the item and "extra_fields" for the asset, apparently, for pystac to accept the inputs). I don't know if the difference is significant but didn't want to neglect it just in case. Here is sample code for how I define an item and then its asset using pystac (all of this essentially follows a pystac tutorial):

item = pystac.Item(id=dataset_id,
                   geometry=footprint,
                   bbox=bbox,
                   datetime=start_date,
                   properties={},
                   stac_extensions=[xarray_assets_url])

properties = {'fields': fields, 'plots': plots, 'varnames': keysused,
              'data_vars': list(ds.data_vars.keys()),
              'start_datetime': start_date.strftime("%Y-%m-%dT%H"),
              'end_datetime': end_date.strftime("%Y-%m-%dT%H"),
              'xarray:open_kwargs': {'drop_variables': ['obs', 'driver_timestamp']}}

asset = {'href': f'file://{file}', 'media_type': "application/netcdf", 'title': title2,
         'extra_fields': properties, 'roles': ['data']}

item.add_asset(
    key=item.id,
    asset=pystac.Asset(**asset)
)
TomAugspurger commented 3 years ago

@kthyng thanks for sharing those files. Can you try with this for the xarray:open_kwargs part of your test-item.json (I just edited it manually)?

      "xarray:open_kwargs": {
        "xarray_kwargs": {
          "drop_variables": [
            "obs",
            "driver_timestamp"
          ]
        }
      },

I think in your pystac code it'll be

properties = {'fields': fields, 'plots': plots, 'varnames': keysused,
              'data_vars': list(ds.data_vars.keys()),
              'start_datetime': start_date.strftime("%Y-%m-%dT%H"),
              'end_datetime': end_date.strftime("%Y-%m-%dT%H"),
              'xarray:open_kwargs': {"xarray_kwargs": {'drop_variables': ['obs', 'driver_timestamp']}}}

but that's untested.

To unpack what's going on there:

  1. The xarray-assets STAC extension puts everything it defines under xarray:open_kwargs (and xarray:storage_options)
  2. The main branch of intake-stac now looks for that xarray:open_kwargs and passes it through to the intake driver (intake_xarray.netcdf.NetCDFSource in this case)
  3. NetCDFSource has a keyword, xarray_kwargs, which is where we want to pass drop_variables

So there are multiple layers, with similar but slightly different names. With that change to the item, I have

In [1]: import intake

In [2]: item = intake.open_stac_item("test-item.json")

In [3]: ds = item[list(item)[0]].to_dask()

In [4]: "obs" in ds.variables
Out[4]: False
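The nesting can be sketched with plain dicts (this just mirrors the three layers above; the real forwarding happens inside intake-stac and intake-xarray):

```python
# Layer 1: the xarray-assets key on the STAC asset
asset_extra_fields = {
    "xarray:open_kwargs": {
        # Layer 2: the intake-xarray NetCDFSource keyword
        "xarray_kwargs": {
            # Layer 3: the actual xr.open_dataset keyword
            "drop_variables": ["obs", "driver_timestamp"]
        }
    }
}

# intake-stac pulls out xarray:open_kwargs and hands it to the driver...
driver_kwargs = asset_extra_fields["xarray:open_kwargs"]
# ...and NetCDFSource forwards xarray_kwargs on to xr.open_dataset.
print(driver_kwargs["xarray_kwargs"]["drop_variables"])
```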

I'll take a look at your question about sat-search later.

kthyng commented 3 years ago

@TomAugspurger Thanks again, this is really clear and you are helping me a ton; I appreciate it.

Ok so I edited the file directly as you suggested and at first I got "obs" in ds.variables as True. This was with

intake-stac               0.3.0                      py_0    conda-forge

Then I installed the dev version of intake-stac from github and indeed did get "obs" in ds.variables as False! So to be clear that is with

intake-stac               0.3.0.post85             pypi_0    pypi

So it looks like it is important to use the newer changes in intake-stac!

I also verified that I could modify the code as you wrote with

'xarray:open_kwargs': {"xarray_kwargs": {'drop_variables': ['obs', 'driver_timestamp']}}

and then was able to have the keywords come through as checked by:

import pystac_client
import intake

api = pystac_client.Client.open(URL)
results = api.search()
cat = intake.open_stac_item_collection(results.get_all_items())

# both item and asset named "test-item"
cat['test-item']['test-item'].to_dask()  # `driver_timestamp` not present

Ok so this fixes many of my problems but I do have a couple of questions:

  1. How can I query using pystac-client like you did in your example presentation? Do you query any text in the item or do you specifically query known metadata keywords that are defined in extensions? Could you share example code? I screenshotted that point in your presentation but my computer apparently doesn't save screenshots anymore (?!) so I lost it.
  2. Can I use a similar approach as this for using the datacube extension? It seems that this would be a convenient way for me to (with my question 1) be able to search on the attributes of the dataset. What would the syntax be to get it to come all the way through to intake?

(I think this all supersedes the other notes I gave before and about sat-search too — I think I'll be able to work around the issue I had with pystac-client from before given these new capabilities!)

TomAugspurger commented 3 years ago

About sat-search vs. pystac-client, I'd recommend pystac-client. It's the library being actively maintained these days.

How can I query using pystac-client like you did in your example presentation?

There's an example at https://planetarycomputer.microsoft.com/docs/tutorials/cloudless-mosaic-sentinel2/#Discover-data using query.

Using pystac-client (or sat-search) currently requires a STAC API endpoint to query against. In other words, it doesn't work with a static STAC catalog (a list of STAC items on disk). There's a feature request at https://github.com/stac-utils/pystac-client/issues/66 for pystac-client to work with static STAC catalogs, but it's not possible today.
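Until then, one workaround for a static catalog is to filter client-side after loading the items from disk. A rough sketch over plain item dicts (the helper function is invented, not part of pystac):

```python
# Hypothetical client-side filter: keep item dicts (as loaded from the
# catalog's JSON files) whose properties exactly match the query.
def filter_items(items, **query):
    return [
        item for item in items
        if all(item.get("properties", {}).get(k) == v for k, v in query.items())
    ]

items = [
    {"id": "a", "properties": {"eo:cloud_cover": 5}},
    {"id": "b", "properties": {"eo:cloud_cover": 80}},
]
# query keys with ":" can't be keyword arguments, so unpack a dict
print([i["id"] for i in filter_items(items, **{"eo:cloud_cover": 5})])  # ['a']
```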

Can I use a similar approach as this for using the datacube extension?

I'm not sure, because 1) I've only ever filtered on basic things like eo:cloud_cover, and 2) it depends on the API and its backend. I haven't filtered on nested properties like "give me items where the cube:variables mapping contains wind_speed". It seems like https://github.com/radiantearth/stac-api-spec/tree/master/fragments/query allows for "contains", but the Planetary Computer API / backend doesn't allow it right now.
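If server-side "contains" isn't available, the equivalent check is easy to do client-side once the items are in hand. A sketch (the helper is invented; cube:variables is the datacube extension's mapping from variable name to its description):

```python
# Hypothetical client-side "contains" check against the datacube extension.
def has_variable(item, name):
    return name in item.get("properties", {}).get("cube:variables", {})

item = {"properties": {"cube:variables": {"wind_speed": {"type": "data"}}}}
print(has_variable(item, "wind_speed"))  # True
```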

The STAC API item search specification is under some flux right now, so this could change over the next few months / years.

kthyng commented 3 years ago

Thanks for all this information @TomAugspurger. The original question has been answered, which is that if you are using pystac-client to filter your dynamic STAC catalog, you can include code like 'xarray:open_kwargs': {"xarray_kwargs": {'drop_variables': ['obs', 'driver_timestamp']}} in your item to pass keyword arguments to xr.open_dataset.

The other questions are more long term. pystac-client currently doesn't work with intake to be able to do a basic text search but sat-search does, so I will probably be digging into how to improve that. Alternatively, the search in intake won't be necessary if I can in the future query on metadata specified in extensions with pystac-client itself. We've been using franklin to serve a dynamic STAC catalog so perhaps that is where the development would need to occur to allow for this behavior.

Thanks again for your help and I'm sure I'll be back around soon!