intake / intake-datasets

Catalogs, data packages and resources for Intake

NASA CMR #2

Open jhamman opened 5 years ago

jhamman commented 5 years ago

We (@scottyhq, @apawloski, and others) are interested in building an Intake interface to NASA's CMR. NASA is currently in the early stages of moving its data storage to the public cloud (mostly AWS). As part of a NASA-funded project, we are interested in building an Intake meta-catalog of collections based on CMR.

NASA's "Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs all data and service metadata records for the [Earth Observing System Data and Information Systems (EOSDIS)] system and [is] the authoritative management system for all EOSDIS metadata."

References:

In talking with @martindurant, it sounds like we can probably write a custom Data Source wrapper around CMR. I'm hoping to connect Martin and Andrew here in this issue.

martindurant commented 5 years ago

To start with, I would check out the code of https://github.com/ContinuumIO/intake-sql/blob/master/intake_sql/sql_cat.py#L7 and https://github.com/ContinuumIO/intake-spark/blob/master/intake_spark/spark_cat.py#L6 : to create a catalog class, all you need to do is subclass from Catalog and provide a _load() method that creates some entry objects. Each of those entries, when called, creates a DataSource, an instance of the given driver.

In the case here, the catalog class itself would take inputs (i.e., the query to execute), query the server for data resources, parse those resources, and make an entry for each one with a type that we know how to load. That requires some understanding of the types of resource that might be returned and the set of driver classes that Intake has available.
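
In outline, the pattern is something like the following sketch. The class name, the placeholder _search() records, and the choice of the intake-xarray "netcdf" driver name are just for illustration, not a real implementation:

```python
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry


class CMRCatalog(Catalog):
    """Sketch: subclass Catalog and build entries in _load()."""

    name = "nasa_cmr"

    def __init__(self, query, **kwargs):
        self._query = query  # the CMR query parameters for this catalog
        super().__init__(**kwargs)

    def _search(self):
        # Placeholder for the real CMR call; a real implementation would hit
        # the CMR search API and return one record per data resource.
        return [
            {
                "name": "example_granule",
                "url": "s3://hypothetical-bucket/granule.nc",
                "driver": "netcdf",  # driver name registered by intake-xarray
                "description": "stand-in record for illustration",
            }
        ]

    def _load(self):
        # One LocalCatalogEntry per resource whose format we know how to load;
        # calling an entry instantiates a DataSource with the given driver.
        self._entries = {
            rec["name"]: LocalCatalogEntry(
                name=rec["name"],
                description=rec["description"],
                driver=rec["driver"],
                args={"urlpath": rec["url"]},
                metadata=rec,
                catalog=self,
            )
            for rec in self._search()
        }
```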

scottyhq commented 5 years ago

Thanks for getting this started. Rather than focusing on the entire CMR catalog, how about just the datasets currently stored on AWS? To narrow the scope even more, it seems there is convergence on the STAC spec as a minimal, lightweight catalog (since different satellites and sensors have so many unique attributes): https://github.com/radiantearth/stac-spec

I know work has already been done to integrate STAC and CMR: https://github.com/Element84/catalog-api-spec/tree/dev/implementations/e84

Although STAC doesn't require a specific format, images stored as Cloud-Optimized GeoTIFFs would be great!

apawloski commented 5 years ago

Very interested in this, thanks for the connection @jhamman. I'm still getting up to speed on Intake, so apologies for any cloudiness below.

Wondering if this would all be done via intake, or if another component is needed? The CMR is just a metadata search engine -- it points to datasets that sit elsewhere (e.g. at a NASA DAAC), and are stored in different formats.

So a user queries CMR with some set of parameters (area of interest, sensor, start/end time, etc -- see https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html), gets a list of search results, chooses a collection or granule, and accesses it wherever it's stored.
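
For reference, a raw CMR granule search is just an HTTP call, something like the sketch below. The collection short name, bounding box, and time range are made up, but the parameter names come from the search API docs linked above:

```python
import requests

# Illustrative query only; the short_name, bbox, and temporal values are invented.
resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/granules.json",
    params={
        "short_name": "MOD09GA",                    # collection of interest
        "bounding_box": "-123,45,-121,47",          # area of interest (W,S,E,N)
        "temporal": "2019-01-01T00:00:00Z,2019-01-31T23:59:59Z",
        "page_size": 10,
    },
)
resp.raise_for_status()
for granule in resp.json()["feed"]["entry"]:
    # Each granule carries links to wherever the data actually lives.
    print(granule["title"], [link["href"] for link in granule.get("links", [])][:1])
```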

The complicated part is that there is a lot of variance in how and where the actual datasets indexed by CMR are stored. To download from one collection, a user may be able to hit a public link. To download from another, that same user may need to first initiate a session against something like Earthdata login (https://urs.earthdata.nasa.gov/). A granule in one collection could be a netCDF file. In another, GeoTIFF. Etc.

What I'm trying to understand now is where intake lives in this user workflow. Are we talking about an intake-cmr which abstracts the issues above? Are we talking about a bunch of separate intake plugins that are used post-query for the data retrieval portion (e.g. intake-cmr-cog)? A plugin based on data collection? Something else?

I agree with Scott that CMR-indexed datasets in AWS are a good place to start. I think I still need to understand "how much of this can be done via intake?" first though.

martindurant commented 5 years ago

how much of this can be done via intake?

I can certainly answer how much could be done with Intake.

I am imagining a workflow like the following:

cat = intake.open_nasa_cmr({set of user query parameters})
cat.data1 => a public netCDF via opendap
cat.data2 => stack of TIFF on AWS/S3
cat.data3 => file requiring auth from user or environment or other avenue

(and then each query could be an entry in a master catalogue of queries of interest)

Here, one driver is nasa_cmr, responsible for executing the query and making sense of the returned information. Each of the data* entries (which would have better names than this, and descriptions from the information returned by the query) may have different drivers, such as the ones currently in intake_xarray or new ones that need to be written. Formats that Intake doesn't know how to handle either won't appear at all, or won't be usable.
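
From the user side, that might look roughly like this, assuming a driver registered as nasa_cmr (so that intake.open_nasa_cmr exists); the query keys are placeholders:

```python
import intake

# Hypothetical usage of the imagined workflow above.
cat = intake.open_nasa_cmr({"short_name": "MOD09GA", "page_size": 10})
print(list(cat))            # entry names built from the query results
first = list(cat)[0]
ds = cat[first].to_dask()   # loads via whichever driver that entry was given
```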

For the login/auth piece, the implementation will depend very much on how the system works, which I'm afraid I know nothing about right now. I assume it can be done via Python code, though, and perhaps it is as simple as passing storage_options= for S3 or other HTTP header keys.
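
As a guess at the simple case: intake-xarray's netcdf source takes a storage_options dict that gets forwarded to fsspec/s3fs, so S3 credentials (or an anonymous-access flag) could be supplied there. The bucket and credential values below are invented:

```python
import intake

# Invented path and credentials, just to show where storage_options would go;
# the dict is passed through to fsspec/s3fs by the intake-xarray driver.
source = intake.open_netcdf(
    "s3://hypothetical-protected-bucket/granule.nc",
    storage_options={"key": "AKIA...", "secret": "..."},  # or {"anon": True}
)
ds = source.to_dask()
```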

jhamman commented 5 years ago

This blog post has some interesting background and perhaps some ideas for moving forward: https://medium.com/radiant-earth-insights/the-state-of-stac-talk-and-sprint-3-recap-cd8eda3b8bdb

martindurant commented 5 years ago

You may find this interesting: an Intake catalog which makes queries on a remote server (MongoDB in this case) and generates entries dynamically: https://github.com/danielballan/intake-bluesky/blob/master/demo.ipynb

martindurant commented 5 years ago

Finally got around to watching that video. It is indeed rather a lot like a very specialised Intake, so it would be perfect to interpret it as a catalog object within Intake too and just reuse the machinery. Then you'd have the geospatial data alongside other Intake stuff.

@apawloski , did you make any progress here, would a conversation early next week help?

digitaltopo commented 5 years ago

I've been exploring Intake with the goal of connecting STAC catalogs as well. I'd love to help.

martindurant commented 5 years ago

@digitaltopo , @jhamman , @apawloski , how shall we coordinate?

jhamman commented 5 years ago

I would suggest a scheduled call with @digitaltopo, @jhamman, @apawloski, @scottyhq, and @martindurant. If someone has time to sketch out an intake-stac catalog before then, I'm sure it would make for good conversation points.

martindurant commented 5 years ago

I can make time to meet in the first half of the coming week

apawloski commented 5 years ago

Apologies for the late response - I was out this week. I'm available early next week. Maybe Monday or Tuesday afternoon (Eastern Time)?

martindurant commented 5 years ago

I am back online and around for the week. I have a meeting 2:30-3:30 (eastern) today and 2-3 tomorrow, but otherwise free.

scottyhq commented 5 years ago

A call would be great. I'm free this afternoon (EST), or could swing sometime between 3-5 EST tomorrow.

martindurant commented 5 years ago

@scottyhq , happy to meet one-on-one to get things rolling. 4pm eastern?

scottyhq commented 5 years ago

Ok, let's plan for 4pm eastern tomorrow. Hopefully others can join. @martindurant can you set up a meeting link? Thanks!

apawloski commented 5 years ago

I'm interested in joining! 4PM tomorrow sounds good.

martindurant commented 5 years ago

https://appear.in/mdurant

apawloski commented 5 years ago

Thanks for meeting with us @martindurant! It sounds like our path forward will involve a few steps:

  1. We get Intake up and running against a STAC API. This means writing search and list methods. We use existing intake-xarray methods for the data retrieval. (We are assuming the data format is already supported; a rough sketch follows at the end of this comment.)
  2. Using the same search method, and an updated-as-needed list method, we point towards a STAC CMR interface (like https://github.com/Element84/cmr-stac-api-proxy)
  3. We write an additional driver which supports downloads through an Earthdata Login session (https://earthdata.nasa.gov/about/science-system-description/eosdis-components/earthdata-login).

After this, we could update our search method to support CMR specific queries.
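
A very rough sketch of step 1, with plenty of assumptions: the /search endpoint path, the GeoJSON response shape, and taking the first asset of each item are guesses that will vary between STAC implementations and spec versions, and intake-xarray's rasterio driver stands in for whatever retrieval method we end up using:

```python
import requests
import intake

STAC_API = "https://example.com/stac"  # placeholder, e.g. a CMR STAC proxy

# Search: POST a bbox-limited query, get back a GeoJSON FeatureCollection.
resp = requests.post(
    f"{STAC_API}/search",
    json={"bbox": [-123, 45, -121, 47], "limit": 5},
)
resp.raise_for_status()

# List/retrieve: hand each item's first asset to intake-xarray's rasterio driver.
for item in resp.json().get("features", []):
    asset = next(iter(item["assets"].values()))
    source = intake.open_rasterio(asset["href"], chunks={})
    print(item["id"], asset["href"], type(source).__name__)
```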