nasa / python_cmr

Python library for querying the common metadata repository.
MIT License

Support searching for available zarr stores #59

Closed briannapagan closed 2 months ago

briannapagan commented 6 months ago

At GES DISC we are getting ready to expose our production public zarr stores. Lots of open questions remain on how to make these zarr stores easily searchable, especially because we publish zarr stores at the variable, not the collection level.

In umm_json we are specifying zarr as the instance_information, example here: https://cmr.earthdata.nasa.gov/search/variables.umm_json?instance-format=zarr&provider=GES_DISC

So far python_cmr hasn't brought over much of the CMR API's functionality for VariableQuery, and I would like to add a piece of code that makes searching for variable zarr stores more intuitive for the user:

import re
from cmr import VariableQuery 
def get_all_zarr_stores(provider = None):
    api = VariableQuery()
    if provider: 
        all_vars = api.provider(provider).get_all()
    else:
        all_vars = api.get_all()
    zarrs = []
    for variable_entry in all_vars:
        try:
            if variable_entry['instance_information']:
                zarrs.append(variable_entry)
        except KeyError:
            continue
    return zarrs

def query_zarr_stores(zarr_stores,short_name, version, variable=None):
    zarrs = []
    if variable:
        pattern = short_name + '.*' + version + '.*' + variable + '.*'
    else:         
        pattern = short_name + '.*' + version + '.*'

    for store in zarr_stores:
        try:
            if re.match(pattern, store["native_id"]):
                zarrs.append(store)
        except KeyError:
            continue
    return zarrs

# one way to search just knowing provider
provider = "GES_DISC"
zarr_stores = get_all_zarr_stores(provider = provider)

# another way (perhaps more intuitive) where user can input short_name, version and variable
short_name = "GPM_3IMERGHH"
version = "06"
variable = "precipitationCal"
zarr_stores = query_zarr_stores(zarr_stores, short_name, version, variable)

Any thoughts/feedback before I suggest a PR to add instance_information as an additional query parameter in VariableQuery?

liredell commented 6 months ago

For the second option, maybe adding DOI to the searchable fields to find all the zarr stores from a particular data collection.

chuckwondo commented 6 months ago

@briannapagan, I'm not sure this totally makes sense as logic that belongs in python_cmr because there is no way to have the CMR query return only variables that have associated instance_information.

We must simply filter the results that the CMR returns to us, as you have written in your example. If we were to add such logic to python_cmr, then that might open us up to adding all sorts of post query filtering for any sort of property that may or may not be returned in the results, but which we cannot specify as part of the CMR query itself. This is logic that I suggest does not belong in this library.

For example, I would suggest simplifying your code to something like so (which would remain in your code, not be put into python_cmr):

zarr_stores = [var for var in VariableQuery().get_all() if "instance_information" in var]
# OR
zarr_stores = [var for var in VariableQuery().provider(provider).get_all() if "instance_information" in var]

briannapagan commented 6 months ago

Hi @chuckwondo, thanks for the feedback. There are CMR endpoints that we can build into python_cmr, for example:

https://cmr.uat.earthdata.nasa.gov/search/variables?instance-format=zarr&provider=GES_DISC&keyword=C1225808238-GES_DISC&pretty=true
https://cmr.uat.earthdata.nasa.gov/search/variables?instance-format=zarr
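
For illustration, URLs like the ones above can be assembled from their query parameters. This is a minimal sketch using only the standard library; the helper name build_variable_search_url and its underscore-to-hyphen convention are assumptions made for this example and are not part of python_cmr.

```python
from urllib.parse import urlencode

# Base URL copied from the UAT links in this comment.
CMR_UAT_VARIABLES = "https://cmr.uat.earthdata.nasa.gov/search/variables"


def build_variable_search_url(base_url, **params):
    """Return base_url with params appended as a query string.

    Hypothetical helper: underscores in keyword names become hyphens, so
    instance_format="zarr" is sent as instance-format=zarr. Parameters
    set to None are skipped.
    """
    query = {
        key.replace("_", "-"): value
        for key, value in params.items()
        if value is not None
    }
    return f"{base_url}?{urlencode(query)}" if query else base_url


url = build_variable_search_url(
    CMR_UAT_VARIABLES, instance_format="zarr", provider="GES_DISC"
)
print(url)
```

The point is only that instance-format is an ordinary query parameter, so it can be set the same way as provider.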

We haven't added much to VariableQuery yet. Other queries already have methods for parameters such as short_name, for example in queries.py:

    def short_name(self, short_name: str) -> Self:
        """
        Filter by short name (aka product or collection name).

        :param short_name: name of collection
        :returns: self
        """

        if not short_name:
            return self

        self.params['short_name'] = short_name
        return self

I want the same now for instance_information

chuckwondo commented 6 months ago

What I'm saying is that the CMR does not support querying by instance_information, so adding instance_information to the query params won't work. In fact, it will simply generate an error response. For example, the URL https://cmr.uat.earthdata.nasa.gov/search/variables?instance-information=true returns this response:

{
  "errors": [
    "Parameter [instance_information] was not recognized."
  ]
}

This means that the only way to perform filtering is after you get your query results back from the CMR. You cannot tell the CMR not to return variables that do or don't have associated instance_information.
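
A rough sketch of that kind of post-query filtering, run on made-up records so that it works offline. The record layout here (a native_id plus an optional instance_information entry) is an assumption for illustration and may not match what the CMR actually returns.

```python
def has_instance_information(variable):
    """Return True if the record has a non-empty instance_information entry."""
    return bool(variable.get("instance_information"))


# Made-up records standing in for the result of VariableQuery().get_all().
sample_variables = [
    {"native_id": "example_var_a", "instance_information": {"format": "zarr"}},
    {"native_id": "example_var_b"},
    {"native_id": "example_var_c", "instance_information": None},
]

# The filter runs on the client, after the results have been fetched.
with_instances = [v for v in sample_variables if has_instance_information(v)]
print([v["native_id"] for v in with_instances])
```

Using .get() treats a missing key and an empty value the same way, which avoids the try/except KeyError pattern.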

chuckwondo commented 6 months ago

However, if you mean instance-format, not instance-information, then sure, open a PR to add an instance_format method.
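
As a starting point, such a method might look roughly like the sketch below, modeled on the short_name method quoted earlier. SketchVariableQuery is a hypothetical stand-in so the example runs on its own; the parameter key instance_format and the choice to accept either one string or a sequence of strings are assumptions, not the library's confirmed behavior.

```python
from typing import Dict, List, Sequence, Union


class SketchVariableQuery:
    """Hypothetical stand-in for cmr.VariableQuery; it only stores params."""

    def __init__(self) -> None:
        self.params: Dict[str, List[str]] = {}

    def instance_format(
        self, format: Union[str, Sequence[str]]
    ) -> "SketchVariableQuery":
        """
        Filter by instance format(s), e.g. "zarr".

        :param format: a single format name or a sequence of format names
        :returns: self
        """
        if not format:
            return self

        # Normalize to a list so one or many formats are stored the same way.
        self.params["instance_format"] = (
            [format] if isinstance(format, str) else list(format)
        )
        return self


query = SketchVariableQuery().instance_format("zarr")
print(query.params)
```

Returning self keeps the method chainable with calls such as provider(...), in line with the other query methods.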

briannapagan commented 6 months ago

Sorry @chuckwondo I meant instance_format :)

briannapagan commented 6 months ago

> For the second option, maybe adding DOI to the searchable fields to find all the zarr stores from a particular data collection.

@liredell this would require DOI to be a searchable field for variables, which is a feature that would have to be requested from CMR.