Closed briannapagan closed 2 months ago
@briannapagan, I'm not sure this totally makes sense as logic that belongs in python_cmr, because there is no way to have the CMR query return only variables that have associated instance_information.
We must simply filter the results that the CMR returns to us, as you have written in your example. If we were to add such logic to python_cmr, then that might open us up to adding all sorts of post-query filtering for any sort of property that may or may not be returned in the results, but which we cannot specify as part of the CMR query itself. This is logic that I suggest does not belong in this library.
For example, I would suggest simplifying your code to something like this (which would remain in your code, not be put into python_cmr):
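from cmr import VariableQuery

zarr_stores = [var for var in VariableQuery().get_all() if "instance_information" in var]
# OR, limited to one provider (provider is a provider ID string such as "GES_DISC")
zarr_stores = [var for var in VariableQuery().provider(provider).get_all() if "instance_information" in var]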
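A minimal sketch of keeping that filter on the caller's side as a small reusable helper. The helper name and the sample records are hypothetical (not part of python_cmr), and it assumes each result is a plain dict, as in the list comprehensions above:

```python
from typing import Any, Dict, Iterable, List


def with_instance_information(variables: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the variables that carry an "instance_information" entry."""
    return [var for var in variables if "instance_information" in var]


# Made-up records standing in for whatever a variable query returns.
sample = [
    {"name": "var_a", "instance_information": {"format": "zarr"}},
    {"name": "var_b"},
]

zarr_stores = with_instance_information(sample)
```

With real results, the argument would be the list returned by get_all() instead of the sample list.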
Hi @chuckwondo, thanks for the feedback. There are CMR endpoints that we can build into python_cmr, namely these endpoints:
https://cmr.uat.earthdata.nasa.gov/search/variables?instance-format=zarr&provider=GES_DISC&keyword=C1225808238-GES_DISC&pretty=true
https://cmr.uat.earthdata.nasa.gov/search/variables?instance-format=zarr
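To make the mapping from query parameters to these URLs concrete, here is a rough sketch using only the standard library. The function name is hypothetical, and converting underscores to hyphens is an assumption based on the hyphenated instance-format parameter in the URLs above:

```python
from urllib.parse import urlencode

CMR_UAT_VARIABLES = "https://cmr.uat.earthdata.nasa.gov/search/variables"


def variables_url(**params: str) -> str:
    """Build a variable search URL from keyword arguments."""
    # Turn Python-style names (instance_format) into hyphenated ones (instance-format).
    query = urlencode({key.replace("_", "-"): value for key, value in params.items()})
    return f"{CMR_UAT_VARIABLES}?{query}"


url = variables_url(instance_format="zarr", provider="GES_DISC")
```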
We haven't added much to VariableQuery. Other queries have methods for parameters such as short_name, for example in queries.py:
def short_name(self, short_name: str) -> Self:
    """
    Filter by short name (aka product or collection name).
    :param short_name: name of collection
    :returns: self
    """
    if not short_name:
        return self
    self.params['short_name'] = short_name
    return self
I want the same now for instance_information
What I'm saying is that the CMR does not support querying by instance_information, so adding instance_information to the query params won't work. In fact, it will simply generate an error response. For example, the URL https://cmr.uat.earthdata.nasa.gov/search/variables?instance-information=true returns this response:
{
  "errors": [
    "Parameter [instance_information] was not recognized."
  ]
}
This means that the only way to perform filtering is after you get your query results back from the CMR. You cannot tell the CMR not to return variables that do or don't have associated instance_information.
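For callers who want to detect that kind of error body themselves, a small sketch follows. The function name is hypothetical, and it assumes only the response shape quoted above (a JSON object with an "errors" list):

```python
import json
from typing import List


def cmr_errors(body: str) -> List[str]:
    """Return the error messages found in a response body, or [] if there are none."""
    payload = json.loads(body)
    if isinstance(payload, dict):
        return list(payload.get("errors", []))
    return []


# The error body quoted above, as a JSON string.
body = '{"errors": ["Parameter [instance_information] was not recognized."]}'
errors = cmr_errors(body)
```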
However, if you mean instance-format, not instance-information, then sure, open a PR to add an instance_format method.
Sorry @chuckwondo, I meant instance_format :)
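A rough sketch of what such an instance_format method might look like, modeled on the short_name method quoted earlier. The class below is a hypothetical stand-in that only holds params, not the library's actual VariableQuery, and the params key name is an assumption:

```python
from typing import Dict


class VariableQuerySketch:
    """Hypothetical stand-in for VariableQuery; it only collects query params."""

    def __init__(self) -> None:
        self.params: Dict[str, str] = {}

    def instance_format(self, instance_format: str) -> "VariableQuerySketch":
        """
        Filter by instance format (for example "zarr").

        :param instance_format: format of the variable instance
        :returns: self
        """
        if not instance_format:
            return self

        # Assumed key name, following the short_name pattern.
        self.params["instance_format"] = instance_format
        return self


query = VariableQuerySketch().instance_format("zarr")
```

Returning self keeps the method chainable, in the same way short_name is.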
For the second option, maybe adding DOI to the searchable fields to find all the zarr stores from a particular data collection.
@liredell this would require DOI to be a searchable field for variables, which is a feature that would need to be requested from CMR.
At GES DISC we are getting ready to expose our production public zarr stores. Lots of open questions remain on how to make these zarr stores easily searchable, especially because we publish zarr stores at the variable, not the collection level.
In umm_json we are specifying zarr as the instance_information, example here: https://cmr.earthdata.nasa.gov/search/variables.umm_json?instance-format=zarr&provider=GES_DISC
I think python_cmr so far hasn't brought over many of the functionalities in the CMR API for VariableQuery, and I would like to add a piece of code to help make searching for variable zarr stores more intuitive to the user.
Any thoughts/feedback before I suggest a PR to add instance_information as an additional query parameter in VariableQuery?