mfisher87 opened 12 months ago
I talked with Daniel Crumly about this (does he have a GitHub account? trying to @ him :) ). Learned the following from that discussion:
These 0080v2 granules are duplicates that are cleaned up roughly weekly by a manual process. In our case, we want to ignore them. The v0.7.1 release enables us to do that on the client side. We also found we can pass `downloadable=True` to exclude these granules from query results.
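For reference, a minimal sketch of that server-side filter (assuming `search_data` forwards the `downloadable` keyword through to the underlying CMR granule query; the collection and temporal range are the ones from this thread):

```python
import earthaccess

# Sketch: ask CMR to exclude granules that have no direct-download URLs,
# rather than filtering them out client-side after the fact.
results = earthaccess.search_data(
    short_name="NSIDC-0080",
    version="2",
    temporal=("2023-10-21", "2023-11-08"),
    downloadable=True,
)
print(f"{len(results)} downloadable granules found")
```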
I still feel we need to communicate better with the user (both in logged messages and in API design) in this situation, so they can make good decisions without needing direct access to NSIDC staff to ask questions :)
Are there cases (for other datasets) where granules without RelatedUrls are not duplicates?
I agree that the reason for checking for RelatedUrls needs to be documented so that we do not have to remember why we needed to check this.
However, if all valid granules should have RelatedUrls and all granules without are duplicates and will be cleaned up, I don't think we need to notify users about these granules. Doing so may very likely confuse rather than clarify; it would lead me to think that there is data but I just can't get to it.
If there are cases where valid granules (not duplicates) do not have RelatedUrls, then users should be notified and provided with information about how to get to the data.
I would ask: how does Earthdata Search deal with these cases?
> However, if all valid granules should have RelatedUrls and all granules without are duplicates and will be cleaned up, I don't think we need to notify users about these granules. Doing so may very likely confuse rather than clarify; it would lead me to think that there is data but I just can't get to it.
Sounds like you're thinking we should filter out those granules by default, right? I like it. If we go that way, maybe we should provide a way for users to disable that behavior in case they're using earthaccess to get an accurate representation of CMR's true state: `include_granules_with_missing_links=True`?
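To make that concrete, here's a rough sketch of what default filtering plus an opt-out could look like; `include_granules_with_missing_links` is just the name proposed above, not an existing earthaccess parameter:

```python
from typing import Any

def filter_granules(
    granules: list[dict[str, Any]],
    include_granules_with_missing_links: bool = False,
) -> list[dict[str, Any]]:
    """Drop granules whose UMM-G record lacks RelatedUrls, unless the
    caller explicitly opts in to seeing CMR's true state."""
    if include_granules_with_missing_links:
        return granules
    # CMR umm_json granule records carry their metadata under the "umm" key.
    return [g for g in granules if g.get("umm", {}).get("RelatedUrls")]
```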
> I would ask: how does Earthdata Search deal with these cases?
That's a really good question. Maybe we can ask in #eosdis-edsc. Or, is the EDSC code open-source now? I think the last time I looked, it was on a private Earthdata Bitbucket.
I do think filtering those granules should be the default behaviour, assuming that those granules are "to be cleaned up".
I'm not sure if there's a way to tell the difference, honestly! I'll ask Daniel about this.
From Daniel (are these always duplicates or "to be cleaned up"?):
Not necessarily - it can also be data that is hidden from direct download access but is still orderable through the system. For ECS this is a reasonable distinction, but for cloud, where we don't currently have an ordering system, this may have a different meaning. The short version: if you're looking for direct download access and a granule doesn't have a URL, you won't be able to access it via that path, and `downloadable=true` is the right move. If you can use another access mechanism (e.g., EGI or COBI), then `downloadable=true` may filter out results that could be accessed another way. This will include duplicates, but can also include data that is hidden from direct download for other reasons (which are quite varied).
Is there anything in the CMR metadata to distinguish between these two cases? Something like:

```python
if not downloadable and alternative_access_mechanism:
    return dataset
```
Daniel:
Yes, you can also filter on associated services at the collection level. If you grab the collection record first, and sift for services that your application knows how to invoke, e.g.:
https://cmr.earthdata.nasa.gov/search/collections.umm_json?short_name=NSIDC-0080&version=2&service_type[]=EGI%20-%20No%20Processing
or
https://cmr.earthdata.nasa.gov/search/collections.umm_json?short_name=NSIDC-0080&version=2&service_name=NSIDC_ECS_DAAC_HTTPS_File_System
(Currently available types in ECS are `ESI`, `ECHO ORDERS`, and `EGI - No Processing`, but there are lots of other services/tools; the `NSIDC_ECS_DAAC_HTTPS_File_System` service is actually the one you're invoking when you do direct downloads, but this name would be different for different DAACs/download services, e.g., cloud.) If none of the services you're interested in talking to are available, then you know your only option is looking through the RelatedUrls to see if one of those is to your liking. So, to replicate the above in pseudocode, you could do something like:

```
if (https://cmr.earthdata.nasa.gov/search/collections.umm_json?short_name=NSIDC-0080&version=2&service_type[]=EGI%20-%20No%20Processing returns my collection) then
    pull the service record URL that I'm interested in and use my specialized
    knowledge to craft the API call for that type
    for each granule in
        https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=NSIDC-0080&version=2&bounding_box=-180.0,-90.0,180.0,0.0&temporal%5B%5D=2023-10-21T00:00:00Z,2023-11-08T00:00:00Z&readable_granule_name=*25km_*&options%5Breadable_granule_name%5D%5Bpattern%5D=true
    bundle up the GranuleURs into 2000-granule chunks
    for each 2000-granule chunk
        make a single asynchronous call to the API to place a request for the granules
else
    for each granule in
        https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=NSIDC-0080&version=2&bounding_box=-180.0,-90.0,180.0,0.0&temporal%5B%5D=2023-10-21T00:00:00Z,2023-11-08T00:00:00Z&readable_granule_name=*25km_*&options%5Breadable_granule_name%5D%5Bpattern%5D=true&downloadable=true
        download the granule files from the GET DATA typed and DIRECT DOWNLOAD subtyped RelatedUrls
```
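For illustration, here is that branch logic as a hedged Python sketch using plain `requests` against the public CMR endpoints above. The 2000-granule chunk size, service type, and query parameters come from Daniel's description; the actual asynchronous ordering call is service-specific and omitted:

```python
import requests

CMR = "https://cmr.earthdata.nasa.gov/search"

# 1. Check whether the collection offers a service we know how to invoke.
collections = requests.get(
    f"{CMR}/collections.umm_json",
    params={
        "short_name": "NSIDC-0080",
        "version": "2",
        "service_type[]": "EGI - No Processing",
    },
    timeout=30,
).json()

granule_params = {
    "short_name": "NSIDC-0080",
    "version": "2",
    "bounding_box": "-180.0,-90.0,180.0,0.0",
    "temporal[]": "2023-10-21T00:00:00Z,2023-11-08T00:00:00Z",
    "readable_granule_name": "*25km_*",
    "options[readable_granule_name][pattern]": "true",
    "page_size": 2000,
}

if collections["hits"]:
    # 2a. Orderable path: gather GranuleURs and submit them to the service
    #     in 2000-granule chunks (the async ordering call is not shown).
    granules = requests.get(
        f"{CMR}/granules.umm_json", params=granule_params, timeout=30
    ).json()["items"]
    urs = [g["umm"]["GranuleUR"] for g in granules]
    chunks = [urs[i : i + 2000] for i in range(0, len(urs), 2000)]
else:
    # 2b. Direct-download path: ask only for downloadable granules and
    #     pull their GET DATA / DIRECT DOWNLOAD RelatedUrls.
    granules = requests.get(
        f"{CMR}/granules.umm_json",
        params={**granule_params, "downloadable": "true"},
        timeout=30,
    ).json()["items"]
    urls = [
        u["URL"]
        for g in granules
        for u in g["umm"].get("RelatedUrls", [])
        if u.get("Type") == "GET DATA"
    ]
```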
Today, working with 0080v2, I found some granules without RelatedUrls, which caused an exception within `search_data()`. I released 0.7.1 to handle this case without failing, but we still should talk about user needs in this situation! I figure we should probably log a warning. Or is this always a case where CMR results are wrong and shouldn't be used? In the case of 0080, it seems to me like something is wrong with the CMR ingest process. In which case, we should probably error but provide an escape hatch like `no_links_ok=True`. Any thoughts?
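A sketch of what that escape hatch might look like; `no_links_ok` is a hypothetical parameter, not an existing earthaccess option:

```python
import logging

logger = logging.getLogger("earthaccess")

def drop_granules_without_links(granules, no_links_ok: bool = False):
    """Error on granules missing RelatedUrls unless the caller opts out."""
    missing = [g for g in granules if not g.get("umm", {}).get("RelatedUrls")]
    if missing and not no_links_ok:
        raise ValueError(
            f"{len(missing)} granules have no RelatedUrls; this may indicate a "
            "CMR ingest problem. Pass no_links_ok=True to skip them instead."
        )
    if missing:
        logger.warning("Skipping %d granules without RelatedUrls", len(missing))
    return [g for g in granules if g.get("umm", {}).get("RelatedUrls")]
```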