nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Give users information/control over granules without RelatedUrls #344

Open mfisher87 opened 12 months ago

mfisher87 commented 12 months ago

Today, working with 0080v2, I found some granules without RelatedUrls, which caused an exception within search_data(). I released 0.7.1 to handle this case without failing, but we should still talk about user needs in this situation!

I figure we should probably log a warning. Or is this always a case where the CMR results are wrong and shouldn't be used? In the case of 0080, it seems to me like something is wrong with the CMR ingest process, in which case we should probably raise an error but provide an escape hatch like no_links_ok=True.
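
For illustration, a minimal sketch of that escape hatch (no_links_ok is hypothetical and not part of the current earthaccess API):

import earthaccess

# Hypothetical sketch: by default, granules missing RelatedUrls would raise an
# error; the (made-up) no_links_ok flag would keep them in the results instead.
results = earthaccess.search_data(
    short_name="NSIDC-0080",
    version="2",
    no_links_ok=True,
)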

Any thoughts?

MattF-NSIDC commented 12 months ago

I talked with Daniel Crumly about this (does he have a GitHub account? I'm trying to @ him :) ) and learned the following from that discussion:

These 0080v2 granules are duplicate granules that are cleaned up roughly weekly by a manual process. In our case, we want to ignore them. The v0.7.1 release enables us to do that on the client side. We also found we can pass downloadable=True to exclude these granules from query results.
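
For reference, a sketch of that query-side filter, assuming search_data forwards downloadable to the underlying CMR granule query like its other CMR parameters:

import earthaccess

# Sketch (assumes `downloadable` is passed through to CMR): CMR then excludes
# granules that have no direct-download URL, e.g. the duplicate 0080v2 granules.
results = earthaccess.search_data(
    short_name="NSIDC-0080",
    version="2",
    downloadable=True,
)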

I still feel we need to communicate better with the user (both in logged messages and in API design) in this situation so they can make good decisions without needing to ask NSIDC employees directly :)

andypbarrett commented 12 months ago

Are there cases (for other datasets) where granules without RelatedUrls are not duplicates?

I agree that the reason for checking for RelatedUrls needs to be documented, so that we don't have to remember why the check exists.

However, if all valid granules should have RelatedUrls, and all granules without them are duplicates that will be cleaned up, I don't think we need to notify users about these granules. Doing so would very likely confuse rather than clarify; it would lead me to think that there is data but I just can't get to it.

If there are cases where valid granules (not duplicates) do not have RelatedUrls, then users should be notified and provided with information about how to get to the data.

I would ask: how does Earthdata Search deal with these cases?

MattF-NSIDC commented 12 months ago

> However, if all valid granules should have RelatedUrls, and all granules without them are duplicates that will be cleaned up, I don't think we need to notify users about these granules. Doing so would very likely confuse rather than clarify; it would lead me to think that there is data but I just can't get to it.

Sounds like you're thinking we should filter out those granules by default, right? I like it. If we go that way, maybe we should provide a way for users to disable that behavior in case they're using earthaccess to get an accurate representation of CMR's true state. include_granules_with_missing_links=True?
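
A sketch of that default-plus-opt-out behavior (the flag name is the proposal above; accessing the UMM-G record through an "umm" key is an assumption about the result objects):

def filter_granules(granules, include_granules_with_missing_links=False):
    # Proposed default: drop granules that have no RelatedUrls. The flag
    # restores CMR's unfiltered view. Assumes each granule is a dict-like
    # object exposing its UMM-G record under an "umm" key.
    if include_granules_with_missing_links:
        return granules
    return [g for g in granules if g["umm"].get("RelatedUrls")]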

> I would ask: how does Earthdata Search deal with these cases?

That's a really good question. Maybe we can ask in #eosdis-edsc. Or is the EDSC code open source now? I think the last time I looked, it was in a private earthdata Bitbucket.

andypbarrett commented 12 months ago

I do think filtering out those granules should be the default behaviour, assuming that those granules are "to be cleaned up".

MattF-NSIDC commented 12 months ago

I'm not sure if there's a way to tell the difference, honestly! I'll ask Daniel about this.

MattF-NSIDC commented 12 months ago

From Daniel (are these always duplicates or "to be cleaned up"?):

Not necessarily - it can also be data that is hidden from direct download access, but is still orderable through the system. For ECS, this is a reasonable distinction, but for cloud, where we don't currently have an ordering system, this may have a different meaning. For the short version - if you're looking for direct download access and a granule doesn't have a URL, that means you won't be able to access it via that path, and downloadable=true is the right move. If you can use another access mechanism (e.g., EGI or COBI), then downloadable=true may filter out results that could be accessed another way. This will include duplicates, but can include data that is hidden from direct download for other reasons (which are quite varied).
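
To see how many granules that filter hides, one could compare raw CMR hit counts with and without it (a sketch against the public CMR endpoint; page_size=0 returns only the hit count):

import requests

# Sketch: compare the total granule count against the downloadable-only count.
url = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"
base = {"short_name": "NSIDC-0080", "version": "2", "page_size": 0}
all_hits = requests.get(url, params=base).json()["hits"]
dl_hits = requests.get(url, params={**base, "downloadable": "true"}).json()["hits"]
print(f"{all_hits - dl_hits} granules have no direct-download URL")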

andypbarrett commented 12 months ago

Is there anything in the CMR metadata to distinguish between these two cases? Something like:

# Pseudocode: keep a granule that isn't directly downloadable but is
# reachable through another access mechanism (e.g., ordering)
if not downloadable and alternative_access_mechanism:
    return dataset
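
A slightly more concrete sketch of that check (the service types come from Daniel's list below; the "umm" key access and the collection_service_types argument are assumptions, not existing earthaccess API):

# Sketch: keep a granule if it's directly downloadable, or if its collection
# advertises a service we know how to invoke for ordering.
ORDERABLE_SERVICE_TYPES = {"ESI", "ECHO ORDERS", "EGI - No Processing"}

def is_accessible(granule, collection_service_types):
    downloadable = bool(granule["umm"].get("RelatedUrls"))
    orderable = bool(ORDERABLE_SERVICE_TYPES & set(collection_service_types))
    return downloadable or orderable
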
MattF-NSIDC commented 12 months ago

Daniel:

Yes, you can also filter on associated services at the collection level. If you grab the collection record first and sift for services that your application knows how to invoke, e.g.:

https://cmr.earthdata.nasa.gov/search/collections.umm_json?short_name=NSIDC-0080&version=2&service_type[]=EGI%20-%20No%20Processing

or

https://cmr.earthdata.nasa.gov/search/collections.umm_json?short_name=NSIDC-0080&version=2&service_name=NSIDC_ECS_DAAC_HTTPS_File_System

(currently available types in ECS are ESI, ECHO ORDERS, and EGI - No Processing, but there are lots of other services/tools; the NSIDC_ECS_DAAC_HTTPS_File_System service is actually the one you're invoking when you do direct downloads, but this name would be different for different DAACs/download services, e.g., cloud). If none of the services you're interested in talking to are available, then you know your only option is looking for the RelatedUrls to see if one of those is to your liking. So, to replicate the above in pseudocode, you could do something like:

if (https://cmr.earthdata.nasa.gov/search/collections.umm_json?short_name=NSIDC-0080&version=2&service_type[]=EGI%20-%20No%20Processing returns my collection)
then
    pull the service record URL that I'm interested in and use my specialized knowledge to craft the API call for that type
    for each granule in https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=NSIDC-0080&version=2&bounding_box=-180.0,-90.0,180.0,0.0&temporal%5B%5D=2023-10-21T00:00:00Z,2023-11-08T00:00:00Z&readable_granule_name=*25km_*&options%5Breadable_granule_name%5D%5Bpattern%5D=true
        bundle up the GranuleURs into 2000-granule chunks
    for each 2000-granule chunk
        make a single asynchronous call to the API to place a request for the granules
else
    for each granule in https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=NSIDC-0080&version=2&bounding_box=-180.0,-90.0,180.0,0.0&temporal%5B%5D=2023-10-21T00:00:00Z,2023-11-08T00:00:00Z&readable_granule_name=*25km_*&options%5Breadable_granule_name%5D%5Bpattern%5D=true&downloadable=true
        download the granule files from the GET DATA typed and DIRECT DOWNLOAD subtyped RelatedUrls
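
A rough Python translation of that pseudocode, as a sketch only: the EGI order call itself is service-specific and stubbed out, and pagination past the first 2000 granules is omitted.

import requests

CMR = "https://cmr.earthdata.nasa.gov/search"
GRANULE_PARAMS = {
    "short_name": "NSIDC-0080",
    "version": "2",
    "bounding_box": "-180.0,-90.0,180.0,0.0",
    "temporal[]": "2023-10-21T00:00:00Z,2023-11-08T00:00:00Z",
    "readable_granule_name": "*25km_*",
    "options[readable_granule_name][pattern]": "true",
    "page_size": 2000,  # CMR's maximum, matching the 2000-granule chunks above
}

def collection_has_service(service_type):
    # Does the collection advertise a service of the given type?
    resp = requests.get(
        f"{CMR}/collections.umm_json",
        params={"short_name": "NSIDC-0080", "version": "2", "service_type[]": service_type},
    )
    resp.raise_for_status()
    return resp.json()["hits"] > 0

def search_granules(extra_params=None):
    resp = requests.get(f"{CMR}/granules.umm_json", params={**GRANULE_PARAMS, **(extra_params or {})})
    resp.raise_for_status()
    return resp.json()["items"]

if collection_has_service("EGI - No Processing"):
    # Bundle GranuleURs into 2000-granule chunks and place one asynchronous
    # order per chunk. The EGI request itself is service-specific; stubbed here.
    urs = [g["umm"]["GranuleUR"] for g in search_granules()]
    for i in range(0, len(urs), 2000):
        chunk = urs[i : i + 2000]
        ...  # craft and submit the EGI order for this chunk
else:
    # Fall back to direct download: downloadable granules only, and only their
    # "GET DATA" typed / "DIRECT DOWNLOAD" subtyped RelatedUrls.
    for granule in search_granules({"downloadable": "true"}):
        for related_url in granule["umm"].get("RelatedUrls", []):
            if related_url.get("Type") == "GET DATA" and related_url.get("Subtype") == "DIRECT DOWNLOAD":
                print("would download:", related_url["URL"])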