nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License
388 stars, 75 forks

Enhance search_data in EarthAccess to Include Associated XML Paths #367

Open emanueleromito opened 9 months ago

emanueleromito commented 9 months ago

I'm currently using the earthaccess library to access MODIS data in my project. In my workflow, I use both HDF paths and the XML paths associated with the HDF files. However, when I use the search_data function from the library, the results only provide the HDF paths.

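import earthaccess

# `provider` is assumed to be defined earlier (not shown here)
results = earthaccess.search_data(
    provider=provider,
    short_name='MCD12Q1',
    count=10
)

# Look at the data links of the first granule in the results
granule = results[0]
uri = granule.data_links()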

And what I get is: ['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MCD12Q1.061/MCD12Q1.A2001001.h01v09.061.2022146025902/MCD12Q1.A2001001.h01v09.061.2022146025902.hdf']

This is certainly fine, but it would be nice to have an option that also gives access to the related .xml file, or at least the ability to download that file by passing the DataGranule that corresponds to the HDF file.

mfisher87 commented 9 months ago

Thanks for the report!

I think it'd be awesome if we provided an easily-accessible escape hatch to view the raw CMR results that earthaccess queried for situations like this where our assumptions don't line up with end-users' use cases. Without the escape hatch, users have to wait to use earthaccess until we adapt to support their use case. With the escape hatch, they can begin using earthaccess with a minor "hack" and later on remove it when we support their use case fully.

What do you all think? I don't think we currently support this, but maybe we do. I just didn't find it in the docs and am not planning on source diving today :)

I'm thinking the implementation might be DataGranule having a .raw or .cmr_json attribute/property that contains the parsed JSON from CMR for that granule. Same for collections!
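
A minimal sketch of what such a property could look like, assuming a granule is a dict-like wrapper around the parsed CMR JSON. This is not earthaccess's actual implementation: `GranuleSketch` and `cmr_json` are hypothetical names, and the sample record is made up for illustration.

```python
import copy
from typing import Any, Dict


class GranuleSketch(dict):
    """Hypothetical stand-in for a granule result; not the earthaccess DataGranule class."""

    @property
    def cmr_json(self) -> Dict[str, Any]:
        """Return a deep copy of the parsed CMR record as a plain dict.

        Copying means callers can poke at the raw record without mutating the result.
        """
        return copy.deepcopy(dict(self))


# Made-up record shaped like a CMR search result ("meta" and "umm" keys).
sample = GranuleSketch(
    meta={"concept-id": "G0000000000-EXAMPLE"},
    umm={"RelatedUrls": [{"URL": "https://example.com/granule.hdf", "Type": "GET DATA"}]},
)

raw = sample.cmr_json
print(sorted(raw))  # ['meta', 'umm']
```

Returning a copy rather than the underlying dict is one possible design choice; returning the dict itself would be cheaper but would let users modify search results in place.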

betolink commented 9 months ago

Hi @emanueleromito,

All that information is still available in the results; earthaccess is only exposing part of it. To get to the XML companion files we can do something like this:

import earthaccess

earthaccess.login()

results = earthaccess.search_data(
    short_name="MCD12Q1",
    count=10
)

for granule in results:
    print(granule["umm"]["RelatedUrls"])

All the granules have "meta" and "umm" dictionaries with all the data we need. If you only want the XML and HDF files, we can filter the links and download them with:

links = []
for granule in results:
    # Keep only HTTPS links that point to .xml or .hdf files
    urls = [
        link["URL"]
        for link in granule["umm"]["RelatedUrls"]
        if link["URL"].startswith("https") and link["URL"].endswith((".xml", ".hdf"))
    ]
    links.extend(urls)

earthaccess.download(links, "./MCD12Q1")

And that's it. Let us know if this works for you.
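
The filtering step could also be wrapped in a small reusable helper. The following is only a sketch: `related_urls_by_suffix` is a hypothetical name (not part of earthaccess), it assumes each result is dict-like with a "umm" -> "RelatedUrls" list as in the snippet above, and the records here are made up so that it runs without logging in.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def related_urls_by_suffix(
    granules: Iterable[Mapping[str, Any]],
    suffixes: Tuple[str, ...] = (".xml", ".hdf"),
    scheme: str = "https",
) -> List[str]:
    """Collect RelatedUrls entries whose URL starts with `scheme` and ends with one of `suffixes`."""
    links: List[str] = []
    for granule in granules:
        # Tolerate records without a RelatedUrls list instead of raising KeyError.
        for link in granule.get("umm", {}).get("RelatedUrls", []):
            url = link.get("URL", "")
            if url.startswith(scheme) and url.endswith(suffixes):
                links.append(url)
    return links


# Made-up records for illustration; real ones would come from earthaccess.search_data().
fake_results = [
    {"umm": {"RelatedUrls": [
        {"URL": "https://example.com/a.hdf"},
        {"URL": "s3://example-bucket/a.hdf"},
        {"URL": "https://example.com/a.hdf.xml"},
        {"URL": "https://example.com/a.jpg"},
    ]}},
    {"umm": {}},  # a record with no RelatedUrls is skipped
]

print(related_urls_by_suffix(fake_results))
```

With real search results, the returned list could then be passed to earthaccess.download(links, "./MCD12Q1") as shown above.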

MattF-NSIDC commented 9 months ago

All the granules have "meta" and "umm" dictionaries with all the data we need.

Awesome! This does appear to be undocumented. Or perhaps a limitation of search. I'm thinking we could use a how-to on this. Or perhaps we should expose those as properties that will be picked up by our API autodoc setup? Or both. #368