nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Error downloading S3 file from SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205 dataset #349

Open jrbourbeau opened 9 months ago

jrbourbeau commented 9 months ago

Downloading HTTPS files with

import earthaccess

results = earthaccess.search_data(
    short_name="SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205", count=10
)
files = results[0].data_links(access="indirect")
out = earthaccess.download(files, "testdownload")
print(f"{out = }")

works as expected with output

Granules found: 2207
QUEUEING TASKS | : 100%|██████████| 1/1 [00:00<00:00, 1156.09it/s]
PROCESSING TASKS | : 100%|██████████| 1/1 [00:03<00:00,  3.45s/it]
COLLECTING RESULTS | : 100%|██████████| 1/1 [00:00<00:00, 19972.88it/s]
out = ['testdownload/ssh_grids_v2205_1992101012.nc']

When I then switch to downloading S3 URLs directly (making sure to run in us-west-2 too)

import earthaccess

results = earthaccess.search_data(
    short_name="SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205", count=10
)
files = results[0].data_links(access="direct")
out = earthaccess.download(files, "testdownload")
print(f"{out = }")

I get the following output (and no files downloaded):

Granules found: 2207
earthaccess can't yet guess the provider for cloud collections, we need to use one from earthaccess.list_cloud_providers()
out = None

it looks like we're having trouble inferring the data provider, so if I then manually specify provider="PODAAC"

import earthaccess

results = earthaccess.search_data(
    short_name="SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205", count=10
)
files = results[0].data_links(access="direct")
out = earthaccess.download(files, "testdownload", provider="PODAAC")
print(f"{out = }")

I get an error

Granules found: 2207
Accessing cloud dataset using provider: PODAAC
Credentials for the cloud provider None are not available
Traceback (most recent call last):
  File "/scratch/test_download.py", line 4, in <module>
    out = earthaccess.download(files, "testdownload", provider="PODAAC")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/earthaccess/api.py", line 176, in download
    results = earthaccess.__store__.get(granules, local_path, provider, threads)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/earthaccess/store.py", line 463, in get
    files = self._get(granules, local_path, provider, threads)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/earthaccess/store.py", line 515, in _get_urls
    s3_fs = self.get_s3fs_session(provider=provider)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/earthaccess/store.py", line 226, in get_s3fs_session
    key=s3_credentials["accessKeyId"],
        ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'accessKeyId'
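
The traceback suggests the credentials returned for "PODAAC" had no accessKeyId entry, so the lookup failed with a bare KeyError. A minimal sketch of a clearer failure mode, assuming the credentials arrive as a plain dict. The helper name require_s3_credentials and every key name other than accessKeyId are assumptions for illustration, not part of earthaccess:

```python
from typing import Mapping

# "accessKeyId" appears in the traceback; the other two key names are assumed.
REQUIRED_KEYS = ("accessKeyId", "secretAccessKey", "sessionToken")


def require_s3_credentials(
    credentials: Mapping[str, str], provider: str
) -> Mapping[str, str]:
    """Return the credentials unchanged, or raise a descriptive error."""
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise ValueError(
            f"No usable S3 credentials for provider {provider!r} "
            f"(missing: {', '.join(missing)}). "
            "Check that the name refers to a cloud provider."
        )
    return credentials
```

Calling such a check before building the S3 session would name the provider in the error instead of surfacing only the missing key.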
jrbourbeau commented 9 months ago

Okay, so this is maybe user error. If I use provider="POCLOUD" instead of provider="PODAAC", then the S3 download case works. I was able to figure this out by looking at the list of DAACs defined here

https://github.com/nsidc/earthaccess/blob/7db2e59fb76d9eea87a343bbf2af505a57c43e10/earthaccess/daac.py#L6

Naively I don't really care whether the provider is "PODAAC" or "POCLOUD". My natural thought is "This data is coming from the PODAAC data center, earthaccess can you please download it for me?". I'm also new to this space, so maybe most people find the "PODAAC" vs. "POCLOUD" distinction clear and obvious. I'm curious to get thoughts from others who are more familiar with these things, and the earthaccess target user base, than I am.

jhkennedy commented 9 months ago

@jrbourbeau I work at a DAAC and also find this confusing!

CMR itself doesn't really have a concept of "organization", what we naturally think of when we see a name like "PODAAC". CMR, instead, just has a concept of "providers", where each organization may have one or more providers:

I think that before the cloud transition, "provider" was essentially synonymous with "organization," so all the older (and therefore on-prem) data is typically provided by a provider with the same name as the DAAC. Some^ (most?) DAACs decided to add a second provider to represent their cloud collections, and we're stuck with confusion between organization and provider, where "provider" is really overloaded to mean "on prem" vs. "cloud" at this point.

All that's background, but it would be nice to handle org -> provider more elegantly for users and to better infer which provider is appropriate for a dataset/use case.


^ At ASF we only have one provider, and we use a "datapool" app to abstract data location in the CMR records, so you have to actually inspect the records for S3 info to determine if it's in the cloud, or just "know". This has a litany of its own problems but does mean we avoid the above confusion.

jrbourbeau commented 9 months ago

Thanks for the extra context @jhkennedy, that's useful.

Do DAACs have multiple cloud or on-prem-providers (e.g. "cloud-providers": ["POCLOUD", "POCLOUD2", "POCLOUD_for_real"])? I'm wondering if users can just specify the org (e.g. "PODAAC") and then earthaccess can dispatch to the on-prem or cloud provider accordingly.

jhkennedy commented 9 months ago

Do DAACs have multiple cloud or on-prem-providers (e.g. "cloud-providers": ["POCLOUD", "POCLOUD2", "POCLOUD_for_real"])?

Right now, not that I am aware of but I haven't inventoried them. There is definitely nothing preventing them from doing so.

I'm wondering if users can just specify the org (e.g. "PODAAC") and then earthaccess can dispatch to the on-prem or cloud provider accordingly.

Yeah, this would be nice, esp. since DAACs are generally organized by subject-area. @betolink thoughts here?
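
A rough sketch of what that dispatch could look like, assuming a hand-maintained table. ORG_PROVIDERS and resolve_providers are hypothetical names, and only the PODAAC/POCLOUD and ASF entries come from this thread. It returns a list because nothing prevents an organization from having several providers of one kind:

```python
# Hypothetical table; only the PODAAC/POCLOUD and ASF entries come from this thread.
ORG_PROVIDERS = {
    "PODAAC": {"on_prem": ["PODAAC"], "cloud": ["POCLOUD"]},
    "ASF": {"on_prem": ["ASF"], "cloud": ["ASF"]},  # single provider for both
}


def resolve_providers(name: str, cloud: bool) -> list[str]:
    """Map an organization (or provider) name to candidate CMR providers."""
    key = name.upper()
    kind = "cloud" if cloud else "on_prem"
    if key in ORG_PROVIDERS:
        return list(ORG_PROVIDERS[key][kind])
    # The name may already be a provider ID (e.g. "POCLOUD"); pass it through.
    for providers in ORG_PROVIDERS.values():
        if key in providers["on_prem"] or key in providers["cloud"]:
            return [key]
    raise ValueError(f"Unknown organization or provider: {name!r}")
```

With this, resolve_providers("PODAAC", cloud=True) yields ["POCLOUD"], so a user could pass the organization name and let the library choose.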

betolink commented 9 months ago

When we deal with the full record, earthaccess has some hacky code to derive the providers based on the DAAC. Perhaps we need to build a method to derive the provider from URL signatures... e.g.

s3://nsidc-cumulus-prod-public/ATLAS/ATL03/006/2018/10/14/ATL03_20181014000347_02350101_006_02_BRW.h5

We can see that this is nsidc, just by pattern matching.

s3://podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205/ssh_grids_v2205_1992101512.nc

same here.

Perhaps just a dictionary of S3 bucket -> provider endpoint would work here? @jhkennedy @jrbourbeau @MattF-NSIDC
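
Something along these lines, as a sketch only. The bucket prefixes are taken from the two URLs above; the table, the function name provider_from_s3_url, and "NSIDC_CPRD" as the NSIDC cloud provider ID are assumptions:

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical prefix table. The bucket names come from the URLs above;
# "NSIDC_CPRD" as the NSIDC cloud provider ID is an assumption.
BUCKET_PREFIX_TO_PROVIDER = {
    "podaac-": "POCLOUD",
    "nsidc-cumulus-": "NSIDC_CPRD",
}


def provider_from_s3_url(url: str) -> Optional[str]:
    """Guess the provider from the bucket name, or return None if unknown."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        return None
    bucket = parsed.netloc  # for s3://bucket/key, the bucket is the netloc
    for prefix, provider in BUCKET_PREFIX_TO_PROVIDER.items():
        if bucket.startswith(prefix):
            return provider
    return None
```

Returning None for unrecognized buckets leaves room to fall back to an explicit provider argument.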

mfisher87 commented 9 months ago

I'm concerned that could be fragile. If that's an implementation detail and not a well-documented interface, it could change from under us.

Even if it is well-documented and considered "part of the public API" for the purpose of versioning and deprecation announcements, it feels like we should be able to answer this question by a more explicit CMR query so we don't need special logic that knows about URL structure.

:wave: "Hey CMR, what provider provides 'direct' access to {short_name}?" :wave:

I don't have the time today to dive into the CMR docs and research this; I just feel that's something it should be able to answer in a perfect world ;) Then earthaccess wouldn't have to store info about DAACs and which providers map to which DAAC.
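
For illustration only, here is roughly what such a query might look like. This is an unverified sketch: it assumes the CMR collection search accepts short_name and cloud_hosted parameters and that each item in the umm_json response carries the provider under meta["provider-id"]. All function names are hypothetical:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and response format; check against the CMR docs before relying on it.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def cloud_provider_query_url(short_name: str) -> str:
    """Build a collection search URL restricted to cloud-hosted collections."""
    params = {"short_name": short_name, "cloud_hosted": "true"}
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


def providers_from_response(payload: dict) -> list[str]:
    """Collect distinct provider IDs, assuming items[].meta["provider-id"]."""
    providers: list[str] = []
    for item in payload.get("items", []):
        provider = item.get("meta", {}).get("provider-id")
        if provider and provider not in providers:
            providers.append(provider)
    return providers


def cloud_providers_for(short_name: str) -> list[str]:
    """Run the query (needs network access) and return the provider IDs."""
    with urlopen(cloud_provider_query_url(short_name), timeout=30) as response:
        return providers_from_response(json.load(response))
```

If the assumptions hold, the provider would come from CMR at request time rather than from a table stored in earthaccess.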