nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Downloads for AU_SI12_NRT_R04 incorrect #307

Closed trey-stafford closed 19 hours ago

trey-stafford commented 1 year ago

I am trying to download granules of AU_SI12_NRT_R04 using earthaccess.download but the results are incorrect. Files are created on disk but they do not seem to contain the data.

import earthaccess

results = earthaccess.search_data(short_name='AU_SI12_NRT_R04')
results = sorted(results, key=lambda x: x['meta']['revision-date'], reverse=True)
earthaccess.login()
files = earthaccess.download(results, "/tmp/test")
Granules found: 14
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 11/25/2023
Using environment variables for EDL
 Getting 14 granules, approx download size: 0.0 GB
QUEUEING TASKS | : 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 2188.52it/s]
PROCESSING TASKS | : 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.15it/s]
COLLECTING RESULTS | : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 160877.41it/s]

This results in files of ~4.1K in size in the indicated /tmp/test directory. I expect files ~126M in size:

$ ls -lah /tmp/test/
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_P04_20230926.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230913.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230914.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230915.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230916.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230917.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230918.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230919.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230920.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230921.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230922.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230923.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230924.he5
-rw-rw-r--  1 trst2284 trst2284 4.1K Sep 26 13:15 AMSR_U2_L3_SeaIce12km_R04_20230925.he5
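
A quick way to tell whether files like these are real data is to look at their first bytes. The sketch below is not part of earthaccess; `classify_download` is a hypothetical helper. It assumes the `.he5` granules are ordinary HDF5 files whose 8-byte signature sits at offset 0, and it treats anything starting with an HTML doctype or `<html` tag as a web page.

```python
from pathlib import Path

# Signature that begins a standard HDF5 file (HDF-EOS5 ".he5" files are HDF5).
# Assumption: the signature is at offset 0, which is the common case.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def classify_download(path):
    """Return 'hdf5', 'html', or 'unknown' based on the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


if __name__ == "__main__":
    # Report the size and detected type of every granule in the download directory.
    for p in sorted(Path("/tmp/test").glob("*.he5")):
        print(p.name, p.stat().st_size, classify_download(p))
```

A result of `html` for a ~4.1K file would point at a login or error page having been saved in place of the granule.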

I haven't dug into this very deeply yet, but I found the code in earthaccess.store that is responsible for downloading files and set a breakpoint here:

(Pdb) url
'https://lance.nsstc.nasa.gov/amsr2-science/data/level3/seaice12/R04/hdfeos5/AMSR_U2_L3_SeaIce12km_R04_20230923.he5'
(Pdb) pp r.raw
<urllib3.response.HTTPResponse object at 0x7fd661351120>
(Pdb) pp r.content
(b'<!DOCTYPE html>\n<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7'
 b'"> <![endif]-->\n<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8"> <![endi'
 b'f]-->\n<!--[if IE 8]><html class="no-js lt-ie9"> <![endif]-->\n<!--[if gt '
 b'IE 8]><!--><html lang="en" class="no-js"><!--<![endif]-->\n  <head>\n    <'
 b'meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE'
 b'=edge,chrome=1">\n    <title>Earthdata Login</title>\n    <meta name="desc'
 b'ription" content="Earthdata Login">\n    <meta name="viewport" content="w'
 b'idth=device-width, initial-scale=1.0">\n\n    <!-- Google Tag Manager -->\n'
 b"    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push(\n\n      {'gtm.s"
 b"tart': new Date().getTime(),event:'gtm.js'}\n\n    );var f=d.getElementsBy"
 b"TagName(s)[0],\n      j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j"
 b".async=true;j.src=\n      'https://www.googletagmanager.com/gtm.js?id='+i"
 b"+dl;f.parentNode.insertBefore(j,f);\n    })(window,document,'script','dat"
 b"aLayer','GTM-WNP7MLF');</script>\n    <!-- End Google Tag Manager -->\n\n  "
 b'  <link href="https://cdn.earthdata.nasa.gov/eui/1.1.3/stylesheets/applicati'
 b'on.css" rel="stylesheet" />\n    <link rel="stylesheet" href="/assets/app'
 b'lication-432b3917d4a41042c0fd963eba859548ef2993f5ed7a0dca4bdb446fdf807556.cs'
 b's" media="all" />\n    <!--[if IE 7]>\n      <link rel="stylesheet" href="'
 b'/assets/font-awesome-ie7.min.css">\n    <![endif]-->\n    <link href="//ne'
 b'tdna.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="styl'
 b'esheet">\n    <link href=\'https://fonts.googleapis.com/css?family=Source+'
 b'Sans+Pro:300,700\' rel=\'stylesheet\' type=\'text/css\'>\n    <meta name="'
 b'csrf-param" content="authenticity_token" />\n<meta name="csrf-token" cont'
 b'ent="n8PZFKi4E7nfgviamfqUIY_0XanXV1LhoF9ZAbNt1-MUe2mYXetV1EKdM1sGW6RZFUZHiDY'
 b'PycYyLZO6XSE6tg" />\n    \n\n    <!-- Grid background: http://subtlepattern'
 b's.com/graphy/ -->\n  </head>\n  <body class="oauth authorize" data-turboli'
 b'nks-eval=false>\n\n    <!-- Google Tag Manager (noscript) -->\n    <noscrip'
 b't>\n      <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-WN'
 b'P7MLF"\n                    height="0" width="0" style="display:none;visi'
 b'bility:hidden"></iframe>\n    </noscript>\n    <!-- End Google Tag Manager'
 b' (noscript) -->\n\n    <header id="earthdata-tophat2" style="height: 32px;'
 b'"></header>\n    <!--[if lt IE 7]>\n      <p class="chromeframe">You are u'
 b'sing an <strong>outdated</strong> browser. Please <a href="http://browsehapp'
 b'y.com/">upgrade your browser</a> or <a href="http://www.google.com/chromefra'
 b'me/?redirect=true">activate Google Chrome Frame</a> to improve your experien'
 b'ce.</p>\n    <![endif]-->\n    <div class="container">\n      <header role='
 b'"banner">\n  <div id="masthead-logo">\n    <h1><a class="ir" href="/">Eart'
 b'hdata Login</a></h1>\n    <span class="eui-badge badge daac">Earthdata Lo'
 b'gin</span>\n  </div>\n  <a id="hamburger" href="#"><img title="Mobile Menu'
 b'" alt="Three horizontal lines stacked" src="/assets/hamburger-68c8505066427f'
 b'3e3f6ee40b24cfd3c9f7c0fe93ee298b9046564637262115fa.png" /></a>\n  <nav ro'
 b'le="navigation" class="masthead">\n\n    <div id="hide">\n      <ul>\n      '
 b'  <li><strong><a href="/documentation">Documentation</a></strong></li>\n '
 b'     </ul>\n    </div>\n  </nav>\n</header>\n\n      \n\n\n\n\n\n\n\n    '
 b'  <section id="callout-login">\n  <div class="client-login">\n    <img cla'
 b'ss="client-image" border="1" src="/app_image_image/19071" />\n    <br>\n  '
 b'  <h3 class="client-description">\n      \n    </h3>\n\n  </div>\n  <form'
 b' id="login" action="/login" accept-charset="UTF-8" method="post"><input name'
 b'="utf8" type="hidden" value="&#x2713;" autocomplete="off" /><input type="hid'
 b'den" name="authenticity_token" value="A4tAYbi8xu0b3pfSWilc6te24QugNelEl0w-xr'
 b'qa-1uIM_DtTe-AgIbBXBPFiGySTQT7KkFtcmMFPvR9VNYWDg" autocomplete="off" />\n'
 b'  <p><label for="username">Username</label><i class="fa fa-question-circle f'
 b'a-question-circle--blue user-name" title="Login using either your Username o'
 b'r Email Address"></i><input type="text" name="username" id="username" autofo'
 b'cus="autofocus" class="default" /></p>\n  <p><label for="password">Passwo'
 b'rd</label><br /><input type="password" name="password" id="password" autocom'
 b'plete="off" /></p>\n\n  <p><input type="hidden" name="client_id" id="clien'
 b't_id" value="mACp-6quKkkPZ3FiVl2Rng" autocomplete="off" /></p>\n  <p><inp'
 b'ut type="hidden" name="redirect_uri" id="redirect_uri" value="https://lance.'
 b'itsc.uah.edu/urs-redirect" autocomplete="off" /></p>      <p><input type="hi'
 b'dden" name="response_type" id="response_type" value="code" autocomplete="off'
 b'" /></p>\n      <p><input type="hidden" name="state" id="state" value="aH'
 b'R0cHM6Ly9sYW5jZS5pdHNjLnVhaC5lZHUvYW1zcjItc2NpZW5jZS9kYXRhL2xldmVsMy9zZWFpY2'
 b'UxMi9SMDQvaGRmZW9zNS9BTVNSX1UyX0wzX1NlYUljZTEya21fUjA0XzIwMjMwOTIzLmhlNQ" au'
 b'tocomplete="off" /></p>\n      <p><input type="checkbox" name="stay_in" i'
 b'd="stay_in" value="1" checked="checked" /> <label for="stay_in">Stay signed '
 b'in (this is a private workstation)</label></p>\n\n  <p class="button-with-'
 b'notes">\n    <input type="submit" name="commit" value="Log in" class="eui'
 b'-btn--round eui-btn--green" data-disable-with="Log in" />\n    <a class="'
 b'eui-btn--round eui-btn--blue" href="/users/new?client_id=mACp-6quKkkPZ3FiVl2'
 b'Rng&amp;redirect_uri=https%3A%2F%2Flance.itsc.uah.edu%2Furs-redirect&amp;res'
 b'ponse_type=code&amp;state=aHR0cHM6Ly9sYW5jZS5pdHNjLnVhaC5lZHUvYW1zcjItc2NpZW'
 b'5jZS9kYXRhL2xldmVsMy9zZWFpY2UxMi9SMDQvaGRmZW9zNS9BTVNSX1UyX0wzX1NlYUljZTEya2'
 b'1fUjA0XzIwMjMwOTIzLmhlNQ">Register</a>\n  </p>\n  <p class="form-instructi'
 b'ons">\n    <em class="icon-question-sign"></em>\n    <a class="" href="/re'
 b'trieve_info">I don&rsquo;t remember my username</a>\n    <br /><em class='
 b'"icon-question-sign"></em>\n    <a class="" href="/reset_passwords/new">I'
 b' don&rsquo;t remember my password</a>\n    <br />\n    <em class="icon-que'
 b'stion-sign"></em>\n    <a href="javascript:feedback.showForm();" title = '
 b"'Need Help? Click on the Feedback button to request help'>Help</a>\n  </p"
 b'>\n</form>\n<aside class="govt-msg">\n  <div class="nasa-logo"></div>\n  <p>'
 b'<strong>Why must I register?</strong></p>\n  <p>\n    The Earthdata Login '
 b'provides a single mechanism for user registration and profile management for'
 b' all EOSDIS system components (DAACs, Tools, Services).\n    Your Earthda'
 b'ta login also helps the EOSDIS program better understand the usage of EOSDIS'
 b' services to improve user experience through customization of tools and impr'
 b'ovement of services.\n    EOSDIS data are openly available to all and fre'
 b'e of charge except where governed by international agreements.\n  </p>\n</'
 b'aside>\n\n</section>\n<section id="cta">\n  <h3>Get single sign-on access to'
 b' all your favorite EOSDIS sites</h3>\n      <a class="eui-btn--round eui-'
 b'btn--blue" href="/users/new?client_id=mACp-6quKkkPZ3FiVl2Rng&amp;redirect_ur'
 b'i=https%3A%2F%2Flance.itsc.uah.edu%2Furs-redirect&amp;response_type=code&amp'
 b';state=aHR0cHM6Ly9sYW5jZS5pdHNjLnVhaC5lZHUvYW1zcjItc2NpZW5jZS9kYXRhL2xldmVsM'
 b'y9zZWFpY2UxMi9SMDQvaGRmZW9zNS9BTVNSX1UyX0wzX1NlYUljZTEya21fUjA0XzIwMjMwOTIzL'
 b'mhlNQ">Register for a Profile</a>\n</section>\n<div class="govt-warning eu'
 b'i-info-box">\n  <div class="warning-desktop">\n    <p>\n      <strong>\n    '
 b'    Protection and maintenance of user profile information is described '
 b'in\n        <a href="https://www.nasa.gov/about/highlights/HP_Privacy.htm'
 b'l">NASA\'s Web Privacy Policy.</a>\n        </strong> \n    </p>\n  </di'
 b'v>\n  <div class="warning-mobile">\n    <p>\n      <strong>\n        Protect'
 b'ion and maintenance of user profile information is described in\n        '
 b'    <a href="https://www.nasa.gov/about/highlights/HP_Privacy.html">NASA'
 b"'s Web Privacy Policy.</a>\n      </strong> \n    </p>\n  </div>\n  <div cla"
 b'ss="warning-mobile-mini">\n    <strong>\n      US Govt Property. Unauthori'
 b'zed use subject to prosecution. Use subject to monitoring per\n      <a h'
 b'ref="https://nodis3.gsfc.nasa.gov/displayDir.cfm?t=NPD&c=2810&s=1E">NPD2810<'
 b'/a>.\n    </strong>\n  </div>\n</div>\n\n\n    </div>\n    <footer role="co'
 b'ntentinfo">\n  <h3>For questions regarding the EOSDIS Earthdata Login, pl'
 b'ease contact <a href="javascript:feedback.showForm();" title="Earthdata Supp'
 b'ort form">Earthdata Support</a></h3>\n  <ul>\n    <li class="version badge'
 b' eui-badge--md">V 4.180.0\n</li>\n    <li><a href="/">Home</a></li>\n    <l'
 b'i><a href="/users/new">Register</a></li>\n    <li><a title="NASA Home" hr'
 b'ef="http://www.nasa.gov">NASA</a></li>\n  </ul>\n  <p>NASA Official: Steph'
 b'en Berrick</p>\n</footer>\n\n    <script src="/assets/application-26ef2d894'
 b'36774b62209186400ab34914d3661de4b009da594e25783d8575bad.js"></script>\n  '
 b'  <script type="text/javascript">\n    $(window).scroll(function(e){\n    '
 b'  parallax();\n    });\n    function parallax(){\n      var scrolled = $(wi'
 b"ndow).scrollTop();\n      $('#content').css('background-position', 'right"
 b' \' + -(scrolled*0.25)+\'px \');\n    }\n    </script>\n    <script src="h'
 b'ttps://cdn.earthdata.nasa.gov/tophat2/tophat2.js" id="earthdata-tophat-scrip'
 b't" data-show-fbm="true" data-show-status="true" data-status-api-url="https:/'
 b'/status.earthdata.nasa.gov/api/v1/notifications"></script>\n    <script t'
 b'ype="text/javascript" src="https://fbm.earthdata.nasa.gov/for/URS4/feedback.'
 b'js"></script>\n    <script type="text/javascript">\n      feedback.init();'
 b'\n    </script>\n    <script type="text/javascript">\n        setTimeout(fu'
 b'nction()\n                {var a=document.createElement("script"); var b='
 b'document.getElementsByTagName("script")[0];\n                    a.src=do'
 b'cument.location.protocol+"//dnn506yrbagrg.cloudfront.net/pages/scripts/0013/'
 b'2090.js?"+Math.floor(new Date().getTime()/3600000);\n                    '
 b'a.async=true;a.type="text/javascript";b.parentNode.insertBefore(a,b)}\n  '
 b'              , 1);\n    </script>\n\n    <!-- BEGIN: DAP Google Analytics '
 b' -->\n    <script language="javascript" id="_fed_an_ua_tag" src="https://'
 b'dap.digitalgov.gov/Universal-Federated-Analytics-Min.js?agency=NASA&subagenc'
 b'y=GSFC&dclink=true"></script>\n    <!-- END: DAP Google Analytics  -->\n\n '
 b'   \n  </body>\n</html>\n')

trey-stafford commented 1 year ago

It looks like I'm getting the HTML for the EDL login page instead of the data. Maybe I'm not doing something right with auth?
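
The hidden `state` field in the login form above appears to be unpadded URL-safe base64; that is an assumption based on its shape, not something documented in this thread. Decoding its leading portion suggests the login page was issued for the lance.itsc.uah.edu mirror, which matches the form's `redirect_uri`, even though the URL being requested was on lance.nsstc.nasa.gov. `decode_state` is a hypothetical helper written for this sketch.

```python
import base64


def decode_state(state):
    """Decode an unpadded URL-safe base64 string to text."""
    # Restore the '=' padding that base64 decoding requires.
    padded = state + "=" * (-len(state) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# Leading portion of the `state` value from the login page above.
state_prefix = "aHR0cHM6Ly9sYW5jZS5pdHNjLnVhaC5lZHUv"
print(decode_state(state_prefix))  # https://lance.itsc.uah.edu/
```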

MattF-NSIDC commented 1 year ago

I think there's an issue with the auth endpoint for these granules? After clicking a data link in my browser and being redirected to EDL, I entered my credentials, and then was redirected to https://lance.nsstc.nasa.gov/urs-redirect which gave 403. After doing that, I'm able to go back to the CMR search results and click the data links and see the files.

After logging in once, I get a message like "so and so has been added to your authorized EDL applications". Can you try logging in as the account in question and then clicking the data links in your browser? I hope once the authorization step is done you may have different results.

trey-stafford commented 1 year ago

I have manually downloaded the files with the earthdata account I'm using to authenticate with earthaccess. The results are the same from earthaccess's side.

I'm able to download the granules with some code adapted from qgreenland:

import os

import earthaccess
import requests

_URS_COOKIE = "urs_user_already_logged"
_CHUNK_SIZE = 8 * 1024

def _get_earthdata_creds():
    if not os.environ.get("EARTHDATA_USERNAME"):
        raise RuntimeError("Environment variable EARTHDATA_USERNAME must be defined.")
    if not os.environ.get("EARTHDATA_PASSWORD"):
        raise RuntimeError("Environment variable EARTHDATA_PASSWORD must be defined.")

    return (
        os.environ["EARTHDATA_USERNAME"],
        os.environ["EARTHDATA_PASSWORD"],
    )

def _create_earthdata_authenticated_session(s=None, *, hosts: list[str], verify):
    if not s:
        s = requests.session()

    for host in hosts:
        resp = s.get(
            host,
            # We only want to inspect the redirect, not follow it yet:
            allow_redirects=False,
            # We don't want to accidentally fetch any data:
            stream=True,
            verify=verify,
        )
        # Copy the headers so they can be used case-insensitively after the
        # response is closed.
        headers = {k.lower(): v for k, v in resp.headers.items()}
        resp.close()

        redirected = resp.status_code == 302
        redirected_to_urs = (
            redirected and "urs.earthdata.nasa.gov" in headers["location"]
        )

        if not (redirected_to_urs):
            print(f"Host {host} did not redirect to URS -- continuing without auth.")
            return s

        auth_resp = s.get(
            headers["location"],
            # Don't download data!
            stream=True,
            auth=_get_earthdata_creds(),
        )
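        # Check the outcome and read any error text before releasing the connection.
        authenticated = auth_resp.ok and s.cookies.get(_URS_COOKIE) == "yes"
        error_text = "" if authenticated else auth_resp.text
        auth_resp.close()
        if not authenticated:
            msg = f"Authentication with Earthdata Login failed with:\n{error_text}"
            raise RuntimeError(msg)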

        print(f"Authenticated for {host} with Earthdata Login.")

    return s

def _download_lance_files():
    results = earthaccess.search_data(short_name="AU_SI12_NRT_R04")

    for granule in results:
        # There are two links for each granule: one for lance.nsstc.nasa.gov and
        # the other for lance.itsc.uah.edu. The first one is fine.
        url = granule.data_links(access="external")[0]
        session = _create_earthdata_authenticated_session(hosts=[url], verify=True)
        with session.get(
            url,
            timeout=60,
            stream=True,
            headers={"User-Agent": "NSIDC-dev-trst2284"},
        ) as resp:
            # e.g., https://lance.nsstc.nasa.gov/.../AMSR_U2_L3_SeaIce12km_P04_20230926.he5
            # -> AMSR_U2_L3_SeaIce12km_P04_20230926.he5
            fn = url.split("/")[-1]
            with open(f"/tmp/test/{fn}", "wb") as f:
                for chunk in resp.iter_content(chunk_size=_CHUNK_SIZE):
                    f.write(chunk)

            print(f"wrote {fn}")

if __name__ == "__main__":
    _download_lance_files()

asteiker commented 22 hours ago

@trey-stafford @MattF-NSIDC We're unsure if this issue was resolved by #308 or not. Or potentially this is an issue outside of earthaccess and more of an issue with the collection's auth endpoint?

trey-stafford commented 21 hours ago

@asteiker, no, #308 didn't resolve this. #308 contained some fixups that I found while debugging the problem, but I never found a solution.

I'm testing with the latest version of earthaccess right now, though, and it seems like it might be working. However, I'm running into another issue: these granules have two data links that point to the same data. One is pretty fast, but the other is slow to download.

I'll wait until my test is completed to inspect the files and verify they look correct, and then if they do, we can probably close this ticket and open another to address duplicate data links from different mirrors.

trey-stafford commented 21 hours ago

E.g., here are the "duplicate" data links for one of the results:

>>> results[-1].data_links()
[
    'https://lance.nsstc.nasa.gov/amsr2-science/data/level3/seaice12/R04/hdfeos5/AMSR_U2_L3_SeaIce12km_R04_20241016.he5', 
    'https://lance.itsc.uah.edu/amsr2-science/data/level3/seaice12/R04/hdfeos5/AMSR_U2_L3_SeaIce12km_R04_20241016.he5'
]
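
As a possible stopgap, the links can be reduced to one URL per file name before downloading. This is a minimal sketch, not earthaccess behaviour: `pick_one_link_per_file` and its `preferred_hosts` parameter are hypothetical, and which mirror to prefer is left to the caller.

```python
from urllib.parse import urlparse


def pick_one_link_per_file(links, preferred_hosts=()):
    """Keep one URL per file name, favouring hosts listed earlier in preferred_hosts."""

    def rank(url):
        # Lower rank wins; hosts not listed share the lowest priority.
        host = urlparse(url).netloc
        if host in preferred_hosts:
            return preferred_hosts.index(host)
        return len(preferred_hosts)

    best = {}
    for url in links:
        name = urlparse(url).path.rsplit("/", 1)[-1]
        if name not in best or rank(url) < rank(best[name]):
            best[name] = url
    return list(best.values())


links = [
    "https://lance.nsstc.nasa.gov/amsr2-science/data/level3/seaice12/R04/hdfeos5/AMSR_U2_L3_SeaIce12km_R04_20241016.he5",
    "https://lance.itsc.uah.edu/amsr2-science/data/level3/seaice12/R04/hdfeos5/AMSR_U2_L3_SeaIce12km_R04_20241016.he5",
]
# Prefer the nsstc mirror purely for illustration.
print(pick_one_link_per_file(links, preferred_hosts=("lance.nsstc.nasa.gov",)))
```

With no preference given, the first link seen for each file name is kept.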

trey-stafford commented 20 hours ago

I've confirmed that earthaccess v0.11.0 now downloads the data!

Just further confirming that earthaccess is getting and processing multiple links for the same granule:

>>> len(files)
28
>>> len(set(files))
14

The list of downloaded files returned by earthaccess in my original example given above contains duplicates.
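
Until duplicates are handled upstream, the returned list can be collapsed on the caller's side. The sketch below assumes `files` is a plain list of path strings, as the `len(set(files))` check above implies; `summarize_duplicates` and the sample paths are hypothetical.

```python
from collections import Counter


def summarize_duplicates(files):
    """Return (unique files in first-seen order, {file: count} for repeated entries)."""
    counts = Counter(files)
    unique = list(dict.fromkeys(files))  # dict keys preserve insertion order
    repeated = {name: n for name, n in counts.items() if n > 1}
    return unique, repeated


# Hypothetical paths standing in for the list returned by earthaccess.download.
files = [
    "/tmp/test/a.he5",
    "/tmp/test/a.he5",
    "/tmp/test/b.he5",
    "/tmp/test/b.he5",
]
unique, repeated = summarize_duplicates(files)
print(len(files), len(unique))  # 4 2
```

Unlike `set(files)`, this keeps the files in the order they were returned.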

asteiker commented 19 hours ago

Thanks for re-testing and confirming that this is now downloading. I'll enter a new Issue on the multiple links.

mfisher87 commented 15 hours ago

Hey @asteiker, gentle reminder that I no longer check the @MattF-NSIDC account :)

asteiker commented 1 hour ago

@mfisher87 yes! So sorry I grabbed the wrong handle yesterday. It pops up automatically for me, and it looks like I made a few mistakes.