yuvipanda commented 1 year ago

Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.

However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent alongside (see https://github.com/nsidc/earthaccess/issues/187 for more vague info).

We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.

yuvipanda commented 1 year ago

You actually get a 400, and here is the smallest sample case:

import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')

auth = aiohttp.BasicAuth(username, password)

async def main():
    async with aiohttp.ClientSession(auth=auth) as session:
        async with session.get(url) as response:
            print(response.status)
            print((await response.read())[:30])

asyncio.run(main())

When running from inside us-west-2, this prints:

400
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidArgument</Code><Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message><ArgumentName>Authorization</ArgumentName><ArgumentValue>Basic eXV2aXBhbmRhOmFpc2hlZTh3b29naGFobmdpZW1vb3Nob0thaXhpaWJl</ArgumentValue><RequestId>XM26KTSJ4X85W6YR</RequestId><HostId>gjjlJGJmgjalTBXzAnnMg4eBl2MCd3k9UD4klvAO3Rjd18TOB3QCgDC3bAMwciPyIRrStqrD4SQ=</HostId></Error>

which is pretty clear and useful!

And on my laptop, this prints:

200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08\x00\x04\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

So we have a reproducible setup now. fsspec uses aiohttp under the hood, so this is the same issue fsspec is facing.
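Rather than eyeballing the first bytes each time, the two outcomes above can be told apart in code. This is a minimal sketch: the function names and the returned labels are made up for illustration, and it assumes the file has no HDF5 user block, so the signature sits at offset 0.

```python
# The 8-byte HDF5 signature, as seen at the start of the 200 response above.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(first_bytes: bytes) -> bool:
    """Return True if the payload starts with the HDF5 signature."""
    return first_bytes.startswith(HDF5_SIGNATURE)


def describe_response(status: int, first_bytes: bytes) -> str:
    """Summarize one attempt; the labels are illustrative, not from any library."""
    if status == 200 and looks_like_hdf5(first_bytes):
        return "ok: got HDF5 data"
    if first_bytes.lstrip().startswith(b"<?xml"):
        return f"error {status}: XML error document (likely from S3)"
    return f"error {status}: unexpected payload"
```

Passing `response.status` and the first 30 bytes from the aiohttp snippet into `describe_response` should be enough to distinguish the us-west-2 failure from the laptop success.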

yuvipanda commented 1 year ago

This is likely the aiohttp bug actually: https://github.com/aio-libs/aiohttp/issues/2610

yuvipanda commented 1 year ago

Here is the same code with requests:

import requests
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')

resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])

This actually produces the correct output on both my laptop and on us-west-2!

200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08'

This is most likely because requests implemented https://github.com/request/request/pull/1184, while the equivalent bug with aiohttp is still open.

This is amazing news, as this means that fixing https://github.com/aio-libs/aiohttp/issues/2610 should get fsspec to work, which means most of the pangeo stack would work after that. It will still have lower performance than using s3 directly when in us-west-2, so work there still needs to be done. But this will at least make sure regular https URLs work when both inside and outside us-west-2

yuvipanda commented 1 year ago

aiohttp has documented this should not be the case, based on the note here: https://docs.aiohttp.org/en/stable/client_advanced.html?highlight=redirects#custom-request-headers

yuvipanda commented 1 year ago

I also looked at the request being made by aiohttp, and see the following:

RequestInfo(url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf6'), method='GET', headers=<CIMultiDictProxy('Host': 'nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Python/3.9 aiohttp/3.8.3', 'Authorization': 'Basic <removed>')>, real_url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf
'))

So I think this confirms that the Authorization header is being retained during redirects.
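To check every hop rather than just the final request, something like the following should work. It is a sketch, not a tested tool: `hosts_receiving_authorization` and `collect_hops` are names I made up, and it relies on `response.history` and `request_info` from aiohttp's client response objects.

```python
from urllib.parse import urlsplit


def hosts_receiving_authorization(hops):
    """Given (url, headers) pairs for each hop, list hosts that were sent Authorization."""
    leaked = []
    for url, headers in hops:
        # Header names are case-insensitive, so compare in lowercase.
        if any(name.lower() == "authorization" for name in headers):
            leaked.append(urlsplit(str(url)).hostname)
    return leaked


async def collect_hops(url, auth):
    """Fetch url with session-level auth and return (url, headers) for every hop."""
    import aiohttp  # imported lazily so the helper above works without aiohttp

    async with aiohttp.ClientSession(auth=auth) as session:
        async with session.get(url) as response:
            # response.history holds the redirect responses; the last hop is `response`.
            chain = list(response.history) + [response]
            return [(r.request_info.url, dict(r.request_info.headers)) for r in chain]
```

Running `hosts_receiving_authorization(asyncio.run(collect_hops(url, auth)))` would then list each host that saw the Basic credentials.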

yuvipanda commented 1 year ago

I've now found @betolink's comment in https://github.com/aio-libs/aiohttp/issues/5783#issuecomment-981958210, which made me realize that what we want is for the credentials to be forwarded when we are redirected to earthdata login, and dropped after that. What we are getting instead is the credentials being sent to every host in the chain.

yuvipanda commented 1 year ago

AHA, so what's actually happening is that we are setting the basic auth on the session, rather than on the request. So it's being sent to every request from the session, including S3! This actually now is unrelated to the aiohttp bug

yuvipanda commented 1 year ago

If I move the auth= to just the request, I get a basic 401 denied, as the Basic auth is dropped during the redirect, which is correct and documented aiohttp behavior.

So the question now really is why does requests work?

Separately, it should be possible for us to subclass aiohttp's ClientSession to pass per-host basicauth so it can provide appropriate auth to different hosts in the chain, and just send basic auth to earthdata.
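A rough sketch of the per-host idea. Note this swaps in a different technique from subclassing: it turns off automatic redirects and follows them by hand, attaching Basic auth only when the next hop is EDL. The names `credentials_for` and `fetch` are hypothetical, and I have not run this against EDL.

```python
from urllib.parse import urljoin, urlsplit

EDL_HOST = "urs.earthdata.nasa.gov"  # the only host that should ever see the password
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def credentials_for(url, username, password, allowed_host=EDL_HOST):
    """Return (username, password) only when url points at the allowed host."""
    if urlsplit(url).hostname == allowed_host:
        return (username, password)
    return None


async def fetch(url, username, password, max_redirects=10):
    """Follow redirects by hand, attaching Basic auth only on hops to EDL."""
    import aiohttp  # lazy import: the helper above is usable without aiohttp

    async with aiohttp.ClientSession() as session:  # cookies persist across hops
        for _ in range(max_redirects):
            creds = credentials_for(url, username, password)
            auth = aiohttp.BasicAuth(*creds) if creds else None
            async with session.get(url, auth=auth, allow_redirects=False) as resp:
                if resp.status in REDIRECT_STATUSES and "Location" in resp.headers:
                    # Location may be relative, so resolve it against the current URL.
                    url = urljoin(url, resp.headers["Location"])
                    continue
                return resp.status, await resp.read()
    raise RuntimeError("too many redirects")
```

The signed S3 URL at the end of the chain never gets an Authorization header this way, since its host is not EDL.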

yuvipanda commented 1 year ago

ok, so I have discovered why it works with requests but not with aiohttp.

It is because requests supports netrc lol!

So at the first redirect, requests drops the Authorization header, but when making the request to EDL, it reads netrc file directly and sends the appropriate credentials! So that is why it works by default with requests, and not with aiohttp.
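The per-host lookup that makes this work can be reproduced with the standard library's netrc module. A minimal sketch, with `netrc_credentials` being a name I picked for illustration (it is not what requests calls its own helper):

```python
import netrc
from urllib.parse import urlsplit


def netrc_credentials(url, netrc_path=None):
    """Return (login, password) for url's host from a netrc file, or None."""
    try:
        entries = netrc.netrc(netrc_path)  # None means the default ~/.netrc
    except (FileNotFoundError, netrc.NetrcParseError):
        return None
    found = entries.authenticators(urlsplit(url).hostname)
    if found is None:
        return None
    login, _account, password = found
    return (login, password)
```

With only a `machine urs.earthdata.nasa.gov` entry in the file, this returns credentials for EDL URLs and None for the data host or the signed S3 URL, which is exactly the "send it only to EDL" behavior we want.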

So to summarize, the current problem is that we pass parameters to fsspec that are set at the ClientSession level, and those are sent with every request. So the Authorization header is also sent when making the request to S3, and it fails. This is validated with the following code:

import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')

auth = aiohttp.BasicAuth(username, password)

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, auth=auth) as response:
            print(response.status)
            print((await response.read())[:30])

asyncio.run(main())

This actually will fail with an HTTP Basic request denied error anywhere, which makes sense - the Authorization header is dropped at the first redirect to EDL, and then we get an access denied.

If I recreate this with requests by deleting my netrc file:

import requests
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username = "yuvipanda"
password = "mypassword"
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])

I get the exact same behavior.

WHICH IS GREAT! So the problem now isn't to do with redirects at all, it is really - how do we make sure to send the HTTP Basic Creds just to EDL? Because right now, the reason this works with non-cloud datasets is that we are actually leaking plaintext EDL creds to all of them, completely negating the point of OAuth2 :D

yuvipanda commented 1 year ago

I see trust_env passed along to the aiohttp session, but aiohttp only uses this for proxies, not for authenticating to servers themselves.

yuvipanda commented 1 year ago

So the current issue is really that aiohttp has no way to say 'for this domain, send this authentication information'. requests accidentally provides this with netrc, but otherwise doesn't afaict.

yuvipanda commented 1 year ago

So, netrc support is actually the easiest way to make sure that we can send specific Basic Auth credentials only to specific Hosts. So I made this PR adding it to aiohttp! https://github.com/aio-libs/aiohttp/pull/7131

If merged and released, this should sort of automatically make fsspec work again.

betolink commented 1 year ago

Amazing work @yuvipanda! I'm just catching up with this thread. One thing I'd like to mention is that, if possible, it would be preferable to have a solution/workaround that does not rely on having a .netrc (even though that is what we have been doing for the tutorials).

yuvipanda commented 1 year ago

@betolink so I think these tokens (https://urs.earthdata.nasa.gov/documentation/for_users/user_token) should get rid of the need for netrc completely. I have no idea why people are restricted to just two tokens per user - that makes it definitely harder to use :(

yuvipanda commented 1 year ago

I dug some more into what fsspec would need to do for us to use client tokens.

fsspec currently supports a client_kwargs that allows setting headers and other misc options for all requests. This accidentally works now when making requests behind EDL from outside us-west-2, but doesn't work from inside (for all the reasons outlined in this issue). So we can not use the auth tokens with it either.

What we need is something like request_kwargs (that is passed into places like https://github.com/fsspec/filesystem_spec/blob/45de5b509bacf8a62d99848bb2361cc78733ad09/fsspec/implementations/http.py#L242 and everywhere else requests are constructed). This allows these params to be set just for the originating request, but not for any follow-on redirects from there. This wouldn't help when using username / password for EDL (as the username / password needs to be sent for a request along the redirect path, not the originating request), but would work for using tokens (as they must be only sent to the originating request).

I think this is a fairly well scoped and small change to fsspec that would be extremely useful! I'm super swamped though, I am hoping someone else can implement this?

yuvipanda commented 1 year ago

Opened https://github.com/fsspec/filesystem_spec/issues/1142 to discuss what would help solve the issue from fsspec in allowing us to use tokens!

yuvipanda commented 1 year ago

Turns out this already exists in fsspec - any kwargs you pass in actually get passed directly to the requests, exactly what we wanted!

So the following code works for me :)

from fsspec.implementations.http import HTTPFileSystem
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})

with fs.open(url) as f:
    print(f.read()[:30])

yay!

yuvipanda commented 1 year ago

ok, so current summary is:

https://github.com/aio-libs/aiohttp/pull/7131 adds .netrc support to aiohttp, and hence to fsspec. This is needed for earthdata login access to work consistently in AWS us-west-2 with fsspec the same way it works elsewhere, while using earthdata username / password to login.
However, I think we should recommend everyone use tokens for actually authenticating programmatically - https://urs.earthdata.nasa.gov/documentation/for_users/user_token. This already works with fsspec - just pass headers as a kwarg as shown in the comment above, rather than as a part of client_kwargs. yay!

Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you can not just generate a token for each machine you would use it on, as you can with GitHub personal access tokens. However, since tokens don't need any specific file on disk, this would also work with dask.
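One way to keep the token out of files entirely is an environment variable. This is a sketch under assumptions: the variable name `EARTHDATA_TOKEN` and both function names are my own choices for illustration, and it assumes the variable is also set in the environment of any dask workers.

```python
import os


def bearer_headers(env_var="EARTHDATA_TOKEN"):
    """Build an Authorization header from a token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"set {env_var} to an EDL user token first")
    return {"Authorization": f"Bearer {token}"}


def open_filesystem():
    """Create an fsspec HTTP filesystem that sends the token with its requests."""
    from fsspec.implementations.http import HTTPFileSystem  # lazy import

    return HTTPFileSystem(headers=bearer_headers())
```

`open_filesystem().open(url)` can then be used the same way as the `fs.open(url)` call in the earlier token example.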

yuvipanda commented 1 year ago

Here is an example of it working with xarray!

from fsspec.implementations.http import HTTPFileSystem
import xarray as xr

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})
ds = xr.open_dataset(fs.open(url))
ds

betolink commented 1 year ago

This is awesome @yuvipanda! I feel like we need to refactor this library to only use CMR tokens everywhere instead of monkey-patching OAuth2 redirects for cloud-hosted data. I wish DAAC hosted data would follow the same behavior with bearer tokens. i.e.

# bearer token for the win with cloud hosted data !!
# url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

# =( bearer token? don't know him.
url = "https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2019.02.21/ATL08_20190221121851_08410203_005_01.h5"

Also, maybe we only need one token even if we use it concurrently from different processes? I haven't tested but I suspect it should work.

yuvipanda commented 1 year ago

@betolink yeah we should only need one token even if it is used concurrently.

So the token only works for some datasets but not all? And works for cloud datasets but not on-prem? Does it work for any on prem thing at all?

betolink commented 1 year ago

I'm afraid it won't work for on-prem data. It may work for some data hosted at the ASF DAAC that is marked on-prem but actually hosted at AWS.

This is tremendous progress! Now there is a clear path for one of the most common access patterns!

yuvipanda commented 1 year ago

@betolink feels like long term, the right way is to get the access token to work for all data, and support the earthdatalogin folks in this mission. In the meantime, netrc is the more universal solution, once we get the aiohttp PR merged. But that is slightly messy when it comes to dask, because it requires populating a specific file on the dask worker, which is not always easy. Does that sound right?

yuvipanda commented 1 year ago

Me and @briannapagan did another bit of deep dive here, and made some more progress.

There seem to be two primary packages supporting earthdata login on the server side:

TEA (https://github.com/asfadmin/thin-egress-app/) - this is what does the work for cloud hosted data
An apache2 module (https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/browse) which seems to be used by most on-prem datacenters. It also seems to provide authentication for most OpenDAP servers (https://opendap.github.io/hyrax_guide/Master_Hyrax_Guide.html#_earthdata_login_oauth2).

We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?!

As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named URSFOUR-1600 that also tracks this feature.

With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that through looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying 'some apache change' to help with that. So the hypothesis I had was:

LPDAAC ran into some other unrelated issue,
Which required code changes to the apache module, which was done via URSFOUR-1858
They have deployed this change to their servers
However, since this change was deployed, it is also likely that LPDAAC has included URSFOUR-1600 (user token support) in the deployment as well. Not necessarily explicitly, but just as a side effect of trying to deploy the more recent URSFOUR-1858.

I tested this hypothesis by trying to send a token to https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5 - a dataset hosted by LPDAAC. And behold, it works! So all data hosted by LPDAAC supports tokens :)

So the pathway to using tokens everywhere, including onprem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module.

This is great news for many reasons:

No new code needs to be written! This all is already done.
LPDAAC already deployed this, so it isn't a brand new deployment
This is the official apache module that DAACs are already using, not some newfangled new software.

yuvipanda commented 1 year ago

Also, passing -v to curl will send you back the response headers, which usually contain < Server: Apache to indicate they are using the apache2 server - and hence most likely 'on-prem' (aka not coming from S3)
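The same check can be written as a small helper over any response's headers. It is only a heuristic, and `hosting_guess` and its labels are names I made up for this sketch; the `AmazonS3` value is what S3 sends in its Server header.

```python
def hosting_guess(headers):
    """Guess where a response came from based on its Server header (heuristic only)."""
    # Header names are case-insensitive; normalize before looking one up.
    server = {k.lower(): v for k, v in headers.items()}.get("server", "").lower()
    if server.startswith("apache"):
        return "apache (likely on-prem)"
    if server.startswith("amazons3"):
        return "s3"
    return "unknown"
```

For example, passing `resp.headers` from a requests response for one of the URLs in this thread gives a quick label for which kind of server answered.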

yuvipanda commented 1 year ago

NSIDC also seems to have the latest version of the apache module - https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL06.005/2020.03.08/ATL06_20200308234154_11190602_005_01.h5 works with the token!

So looks like some (many?) DAACs have this deployed, and some don't.

yuvipanda commented 1 year ago

@betolink in fact, the exact URL you used to test tokens earlier in https://github.com/nsidc/earthaccess/issues/188#issuecomment-1364042546 works now. My suspicion is that NSIDC deployed the latest version of the apache2 module very recently?

yuvipanda commented 1 year ago

ASDC also supports tokens, as tested with https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_VFM-Standard-V4-20/2010/09/CAL_LID_L2_VFM-Standard-V4-20.2010-09-01T00-14-43ZN.hdf.

Again, I'm using the presence of Server: apache to distinguish on-prem vs S3 hosted data. I think it's reasonably accurate.

yuvipanda commented 1 year ago

ORNL also supports it, as tested via https://daac.ornl.gov/daacdata/deltax/DeltaX_Ecogeomorphic_Products/data/DeltaX_EcoGeoCells_2021_TerrebonneEast_std_superpixels.tif.

yuvipanda commented 1 year ago

Note that uppercase Bearer is what I'm using, as that's what the apache module supports (see line 684 in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187#mod_auth_urs.c).

yuvipanda commented 1 year ago

PODAAC (tested with https://podaac-tools.jpl.nasa.gov/drive/files/allData/topex/L1B/altsdr/001/altsdr001052.txt) and SEDAC (tested with https://sedac.ciesin.columbia.edu/downloads/data/urbanspatial/urbanspatial-urban-land-backscatter-time-series-1993-2020/urbanspatial-urban-land-backscatter-time-series-1993-2020-seasonal-urban-netcdf.zip) don't have the latest version of the module, though.
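For anyone who wants to repeat these checks against other DAACs, here is roughly how the probe could be scripted with requests. It is a sketch with hypothetical names (`token_accepted`, `probe`). The important detail is `trust_env = False`: as found earlier in this thread, requests reads .netrc by itself, which would hide a server that ignores the token. Note it also disables proxy settings from the environment.

```python
from urllib.parse import urlsplit

EDL_HOST = "urs.earthdata.nasa.gov"


def token_accepted(status, final_url):
    """Interpret a probe: success is a 200 that did not end up at the EDL login host."""
    return status == 200 and urlsplit(final_url).hostname != EDL_HOST


def probe(url, token):
    """Request url with a bearer token (without downloading the body) and report the outcome."""
    import requests  # lazy import so token_accepted() works without requests

    with requests.Session() as session:
        session.trust_env = False  # do not let a local .netrc mask a failure
        resp = session.get(
            url,
            headers={"Authorization": f"Bearer {token}"},
            stream=True,  # only fetch headers; the files can be large
            timeout=30,
        )
        try:
            return token_accepted(resp.status_code, resp.url)
        finally:
            resp.close()
```

A False result means the request either failed or was bounced to the EDL login page, which is what I would expect from a server without token support.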

yuvipanda commented 1 year ago

This issue has spawned off many different things, so here's a quick summary:

1. Support using HTTPS + .netrc with xarray universally

Currently, it is not possible to use .netrc files with xarray if you are running from inside us-west-2. So if you are inside us-west-2 and want to access cloud hosted data with xarray, you must use S3 (not plain HTTPS). Once https://github.com/aio-libs/aiohttp/pull/7131 lands and a new release of aiohttp is made, this issue will go away. So code that uses HTTPS+netrc will universally work, regardless of it being in us-west-2 or elsewhere.

2. Support using EDL user tokens universally

For cloud access, EDL tokens already work with xarray (https://github.com/nsidc/earthaccess/issues/188#issuecomment-1363450269 has an example). However, it doesn't work universally - many on-prem servers don't support EDL tokens yet, although some do. Me (from outside NASA) and @briannapagan (from inside) are pushing on this, getting EDL token support rolled out more universally. If you are inside a DAAC, we could use your help!

3. Determine when to use s3:// protocol vs https:// protocol when inside us-west-2

s3:// links only work from inside us-west-2, so we should have clear documentation on when users should use the s3:// protocol vs just https. From inside us-west-2, there could be a performance difference between these two, but my personal intuition is that it is not significant enough for many use cases, especially beginner cases. This is the part that has had the least amount of work done on it so far. We would need some test cases comparing s3 vs https from inside us-west-2 to establish this performance difference.
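A starting point for such test cases could be a tiny timing harness like the one below. It is only a sketch: `time_reads` and `compare` are illustrative names, and the callables you pass in are assumed to do one complete read (for example, opening a granule over s3:// or https:// and reading a fixed number of bytes).

```python
import statistics
import time


def time_reads(read_once, repeats=5):
    """Call read_once() repeatedly and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_once()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def compare(readers, repeats=5):
    """Time each named reader and return {name: median_seconds}."""
    return {name: time_reads(fn, repeats) for name, fn in readers.items()}
```

Using the median over a few repeats should dampen one-off network hiccups, though caching between repeats could still skew the numbers.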

End goal for education

My intuitive end goal here is to be able to tell people to 'use HTTPS links with token auth' universally, regardless of where they are accessing the data from, with an addendum suggesting using the s3:// protocol under specific performance circumstances. A step along the way is to be able to tell people to use HTTPS links with netrc universally.

yuvipanda commented 1 year ago

An addendum to https://github.com/nsidc/earthaccess/issues/188#issuecomment-1371626230 that me and @briannapagan discovered is OpenDAP, offered mostly by the Hyrax server. It also uses the apache module for authentication (https://opendap.github.io/hyrax_guide/Master_Hyrax_Guide.html#_earthdata_login_oauth2), regardless of whether it is on-prem or on the cloud. So my understanding is that all opendap behind earthdata is using the apache module, so those would also need the module updated to support the token. This also means that the apache module is going to be with us for a long time, not just for on-prem work, as it is used for cloud hosted opendap too.

briannapagan commented 1 year ago

Writing this as a reminder: the apache module requires a capital B in Bearer, which matters for on-premise files. The capitalized form also works for cloud files, so use the following everywhere: curl -H "Authorization: Bearer TOKEN" -L --url 'URL' > out

ashiklom commented 1 year ago

May not be the right thread, but dropping a note here so it's more permanent than Slack:

A while ago, I stumbled across these Twitter threads about some climate data stored in Zarr on OpenStorageNetwork's S3 buckets with HTTP URLs. The example they show accesses Zarr directly via an HTTP URL.

https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ

Here's a direct link to the corresponding Pangeo feedstock (in case Twitter dies): https://pangeo-forge.org/dashboard/feedstock/79

From what I can tell, the underlying storage here is OpenStorageNetwork, which provides the S3 API via Ceph. How exactly all of this is wired and optimized is a bit beyond me, but the end result is compelling and may have some interesting lessons for how we do S3/HTTP.

briannapagan commented 1 year ago

Bringing in @cisaacstern to maybe provide some extra feedback to Alexey's last message.

cisaacstern commented 1 year ago

Happy to contribute however I can! We do currently use an OSN allocation as our default storage target for Pangeo Forge.

alexgleith commented 1 year ago

Do I have to be in the AWS us-west-2 region to access data direct from S3?

Seems so, as my notebook that doesn't work from my laptop does work when running in an EC2 instance in Oregon...

andypbarrett commented 1 year ago

Hi Alex,

yes, you must have an EC2 instance running in the same region as the S3 bucket (us-west-2 for NASA data) to "Directly Access" the data.

Andy Barrett

alexgleith commented 1 year ago

Ok, I got it working using @yuvipanda's code above.

Needs a little fix in a related project, which I've raised as a PR: https://github.com/nasa/EMIT-Data-Resources/pull/24

betolink commented 1 year ago

@alexgleith just FYI, there are a few catches when we access HTTPS:// instead of S3://:

Speed: when we use HTTPS we are going through NASA's CloudFront proxy and opening a dataset could be slower than using the S3:// schema URLs. This is why earthaccess (this library) picks the right access pattern depending on where the code is running (us-west-2 or not).
Chunking affecting performance: related to the first point, if a file is chunked into hundreds of chunks, each will result in a separate HTTPS request that has to go through the proxy, and some datasets will be slower to access than others because of this.

Finally, if we are running our code in us-west-2, we can use S3FS with the S3:// urls and we can use earthaccess to get us the authenticated sessions if we know the DAAC.

import earthaccess
import xarray as xr

earthaccess.login()
url = "s3://some_nasa_dataset"
fs = earthaccess.get_s3fs_session("LPDAAC")

# we open our granule in a s3fs context and we work as usual
with fs.open(url) as file:
    dataset = xr.open_dataset(file)

alexgleith commented 1 year ago

Thanks @betolink

For the work I'm doing, it's exploratory so performance isn't important yet. And I don't think that for the NetCDF files chunking matters, since they're not optimised for it. (Happy to be corrected there!)

I'm just doing a little project on the EMIT data and there's enough complexity in the data itself that I'm happy with the HTTPS loading process. Thanks for your help!

weiji14 commented 1 year ago

Speed: when we use HTTPS we are going through NASA's CloudFront proxy and opening a dataset could be slower than using the S3:// schema URLs. This is why earthaccess (this library) picks the right access pattern depending on where the code is running (us-west-2 or not).

Just linking some benchmarks from @hrodmn comparing s3:// and https:// access for a year's worth of Harmonized Landsat Sentinel-2 (HLS) data from LP-DAAC on us-west-2 at https://hrodmn.dev/posts/nasa-s3/index.html. There's about a 0.25 seconds speed advantage (8.08s with s3, 7.74s with https) which is fairly small, but if earthaccess can handle switching between s3/https based on the compute region, that would be awesome!

betolink commented 1 year ago

This is great @weiji14! Just this week @yuvipanda and I were talking about this and the pros and cons of defaulting to HTTPS. earthaccess handles the switch already: if it's running in AWS it will use S3, and HTTPS if not. It does this by requesting the instance metadata on an IP range only available inside AWS (although it does not check the region yet): https://github.com/nsidc/earthaccess/blob/54b688b906776f5c845483dd00676f6c681feb10/earthaccess/store.py#LL67C36-L67C36
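For the missing region check, something along these lines might work. To be clear, this is a sketch of the idea and not what earthaccess does today: `fetch_region` and `in_region` are hypothetical names, and it assumes the EC2 instance metadata service (IMDSv2) is reachable at its usual link-local address.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_region(timeout=1.0):
    """Ask the EC2 instance metadata service (IMDSv2) for the current region."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()
    region_request = urllib.request.Request(
        f"{IMDS}/meta-data/placement/region",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(region_request, timeout=timeout) as resp:
        return resp.read().decode().strip()


def in_region(expected="us-west-2", fetch=fetch_region):
    """Return True only when the metadata service reports the expected region."""
    try:
        return fetch() == expected
    except OSError:  # URLError and timeouts are OSError subclasses
        return False
```

Outside AWS the metadata address is unreachable, so `in_region()` falls through to False after the timeout and the caller can stick with HTTPS.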

if we request data like

import earthaccess
import xarray as xr
granules = earthaccess.search_data(...)
ds = xr.open_mfdataset(earthaccess.open(granules))

and run this code in AWS, it will use the S3 links and S3FS to open them. On a related issue, I still notice a lot of latency when we try to open files even in region (compared to just downloading them to our EC2 instance), something that needs to be further documented. In this example with stack_stac I'm not sure if under the hood they use S3FS or not.

yuvipanda commented 8 months ago

Just to note, aiohttp finally made a release! So fsspec now supports netrc correctly!
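For reference, this is roughly what using it from fsspec looks like. Treat the details as assumptions to verify: as far as I can tell the netrc lookup only happens when the session is created with `trust_env=True` and no explicit auth, and it needs an aiohttp release that includes aio-libs/aiohttp#7131 (I believe 3.9 or newer, but check the changelog). `open_with_netrc` is a name made up for this sketch.

```python
# Passed through fsspec to aiohttp.ClientSession; enables the .netrc lookup.
NETRC_CLIENT_KWARGS = {"trust_env": True}


def open_with_netrc(url, nbytes=30):
    """Read the first bytes of url, letting aiohttp pick up credentials from ~/.netrc."""
    from fsspec.implementations.http import HTTPFileSystem  # lazy import

    fs = HTTPFileSystem(client_kwargs=NETRC_CLIENT_KWARGS)
    with fs.open(url) as f:
        return f.read(nbytes)
```

With a `machine urs.earthdata.nasa.gov` entry in ~/.netrc, the credentials should only go to EDL and not to the signed S3 URL at the end of the redirect chain.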

mfisher87 commented 8 months ago

Sweet! Should we add a pin for aiohttp and mark this resolved @betolink ?

asteiker commented 2 weeks ago

@yuvipanda @betolink Can this be fully closed out now? Do we still need to pin this? We weren't seeing this referenced in https://github.com/nsidc/earthaccess/blob/main/pyproject.toml

nsidc / earthaccess

Document why signed S3 URLs might be giving 400s when called from inside us-west-2 #188
