Open yuvipanda opened 1 year ago
You actually get a 400, and here is the smallest sample case:
import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
auth = aiohttp.BasicAuth(username, password)
async def main():
    async with aiohttp.ClientSession(auth=auth) as session:
        async with session.get(url) as response:
            print(response.status)
            print((await response.read())[:30])
asyncio.run(main())
When running from inside us-west-2, this prints:
400
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidArgument</Code><Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message><ArgumentName>Authorization</ArgumentName><ArgumentValue>Basic eXV2aXBhbmRhOmFpc2hlZTh3b29naGFobmdpZW1vb3Nob0thaXhpaWJl</ArgumentValue><RequestId>XM26KTSJ4X85W6YR</RequestId><HostId>gjjlJGJmgjalTBXzAnnMg4eBl2MCd3k9UD4klvAO3Rjd18TOB3QCgDC3bAMwciPyIRrStqrD4SQ=</HostId></Error>
which is pretty clear and useful!
And on my laptop, this prints:
200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08\x00\x04\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
So we have a reproducible setup now. fsspec uses aiohttp under the hood, so this is the same issue fsspec is facing
This is likely the aiohttp bug actually: https://github.com/aio-libs/aiohttp/issues/2610
Here is the same code with requests:
import requests
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])
This actually produces the correct output on both my laptop and on us-west-2!
200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08'
This is most likely because requests implemented https://github.com/request/request/pull/1184, while the equivalent bug with aiohttp is still open.
This is amazing news, as this means that fixing https://github.com/aio-libs/aiohttp/issues/2610 should get fsspec to work, which means most of the pangeo stack would work after that. It will still have lower performance than using s3 directly when in us-west-2, so work there still needs to be done. But this will at least make sure regular https URLs work when both inside and outside us-west-2
aiohttp has documented this should not be the case, based on the note here: https://docs.aiohttp.org/en/stable/client_advanced.html?highlight=redirects#custom-request-headers
I also looked at the request being made by aiohttp, and see the following:
RequestInfo(url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf6'), method='GET', headers=<CIMultiDictProxy('Host': 'nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Python/3.9 aiohttp/3.8.3', 'Authorization': 'Basic <removed>')>, real_url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf
'))
So I think this confirms that the Authorization header is being retained during redirects.
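To make that conflict easy to spot in other clients, here is a minimal sketch of a check for it. It assumes (based only on the S3 error message and the RequestInfo dump above) that a URL counts as presigned when its query string has X-Amz-Algorithm or X-Amz-Signature. The function names are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters that mark a URL as already signed. This set is an
# assumption taken from the S3 error message and the URL dumped above.
SIGNING_PARAMS = {"X-Amz-Algorithm", "X-Amz-Signature"}


def is_presigned(url: str) -> bool:
    """Return True if the URL carries its own S3 signature in the query string."""
    query = parse_qs(urlsplit(url).query)
    return any(param in query for param in SIGNING_PARAMS)


def has_conflicting_auth(url: str, headers: dict) -> bool:
    """Return True when a presigned URL would also be sent an Authorization header."""
    header_names = {name.lower() for name in headers}
    return is_presigned(url) and "authorization" in header_names
```

Running has_conflicting_auth on the URL and headers from a RequestInfo like the one above should flag exactly the case S3 rejects with "Only one auth mechanism allowed".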
I've now found @betolink's comment in https://github.com/aio-libs/aiohttp/issues/5783#issuecomment-981958210, which made me realize that what we want is for the credentials to be forwarded when we are redirected to earthdata login, but then dropped. What we are getting instead is the credentials being sent to everything
AHA, so what's actually happening is that we are setting the basic auth on the session, rather than on the request. So it's being sent to every request from the session, including S3! This actually now is unrelated to the aiohttp bug
If I move the auth= to just the request, I get a basic 401 denied, as the Basic auth is dropped during the redirect, which is correct and documented aiohttp behavior.
So the question now really is why does requests work?
Separately, it should be possible for us to subclass aiohttp's ClientSession to pass per-host basicauth so it can provide appropriate auth to different hosts in the chain, and just send basic auth to earthdata.
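A rough sketch of that per-host idea, written as a helper that follows redirects by hand rather than as a real ClientSession subclass. The names auth_for and get_with_per_host_auth and the host_auth mapping are hypothetical, and this is untested against EDL itself. It only assumes the session object has an awaitable get() that accepts auth= and allow_redirects=, as aiohttp's ClientSession does.

```python
from typing import Any, Mapping, Optional
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def auth_for(url: str, host_auth: Mapping[str, Any]) -> Optional[Any]:
    """Return the credentials registered for this URL's host, or None."""
    return host_auth.get(urlsplit(url).hostname)


async def get_with_per_host_auth(session, url, host_auth, max_redirects=10):
    """GET a URL, following redirects manually and choosing auth per hop.

    `session` is assumed to behave like aiohttp.ClientSession, and
    `host_auth` maps hostnames to auth objects (e.g. aiohttp.BasicAuth).
    """
    for _ in range(max_redirects + 1):
        # Disable automatic redirects so each hop gets only its own credentials.
        response = await session.get(
            url, auth=auth_for(url, host_auth), allow_redirects=False
        )
        location = response.headers.get("Location")
        if response.status not in REDIRECT_STATUSES or location is None:
            return response
        response.release()
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_redirects} redirects")
```

With host_auth = {"urs.earthdata.nasa.gov": auth}, only the hop to earthdata login would carry Basic auth, and the final hop to S3 would carry none.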
ok, so I have discovered why it works with requests but not with aiohttp.
It is because requests supports netrc lol!
So at the first redirect, requests drops the Authorization header, but when making the request to EDL, it reads netrc file directly and sends the appropriate credentials! So that is why it works by default with requests, and not with aiohttp.
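The lookup itself is small. Here is a sketch of the same idea using only the standard library. netrc_credentials is a made-up helper name, not anything from requests, and requests' own implementation differs in its details.

```python
import netrc
from typing import Optional, Tuple
from urllib.parse import urlsplit


def netrc_credentials(
    url: str, netrc_path: Optional[str] = None
) -> Optional[Tuple[str, str]]:
    """Return (login, password) for the URL's host from a netrc file, or None."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    try:
        # netrc_path=None makes the netrc module read ~/.netrc
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return None
    if entry is None:
        return None
    login, _account, password = entry
    return login, password
```

Because the lookup is keyed on the host of each URL in the redirect chain, credentials for urs.earthdata.nasa.gov are only ever found for requests to that host, and the S3 hop gets nothing.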
So to summarize, the current problem is that we pass parameters to fsspec that are set at the ClientSession level, and those are sent with every request. So the Authorization header is also sent when making the request to S3, and it fails. This is validated with the following code:
import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
auth = aiohttp.BasicAuth(username, password)
async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, auth=auth) as response:
            print(response.status)
            print((await response.read())[:30])
asyncio.run(main())
This actually will fail with an HTTP Basic request denied error anywhere, which makes sense - the Authorization header is dropped at the first redirect to EDL, and then we get an access denied.
If I recreate this with requests by deleting my netrc file:
import requests
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username = "yuvipanda"
password = "mypassword"
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])
I get the exact same behavior.
WHICH IS GREAT! So the problem now isn't to do with redirects at all, it is really - how do we make sure to send the HTTP Basic Creds just to EDL? Because right now, the reason this works with non-cloud datasets is that we are actually leaking plaintext EDL creds to all of them, completely negating the point of OAuth2 :D
I see trust_env passed along to the aiohttp session, but aiohttp only uses this for proxies, not for authenticating to servers themselves.
So the current issue is really that aiohttp has no way to say 'for this domain, send this authentication information'. requests accidentally provides this with netrc, but otherwise doesn't afaict.
So, netrc support is actually the easiest way to make sure that we can send specific Basic Auth credentials only to specific Hosts. So I made this PR adding it to aiohttp! https://github.com/aio-libs/aiohttp/pull/7131
If merged and released, this should sort of automatically make fsspec work again.
Amazing work @yuvipanda! I'm just catching up with this thread. One thing I'd like to mention is that, if possible, it would be preferable to have a solution/workaround that does not rely on having a .netrc (even though that is what we have been doing for the tutorials).
@betolink so I think these tokens (https://urs.earthdata.nasa.gov/documentation/for_users/user_token) should get rid of the need for netrc completely. I have no idea why people are restricted to just two tokens per user - that makes it definitely harder to use :(
I dug some more into what fsspec would need to do for us to use client tokens.
fsspec currently supports a client_kwargs that allows setting headers and other misc options for all requests. This accidentally works now when making requests behind EDL from outside us-west-2, but doesn't work from inside (for all the reasons outlined in this issue). So we can not use the auth tokens with it either.
What we need is something like request_kwargs (that is passed into places like https://github.com/fsspec/filesystem_spec/blob/45de5b509bacf8a62d99848bb2361cc78733ad09/fsspec/implementations/http.py#L242 and everywhere else requests are constructed). This allows these params to be set just for the originating request, but not for any follow-on redirects from there. This wouldn't help when using username / password for EDL (as the username / password needs to be sent for a request along the redirect path, not the originating request), but would work for using tokens (as they must be only sent to the originating request).
I think this is a fairly well scoped and small change to fsspec that would be extremely useful! I'm super swamped though, I am hoping someone else can implement this?
Opened https://github.com/fsspec/filesystem_spec/issues/1142 to discuss what would help solve the issue from fsspec in allowing us to use tokens!
Turns out this already exists in fsspec - any kwargs you pass in actually get passed directly to the requests, exactly what we wanted!
So the following code works for me :)
from fsspec.implementations.http import HTTPFileSystem
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})
with fs.open(url) as f:
    print(f.read()[:30])
yay!
ok, so current summary is:
- https://github.com/aio-libs/aiohttp/pull/7131 adds .netrc support to aiohttp, and hence to fsspec. This is needed for earthdata login access to work consistently in AWS us-west-2 with fsspec the same way it works elsewhere, while using earthdata username / password to login.
- Tokens already work with fsspec - just pass headers as a kwarg as shown in the comment above, rather than as a part of client_kwargs. yay! Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you can not just generate a token for each machine you would use it in, like with a GitHub Personal Access Token. However, the lack of need for specific files means this would also work with dask.
Here is an example of it working with xarray!
from fsspec.implementations.http import HTTPFileSystem
import xarray as xr
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"bearer {token}"
})
ds = xr.open_dataset(fs.open(url))
ds
This is awesome @yuvipanda! I feel like we need to refactor this library to only use CMR tokens everywhere instead of monkey-patching OAuth2 redirects for cloud-hosted data. I wish DAAC hosted data would follow the same behavior with bearer tokens. i.e.
# bearer token for the win with cloud hosted data !!
# url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
# =( bearer token? don't know him.
url = "https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2019.02.21/ATL08_20190221121851_08410203_005_01.h5"
Also, maybe we only need one token even if we use it concurrently from different processes? I haven't tested but I suspect it should work.
@betolink yeah we should only need one token even if it is used concurrently.
So the token only works for some datasets but not all? And works for cloud datasets but not on-prem? Does it work for any on prem thing at all?
I'm afraid it won't work for on-prem data, it may work for some data hosted at the ASF DAAC marked on-prem but actually hosted at AWS.
This is tremendous progress! Now there is a clear path for one of the most common access patterns!
@betolink feels like long term, the right way is to get the access token to work for all data, and support the earthdatalogin folks in this mission. In the meantime, netrc is the more universal solution, once we get the aiohttp PR merged. But that is slightly messy when it comes to dask, because it requires populating a specific file in the dask worker, which is not always easy. Does that sound right?
Me and @briannapagan did another bit of deep dive here, and made some more progress.
There seem to be two primary packages supporting earthdata login on the server side:
- TEA (thin-egress-app)
- the apache2 module (apache-urs-authentication-module)
We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?!
As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named URSFOUR-1600 that also tracks this feature.
With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that through looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying 'some apache change' to help with that. So the hypothesis I had was:
I tested this hypothesis by trying to send a token to https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5 - a dataset hosted by LPDAAC. And behold, it works! So all data hosted by LPDAAC supports tokens :)
So the pathway to using tokens everywhere, including onprem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module.
This is great news for many reasons:
Also, passing -v to curl will show you the response headers, which usually contain < Server: Apache to indicate they are using the apache2 server - and hence most likely 'on-prem' (aka not coming from S3)
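The same heuristic is easy to apply in code once you have the response headers from whatever client you use. A tiny sketch, where looks_on_prem is a hypothetical name and the rule is only the rough "Server: Apache means on-prem" guess described above:

```python
from typing import Mapping


def looks_on_prem(response_headers: Mapping[str, str]) -> bool:
    """Guess whether a response came from an Apache server rather than S3.

    This is only a heuristic: it checks the Server header and nothing else.
    """
    for name, value in response_headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == "server":
            return "apache" in value.lower()
    return False
```

Responses served straight from S3 report Server: AmazonS3, so they come back False here, as do responses with no Server header at all.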
NSIDC also seems to have the latest version of the apache module - https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL06.005/2020.03.08/ATL06_20200308234154_11190602_005_01.h5 works with the token!
So looks like some (many?) DAACs have this deployed, and some don't.
@betolink in fact, the exact URL you used to test tokens earlier in https://github.com/nsidc/earthaccess/issues/188#issuecomment-1364042546 works now. My suspicion is that NSIDC deployed the latest version of the apache2 module very recently?
ASDC also supports tokens, as tested with https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_VFM-Standard-V4-20/2010/09/CAL_LID_L2_VFM-Standard-V4-20.2010-09-01T00-14-43ZN.hdf.
Again, I'm using the presence of Server: apache to distinguish on-prem vs S3 hosted data. I think it's reasonably accurate.
ORNL also supports it, as tested via https://daac.ornl.gov/daacdata/deltax/DeltaX_Ecogeomorphic_Products/data/DeltaX_EcoGeoCells_2021_TerrebonneEast_std_superpixels.tif.
Note that uppercase Bearer is what I'm using, as that's what the apache module supports (see line 684 in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187#mod_auth_urs.c).
PODAAC (tested with https://podaac-tools.jpl.nasa.gov/drive/files/allData/topex/L1B/altsdr/001/altsdr001052.txt) and SEDAC (tested with https://sedac.ciesin.columbia.edu/downloads/data/urbanspatial/urbanspatial-urban-land-backscatter-time-series-1993-2020/urbanspatial-urban-land-backscatter-time-series-1993-2020-seasonal-urban-netcdf.zip) don't have the latest either.
This issue has spawned off many different things, so here's a quick summary:
Currently, it is not possible to use .netrc files with xarray if you are running from inside us-west-2. So if you are inside us-west-2 and want to access cloud hosted data with xarray, you must use S3 (not plain HTTPS). Once https://github.com/aio-libs/aiohttp/pull/7131 lands and a new release of aiohttp is made, this issue will go away. So code that uses HTTPS+netrc will universally work, regardless of it being in us-west-2 or elsewhere.
For cloud access, EDL tokens already work with xarray (https://github.com/nsidc/earthaccess/issues/188#issuecomment-1363450269 has an example). However, it doesn't work universally - many on-prem servers don't support EDL tokens yet, although some do. Me (from outside NASA) and @briannapagan (from inside) are pushing on this, getting EDL token support rolled out more universally. If you are inside a DAAC, we could use your help!
s3:// links only work from inside us-west-2, so we should have clear documentation on when users should use the s3:// protocol vs just https. From inside us-west-2, there could be a performance difference between these two, but my personal intuition is that it is not significant enough for many use cases, especially beginner cases. This is the part that the least amount of work has been done on so far. We would need some test cases testing s3 vs https from inside us-west-2 to establish this performance difference.
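As a starting point for those test cases, here is a minimal timing harness. It is only a sketch: time_calls and compare are hypothetical names, and the callables you pass in (for example one that opens a granule over s3:// and one over https://) are up to you.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call fn `repeats` times and return each wall-clock duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def compare(
    openers: Dict[str, Callable[[], object]], repeats: int = 5
) -> Dict[str, float]:
    """Return the median duration for each labelled callable."""
    return {
        label: statistics.median(time_calls(fn, repeats))
        for label, fn in openers.items()
    }
```

Using the median rather than the mean keeps one slow first request (cold caches, token exchange) from dominating the comparison. The results would only be meaningful when run from inside us-west-2.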
My intuitive end goal here is to be able to tell people to 'use HTTPS links with token auth' universally, regardless of where they are accessing the data from, with an addendum suggesting using the s3:// protocol under specific performance circumstances. A step along the way is to be able to tell people to use HTTPS links with netrc universally.
An addendum to https://github.com/nsidc/earthaccess/issues/188#issuecomment-1371626230 that me and @briannapagan discovered is OpenDAP, offered mostly by the Hyrax server. It also uses the apache module for authentication (https://opendap.github.io/hyrax_guide/Master_Hyrax_Guide.html#_earthdata_login_oauth2), regardless of whether it is on-prem or on the cloud. So my understanding is that all opendap behind earthdata is using the apache module, so those would also need the module updated to support the token. This also means that the apache module is going to be with us for a long time, not just for on-prem work, as it is used for cloud hosted opendap too.
Writing this as a reminder: apache requires a capital B in Bearer, which matters for on-premise files. This also works for cloud files, so we should use the following:
curl -H "Authorization: Bearer TOKEN" -L --url 'URL' > out
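For Python clients, the same rule can live in one small helper so the capitalisation is never retyped by hand. This is a sketch only, and edl_bearer_headers is a made-up name.

```python
from typing import Dict


def edl_bearer_headers(token: str) -> Dict[str, str]:
    """Build the Authorization header for an Earthdata Login token."""
    token = token.strip()
    if not token:
        raise ValueError("an Earthdata Login token is required")
    # Capital "B": the apache module only accepts "Bearer", and cloud-hosted
    # endpoints accept it too, so one spelling covers both.
    return {"Authorization": f"Bearer {token}"}
```

The returned dict can be passed as headers= to HTTPFileSystem, as in the fsspec example earlier in this thread.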
May not be the right thread, but dropping a note here so it's more permanent than Slack:
A while ago, I stumbled across these Twitter threads about some climate data stored in Zarr on OpenStorageNetwork's S3 buckets with HTTP URLs. The example they show accesses Zarr directly via an HTTP URL.
https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ
Here's a direct link to the corresponding Pangeo feedstock (in case Twitter dies): https://pangeo-forge.org/dashboard/feedstock/79
From what I can tell, the underlying storage here is OpenStorageNetwork, which provides the S3 API via Ceph. How exactly all of this is wired and optimized is a bit beyond me, but the end result is compelling and may have some interesting lessons for how we do S3/HTTP.
Bringing in @cisaacstern to maybe provide some extra feedback to Alexey's last message.
Happy to contribute however I can! We do currently use an OSN allocation as our default storage target for Pangeo Forge.
Do I have to be in the AWS us-west-2 region to access data direct from S3?
Seems so, as my notebook that doesn't work from my laptop does work when running in an EC2 instance in Oregon...
Hi Alex,
yes, you must have an EC2 instance running in the same region as the S3 bucket (us-west-2 for NASA data) to "Directly Access" the data.
Andy Barrett
Ok, I got it working using @yuvipanda's code above.
Needs a little fix in a related project, which I've raised as a PR: https://github.com/nasa/EMIT-Data-Resources/pull/24
@alexgleith just FYI, there are a few catches when we access HTTPS:// instead of S3://:
Finally, if we are running our code in us-west-2, we can use S3FS with the S3:// urls and we can use earthaccess to get us the authenticated sessions if we know the DAAC.
import earthaccess
import xarray as xr
earthaccess.login()
url = "s3://some_nasa_dataset"
fs = earthaccess.get_s3fs_session("LPDAAC")
# we open our granule in a s3fs context and we work as usual
with fs.open(url) as file:
    dataset = xr.open_dataset(file)
Thanks @betolink
For the work I'm doing, it's exploratory so performance isn't important yet. And I don't think that for the NetCDF files chunking matters, since they're not optimised for it. (Happy to be corrected there!)
I'm just doing a little project on the EMIT data and there's enough complexity in the data itself that I'm happy with the HTTPS loading process. Thanks for your help!
- Speed: when we use HTTPS we are going through NASA's CloudFront proxy and opening a dataset could be slower than using the S3:// schema URLs. This is why earthaccess (this library) picks the right access pattern depending on where the code is running (us-west-2 or not).
Just linking some benchmarks from @hrodmn comparing s3:// and https:// access for a year's worth of Harmonized Landsat Sentinel-2 (HLS) data from LP-DAAC on us-west-2 at https://hrodmn.dev/posts/nasa-s3/index.html. There's about a 0.25 seconds speed advantage (8.08s with s3, 7.74s with https) which is fairly small, but if earthaccess can handle switching between s3/https based on the compute region, that would be awesome!
This is great @weiji14! Just this week @yuvipanda and I were talking about this and the pros and cons of defaulting to HTTPS. earthaccess handles the switch already: if it's running in AWS it will use S3, and HTTPS if not. It does it by requesting the instance metadata on an IP range only available inside AWS (although it does not check the region yet): https://github.com/nsidc/earthaccess/blob/54b688b906776f5c845483dd00676f6c681feb10/earthaccess/store.py#LL67C36-L67C36
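For the missing region check, something along these lines might work. To be clear, this is a sketch and not what earthaccess does today: current_aws_region and pick_protocol are hypothetical names, and it assumes the EC2 instance metadata service (IMDSv2) is reachable from the code, which may not hold inside some containers.

```python
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def current_aws_region(timeout: float = 1.0) -> Optional[str]:
    """Return the EC2 region from instance metadata, or None outside EC2."""
    try:
        # IMDSv2: fetch a short-lived session token first.
        token_request = urllib.request.Request(
            f"{IMDS}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with urllib.request.urlopen(token_request, timeout=timeout) as resp:
            token = resp.read().decode()
        region_request = urllib.request.Request(
            f"{IMDS}/meta-data/placement/region",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(region_request, timeout=timeout) as resp:
            return resp.read().decode().strip()
    except OSError:
        # Timeouts and connection errors mean we are most likely not on EC2.
        return None


def pick_protocol(region: Optional[str], bucket_region: str = "us-west-2") -> str:
    """Use s3:// only when running in the bucket's own region, else https://."""
    return "s3" if region == bucket_region else "https"
```

pick_protocol(current_aws_region()) would then fall back to https both outside AWS and inside AWS but in a different region, which is the case the current check does not cover.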
if we request data like
granules = earthaccess.search_data(...)
ds = xr.open_mfdataset(earthaccess.open(granules))
and run this code in AWS, it will use the S3 links and S3FS to open them. On a related issue... I still notice a lot of latency when we try to open files even in region (compared to just downloading them to our EC2 instance), something that needs to be further documented. In this example with stack_stac I'm not sure if under the hood they use S3FS or not.
Just to note, aiohttp finally made a release! So fsspec now supports netrc correctly!
Sweet! Should we add a pin for aiohttp and mark this resolved @betolink ?
@yuvipanda @betolink Can this be fully closed out now? Do we still need to pin this? We weren't seeing this referenced in https://github.com/nsidc/earthaccess/blob/main/pyproject.toml
Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.
However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent alongside (see https://github.com/nsidc/earthaccess/issues/187 for more vague info).
We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.