scottyhq opened this issue 4 years ago
Since there will be lots of valuable data like this that is not in a cloud-optimized data store or format, I think it makes sense to have a download() method that can pick up ~/.netrc credentials (equivalent to wget https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc). For this particular example, it is then up to the user to load the local file into xarray:
import xarray as xr

localFile = 'S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'
# the data grids live in a netCDF group, so point open_dataset at it
da = xr.open_dataset(localFile, group='/science/grids/data')
da
thoughts @matthewhanson @jhamman @apawloski @martindurant ?
What is contained in the .netrc file? Is it user/password for the HTTP call?
In general, you can use fsspec.open_local with a URL that includes a caching protocol (or a local path) and get an experience on par with other fsspec operations. Parallel downloading of multiple files should not be far off either.
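For illustration, here is a minimal sketch of that caching-URL pattern (authentication aside); the simplecache:: prefix is just one example of a caching protocol that keeps a local copy, and open_local returns its path on disk:

import fsspec
import xarray as xr

# sketch only: simplecache:: caches the remote file locally before opening
url = 'simplecache::https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'
localFile = fsspec.open_local(url)
ds = xr.open_dataset(localFile, group='/science/grids/data')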
What is contained in the .netrc file? Is it user/password for the HTTP call?
cat ~/.netrc looks like this:
machine urs.earthdata.nasa.gov login MYUSERNAME password MYPASSWORD
It looks like the requests library automatically picks up this file (the code block below works). There is even a standard library module for reading it (https://docs.python.org/3/library/netrc.html)! But I'm unsure how to get fsspec to read it, or how to pass the username and password to HTTPFileSystem:
import requests

url = item['data'].urlpath
with open('test.nc', 'wb') as f:
    resp = requests.get(url)  # requests reads ~/.netrc automatically
    f.write(resp.content)
fsspec uses aiohttp, not requests, so maybe that's why it's not getting picked up automatically. In this case, it should work like
import netrc
import fsspec

(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")
of = fsspec.open(url, "rb", auth=(username, password))
with of as f:
    ...
or
fs = fsspec.filesystem("http") # can include auth here for all URLs, or specify with open
f = fs.open(url, "rb", auth=(username, password))
Actually, after a little reading, it seems that aiohttp does support this if the client is passed trust_env=True (see https://github.com/aio-libs/aiohttp/pull/2584), but there is no way to get this arg to the client in fsspec right now. It would be easy to add (client_kwargs=None, for example, as done for s3fs), if someone is willing.
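For concreteness, a sketch of what the user-facing call might look like if such a client_kwargs argument were added (hypothetical at this point in the thread; the PR linked further down is what eventually adds it):

import fsspec

# hypothetical usage, assuming client_kwargs is forwarded to aiohttp.ClientSession;
# trust_env=True asks aiohttp to consult proxy environment variables and ~/.netrc
fs = fsspec.filesystem("http", client_kwargs={"trust_env": True})
with fs.open(url, "rb") as f:
    data = f.read()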
related PR over in sat-stac https://github.com/sat-utils/sat-stac/pull/62
Hi @martindurant - after trying a few other approaches to see how this works behind the scenes I'm a bit confused.
The following code works using aiohttp directly:
import aiohttp

url = item['data'].urlpath
auth = aiohttp.BasicAuth(username, password)
async with aiohttp.ClientSession(auth=auth) as session:
    async with session.get(url) as resp:
        print(resp.status)
        with open('local.nc', 'wb') as f:
            f.write(await resp.read())
I can't seem to get the ~/.netrc picked up. Reading the PR you linked to and the docs, maybe there is a separate workflow dealing with proxies that this gets into, because the following returns 401 Unauthorized Basic realm="Please enter your Earthdata Login credentials":
async with aiohttp.ClientSession(trust_env=True) as session:
    async with session.get(url) as resp:
        print(await resp.text())
        with open('local.nc', 'wb') as f:
            f.write(await resp.read())
If I use fsspec as you suggested with the following, I get a FileNotFoundError:
(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")
auth = (username, password)
of = fsspec.open(url, "rb", auth=(username, password))
with of as remote:
    with open('local.nc', 'wb') as local:
        local.write(remote.read())
Finally, I thought this might work, but I get a ClientResponseError
fs = fsspec.filesystem("http", auth=aiohttp.BasicAuth(username, password))
with fs.open(url, "rb") as remote:
    with open('local.nc', 'wb') as local:
        local.write(remote.read())
Interestingly, for the last case the traceback provides a link that downloads the file if I click on it in my browser!?
ClientResponseError: 401, message='Unauthorized', url=URL('https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=iwntGSgHy9yoog7Mjag0dQ&response_type=code&redirect_uri=https://grfn.asf.alaska.edu/door/oauth&state=aHR0cDovL2dyZm4uYXNmLmFsYXNrYS5lZHUvZG9vci9kb3dubG9hZC9TMS1HVU5XLUEtUi0wODctdG9wcy0yMDE0MTAyM18yMDE0MTAxMS0xNTM4NTYtMjc1NDVOXzI1NDY0Ti1QUC0xYTFhLXYyXzBfMi5uYw')
Could you please advise on how to use fsspec directly? And where would be best to implement the reading of credentials from ~/.netrc (intake, fsspec, aiohttp, intake-stac?) so that a user doesn't have to write code to load them?
And where would be best to implement the reading of credentials
The HttpFileSystem ought to have an option, so that you can pass the trust_env parameter - although it seems maybe that isn't working for you. I've never heard of .netrc before, but it doesn't sound stac-specific. If we can't get aiohttp to find and use it automatically, then fsspec would be the place to handle it.
Is there any chance you can share some creds privately so that I can test what works?
Thanks for your help @martindurant!
There are definitely two things to figure out: 1) how to correctly pass username and password explicitly to HTTPFileSystem (the last code block seems close!), and 2) getting the netrc read correctly behind the scenes.
I can send you creds via keybase or however you prefer. It's also easy to register (https://urs.earthdata.nasa.gov/home); this is NASA's standard login, which anyone can sign up for with some basic info.
OK, I can sign up - but I won't get to this until next week now.
It turns out, if you manually follow the redirect - i.e., apply the auth again to the generated URL - you can get the file. I feel like I'm getting somewhere.
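A rough sketch of what following the redirect manually could look like with aiohttp, assuming each hop returns a Location header and the same BasicAuth is re-applied on every request (illustrative only, not the change that ended up in fsspec):

import aiohttp

async def fetch_with_manual_redirects(url, auth, max_hops=5):
    # walk the redirect chain by hand, re-applying auth at each hop
    async with aiohttp.ClientSession(auth=auth) as session:
        for _ in range(max_hops):
            async with session.get(url, allow_redirects=False) as resp:
                if resp.status in (301, 302, 303, 307, 308):
                    url = resp.headers['Location']
                    continue
                resp.raise_for_status()
                return await resp.read()
    raise RuntimeError('too many redirects')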
With https://github.com/intake/filesystem_spec/pull/400 , you can do
fs = fsspec.filesystem('http', client_kwargs={'auth': aiohttp.BasicAuth('mdurant', 'xx')})
with fs.open(url) as f:
    f.read()
I don't know why passing it in the open kwargs or putting it in .netrc isn't working, even with trust_env=True.
Thanks @martindurant !
I don't know why passing in the open kwargs or putting in .netrc isn't working, even with trust_env=True
There definitely is something odd with how aiohttp handles the netrc auth. Short of opening an issue upstream, I'm wondering if in fsspec we could have an option that generates the aiohttp.BasicAuth from a netrc. For example: fs = fsspec.filesystem('http', netrc_auth="urs.earthdata.nasa.gov")
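In the meantime, a rough sketch of doing the same thing by hand: read ~/.netrc with the standard library and pass the resulting auth through client_kwargs (the netrc_auth= argument itself is only a proposal, not an existing option, and the helper name below is made up):

import netrc

import aiohttp
import fsspec

def http_fs_from_netrc(machine='urs.earthdata.nasa.gov'):
    # look up credentials for the given machine in ~/.netrc and build the
    # BasicAuth that a netrc_auth= option might construct internally
    username, _, password = netrc.netrc().authenticators(machine)
    return fsspec.filesystem(
        'http', client_kwargs={'auth': aiohttp.BasicAuth(username, password)}
    )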
I'm still unclear about how to get this into intake-stac as well. It seems like some sort of auth arguments need to be accepted here https://github.com/intake/intake-stac/blob/0fcde70ea04ac96b5909e027bd6f513064fbf042/intake_stac/catalog.py#L15 and passed down the chain. For example:
from intake import open_stac_catalog
catalog_url = 'https://raw.githubusercontent.com/cholmes/sample-stac/master/stac/catalog.json'
cat = open_stac_catalog(catalog_url, netrc_auth="urs.earthdata.nasa.gov")
Such that whenever a user opens a file, the auth settings are in place:
item = catalog['myitem']
da = item['data'].to_dask()
Seems like it needs to migrate to this line, where we know the URL and can do the login lookup. That should be the default, but probably the user should be able to override it.
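A minimal sketch of what that lookup could look like once the URL is known (hypothetical helper; note that for Earthdata the .netrc machine, urs.earthdata.nasa.gov, differs from the data host, so the machine to look up probably needs to be overridable):

import netrc
from urllib.parse import urlparse

import aiohttp

def storage_options_for(url, fallback_machine='urs.earthdata.nasa.gov'):
    # try the URL's own host first, then the fallback machine from ~/.netrc
    rc = netrc.netrc()
    creds = rc.authenticators(urlparse(url).hostname) or rc.authenticators(fallback_machine)
    if creds is None:
        return {}
    username, _, password = creds
    return {'client_kwargs': {'auth': aiohttp.BasicAuth(username, password)}}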
The .netrc authentication steps in the "Remote NetCDF + Authentication" section of this example worked for me: https://github.com/intake/intake-stac/blob/master/examples/intake-cmr-stac.ipynb
As part of STAC-sprint 6 I was trying out intake-stac with https://github.com/nasa/cmr-stac. It would be absolutely amazing to integrate intake-stac with that endpoint to facilitate working with NASA datasets! But there are multiple things to work out. First and foremost is how to deal with authentication.
Unlike boto3 cloud credentials, NASA uses an 'Earthdata login' (https://urs.earthdata.nasa.gov/documentation). Typically, science users keep their username and password in a ~/.netrc file, which is read any time they try to retrieve a file. This mechanism doesn't currently work with the intake-stac .to_dask() method; for example, opening one of these items leads to a big traceback:
Full example here: https://gist.github.com/scottyhq/04fe1e2d0b946b97228f6922cf001bbd