intake / intake-stac

Intake interface to STAC data catalogs
https://intake-stac.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License

Intake-STAC with NASA CMR STAC proxy: Authentication #60

Open scottyhq opened 4 years ago

scottyhq commented 4 years ago

As part of STAC-sprint 6 I was trying out intake-stac with https://github.com/nasa/cmr-stac. It would be absolutely amazing to integrate intake-stac with that endpoint to facilitate working with NASA datasets! But there are multiple things to work out. First and foremost is how to deal with authentication.

Unlike boto3 cloud credentials, NASA uses an 'Earthdata login' (https://urs.earthdata.nasa.gov/documentation). Typically, science users keep their username and password in a ~/.netrc file so the credentials are picked up any time a file is retrieved. This mechanism doesn't currently work with the intake-stac .to_dask() method. For example:

item['data'].metadata
#{'href': 'https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'}
da = item['data'].to_dask()

Leads to a big traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    197             try:
--> 198                 file = self._cache[self._key]
    199             except KeyError:

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-15-90d7a2a112b8> in <module>
----> 1 da = item['data'].to_dask()

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     56             _open_dataset = xr.open_dataset
     57 
---> 58         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)
     59 
     60     def _add_path_to_ds(self, ds):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    507         if engine == "netcdf4":
    508             store = backends.NetCDF4DataStore.open(
--> 509                 filename_or_obj, group=group, lock=lock, **backend_kwargs
    510             )
    511         elif engine == "scipy":

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    356             netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    357         )
--> 358         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
    359 
    360     def _acquire(self, needs_lock=True):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in __init__(self, manager, group, mode, lock, autoclose)
    312         self._group = group
    313         self._mode = mode
--> 314         self.format = self.ds.data_model
    315         self._filename = self.ds.filepath()
    316         self.is_remote = is_remote_uri(self._filename)

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in ds(self)
    365     @property
    366     def ds(self):
--> 367         return self._acquire()
    368 
    369     def open_store_variable(self, name, var):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in _acquire(self, needs_lock)
    359 
    360     def _acquire(self, needs_lock=True):
--> 361         with self._manager.acquire_context(needs_lock) as root:
    362             ds = _nc4_require_group(root, self._group, self._mode)
    363         return ds

~/miniconda3/envs/intake-stac-gui/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
    184     def acquire_context(self, needs_lock=True):
    185         """Context manager for acquiring a file."""
--> 186         file, cached = self._acquire_with_cache_info(needs_lock)
    187         try:
    188             yield file

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    202                     kwargs = kwargs.copy()
    203                     kwargs["mode"] = self._mode
--> 204                 file = self._opener(*self._args, **kwargs)
    205                 if self._mode == "w":
    206                     # ensure file doesn't get overriden when opened again

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -78] NetCDF: Authorization failure: b'https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'

Full example here: https://gist.github.com/scottyhq/04fe1e2d0b946b97228f6922cf001bbd

scottyhq commented 4 years ago

Since there will be lots of valuable data like this that is not stored in a cloud-optimized format, I think it makes sense to have a download() method that can pick up ~/.netrc (equivalent to wget https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc). For this particular example, it is then up to the user to load the local file into xarray (a rough sketch of such a helper follows the xarray snippet below):

import xarray as xr
localFile = 'S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'
da = xr.open_dataset(localFile,
                     group='/science/grids/data')
da
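
For illustration, a rough sketch of what such a download() helper could look like, assuming (as in the requests snippet further down) that requests picks up ~/.netrc automatically; the function name and signature are hypothetical:

import os
import requests

def download(url, out_dir='.'):
    # Stream a remote asset to a local file, relying on requests to read
    # Earthdata credentials from ~/.netrc when none are passed explicitly.
    local_path = os.path.join(out_dir, os.path.basename(url))
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(local_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
    return local_path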

thoughts @matthewhanson @jhamman @apawloski @martindurant ?

martindurant commented 4 years ago

What is contained in the .netrc file? Is it user/password for the HTTP call?

In general, you can use fsspec.open_local with a URL that includes a caching protocol (or a local path) and get an experience on par with other fsspec operations. Parallel downloading of multiple files should not be far off either.
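
To sketch that idea (the URL and cache directory here are just placeholders), chaining a caching protocol onto the HTTP URL yields a local path that can then be handed to xarray or netCDF4:

import fsspec

# 'simplecache::' downloads the remote file into local cache storage first;
# open_local returns the path of the cached copy.
local_path = fsspec.open_local(
    'simplecache::https://example.com/somefile.nc',
    simplecache={'cache_storage': '/tmp/fsspec-cache'},
)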

scottyhq commented 4 years ago

What is contained in the .netrc file? Is it user/password for the HTTP call?

cat ~/.netrc looks like this:

machine urs.earthdata.nasa.gov login MYUSERNAME  password MYPASSWORD

It looks like the requests library automatically picks up this file (the code block below works). There is even a standard library module for reading it (https://docs.python.org/3/library/netrc.html)! But I'm unsure how to get fsspec to read it / pass the username and password to HTTPFileSystem.

import requests

url = item['data'].urlpath
with open('test.nc', 'wb') as f:
    resp = requests.get(url)
    f.write(resp.content)

martindurant commented 4 years ago

fsspec uses aiohttp, not requests, so maybe that's why it's not getting picked up automatically. In this case, it should work like

import netrc
import fsspec

(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")
of = fsspec.open(url, "rb", auth=(username, password))
with of as f:
    ...

or

fs = fsspec.filesystem("http")  # can include auth here for all URLs, or specify with open
f = fs.open(url, "rb", auth=(username, password))

Actually, after a little reading, it seems that aiohttp does support this, if the client is passed trust_env=True (see https://github.com/aio-libs/aiohttp/pull/2584 ), but there is no way to get this arg to the client in fsspec right now. It would be easy to add (client_kwargs=None, for example, as done for s3fs), if someone is willing.

scottyhq commented 4 years ago

related PR over in sat-stac https://github.com/sat-utils/sat-stac/pull/62

scottyhq commented 4 years ago

Hi @martindurant - after trying a few other approaches to see how this works behind the scenes, I'm a bit confused.

The following code works using aiohttp directly:

import netrc
import aiohttp

# username/password read from ~/.netrc as before
(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")

url = item['data'].urlpath
auth = aiohttp.BasicAuth(username, password)

async with aiohttp.ClientSession(auth=auth) as session:
    async with session.get(url) as resp:
        print(resp.status)
        with open('local.nc', 'wb') as f:
            f.write(await resp.read())

I can't seem to get the ~/.netrc picked up automatically. Reading the PR you linked to and the docs, maybe this gets into a separate workflow dealing with proxies, because the following returns 401 Unauthorized (Basic realm="Please enter your Earthdata Login credentials"):

async with aiohttp.ClientSession(trust_env=True) as session:
    async with session.get(url) as resp:
        print(resp.status)
        with open('local.nc', 'wb') as f:
            f.write(await resp.read())

If I use fsspec as you suggested with the following, I get a FileNotFoundError:

import netrc
import fsspec

(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")
of = fsspec.open(url, "rb", auth=(username, password))
with of as remote:
    with open('local.nc', 'wb') as local:
        local.write(remote.read())

Finally, I thought this might work, but I get a ClientResponseError:

fs = fsspec.filesystem("http", auth=aiohttp.BasicAuth(username,password))
with fs.open(url, "rb") as remote:
    with open('local.nc', 'wb') as local:
        local.write(remote.read())   

Interestingly, for the last case the traceback provides a link that, if I click it in my browser, downloads the file!?

ClientResponseError: 401, message='Unauthorized', url=URL('https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=iwntGSgHy9yoog7Mjag0dQ&response_type=code&redirect_uri=https://grfn.asf.alaska.edu/door/oauth&state=aHR0cDovL2dyZm4uYXNmLmFsYXNrYS5lZHUvZG9vci9kb3dubG9hZC9TMS1HVU5XLUEtUi0wODctdG9wcy0yMDE0MTAyM18yMDE0MTAxMS0xNTM4NTYtMjc1NDVOXzI1NDY0Ti1QUC0xYTFhLXYyXzBfMi5uYw')

Could you please advise on how to use fsspec directly? And where would be the best place to implement reading credentials from ~/.netrc (intake, fsspec, aiohttp, intake-stac?) so that a user doesn't have to write code to load them?

martindurant commented 4 years ago

And where would be best to implement the reading of credentials

The HTTPFileSystem ought to have an option so that you can pass the trust_env parameter - although it seems that maybe isn't working for you. I've never heard of .netrc before, but it doesn't sound STAC-specific. If we can't get aiohttp to find and use it automatically, then fsspec would be the place to handle it.

Is there any chance you can share some creds privately so that I can test what works?

scottyhq commented 4 years ago

Thanks for your help @martindurant!

There are definitely two things to figure out: 1) how to correctly pass the username and password explicitly to HTTPFileSystem (the last code block seems close!), and 2) getting the netrc read correctly behind the scenes.

I can send you creds via Keybase or however you prefer. It's also easy to register (https://urs.earthdata.nasa.gov/home); this is NASA's standard login, which anyone can sign up for with some basic info.

martindurant commented 4 years ago

OK, I can sign up - but I won't get to this until next week now.

martindurant commented 4 years ago

It turns out that if you manually follow the redirect - i.e., apply the auth again to the generated URL - you can get the file. I feel like I'm getting somewhere.
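
A rough sketch of that manual redirect dance, assuming credentials for urs.earthdata.nasa.gov in ~/.netrc and re-sending BasicAuth on every hop (hop limit and error handling simplified):

import netrc
import aiohttp

async def fetch_following_redirects(url, max_hops=5):
    # Re-apply BasicAuth ourselves on each redirect hop instead of relying
    # on aiohttp to carry the Authorization header across hosts.
    username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
    auth = aiohttp.BasicAuth(username, password)
    async with aiohttp.ClientSession() as session:
        for _ in range(max_hops):
            async with session.get(url, auth=auth, allow_redirects=False) as resp:
                if resp.status in (301, 302, 303, 307, 308):
                    url = resp.headers['Location']
                    continue
                resp.raise_for_status()
                return await resp.read()
    raise RuntimeError('too many redirects')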

martindurant commented 4 years ago

With https://github.com/intake/filesystem_spec/pull/400, you can do:

import aiohttp
import fsspec

fs = fsspec.filesystem('http', client_kwargs={'auth': aiohttp.BasicAuth('mdurant', 'xx')})
with fs.open(url) as f:
    f.read()

I don't know why passing in the open kwargs or putting in .netrc isn't working, even with trust_env=True.

scottyhq commented 4 years ago

Thanks @martindurant !

I don't know why passing in the open kwargs or putting in .netrc isn't working, even with trust_env=True

There is definitely something odd with how aiohttp handles the netrc auth. Short of opening an issue upstream, I'm wondering if fsspec could have an option that generates the aiohttp.BasicAuth from a netrc entry, for example fs = fsspec.filesystem('http', netrc_auth="urs.earthdata.nasa.gov").
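
Roughly, such an option could just do the netrc lookup and feed the result through the client_kwargs path added in filesystem_spec#400; the helper below is purely illustrative, not an existing API:

import netrc
import aiohttp
import fsspec

def http_filesystem_from_netrc(machine):
    # Illustrative only: look the machine up in ~/.netrc and pass the
    # resulting BasicAuth to aiohttp via client_kwargs.
    username, _, password = netrc.netrc().authenticators(machine)
    return fsspec.filesystem(
        'http',
        client_kwargs={'auth': aiohttp.BasicAuth(username, password)},
    )

fs = http_filesystem_from_netrc('urs.earthdata.nasa.gov')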

I'm still unclear about how to get this into intake-stac as well. It seems like some sort of auth argument needs to be accepted here https://github.com/intake/intake-stac/blob/0fcde70ea04ac96b5909e027bd6f513064fbf042/intake_stac/catalog.py#L15 and passed down the chain. For example:

from intake import open_stac_catalog
catalog_url = 'https://raw.githubusercontent.com/cholmes/sample-stac/master/stac/catalog.json'
cat = open_stac_catalog(catalog_url, netrc_auth="urs.earthdata.nasa.gov")

Such that whenever a user opens a file, the auth settings are in place:

item = cat['myitem']
da = item['data'].to_dask()

martindurant commented 4 years ago

Seems like it needs to migrate to this line, where we know the URL and can do the login lookup. That should be the default, but the user should probably be able to override it.
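
As a sketch of that default-with-override idea (hypothetical helper; the Earthdata case would still need a way to map the data host to the urs.earthdata.nasa.gov netrc entry):

import netrc
from urllib.parse import urlparse

import aiohttp

def default_storage_options(url, overrides=None):
    # Hypothetical: once the asset URL is known, try a ~/.netrc lookup on its
    # host and build default auth options, letting any user overrides win.
    options = {}
    try:
        creds = netrc.netrc().authenticators(urlparse(url).netloc)
    except (FileNotFoundError, netrc.NetrcParseError):
        creds = None
    if creds:
        username, _, password = creds
        options['client_kwargs'] = {'auth': aiohttp.BasicAuth(username, password)}
    options.update(overrides or {})
    return options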

carygeo commented 3 years ago

The .netrc authentication steps in the "Remote NetCDF + Authentication" section of this example worked for me: https://github.com/intake/intake-stac/blob/master/examples/intake-cmr-stac.ipynb