Issues with latest satsearch

TomAugspurger commented 4 years ago

A collection of issues with latest versions of satsearch / intake-stac.

1. The search URL needs to be specified. I think https://earth-search.aws.element84.com/v0 is the recommended URL, and you pass it like `results = Search.search(url=api_url, collection='landsat-8-l1', ...)`.
2. If that's the right URL, the search returns many more items (34,675, limited to 10,000), compared to 25 before.
```
There are more items found (34675) than the limit (100) provided.
```

34675 items

3. `eo:bands` isn't present in the GeoDataFrame: `band_info = pd.DataFrame(ast.literal_eval(gf.iloc[0]['eo:bands']))`. I think that it's available in the Item assets?

```python
eo_bands = [items[0].assets[f'B{i}']['eo:bands'] for i in range(1, 12)]
```

so repeat that for each one?

4. The hardcoded scene ID isn't present in the catalog:

sceneid = 'LC80470272019096'

5. The ValueError from https://github.com/intake/intake-stac/issues/64.
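For the `eo:bands` point, here is a minimal sketch of gathering band metadata from one item's assets. The `collect_band_info` helper and the sample `assets` dict are hypothetical stand-ins, and the sketch assumes each `eo:bands` entry is a dict of band properties (older STAC versions may store indices there instead).

```python
def collect_band_info(assets, band_keys):
    """Gather the 'eo:bands' entries for the requested asset keys.

    Each row is tagged with the asset key it came from; assets without
    'eo:bands' (or missing entirely) are skipped.
    """
    rows = []
    for key in band_keys:
        asset = assets.get(key, {})
        for band in asset.get("eo:bands", []):
            rows.append({"asset": key, **band})
    return rows


# Hand-written stand-in for one item's assets; the values are illustrative only.
assets = {
    "B1": {"href": "B1.TIF", "eo:bands": [{"name": "B1", "common_name": "coastal"}]},
    "B2": {"href": "B2.TIF", "eo:bands": [{"name": "B2", "common_name": "blue"}]},
    "thumbnail": {"href": "thumb.jpg"},
}
band_info = collect_band_info(assets, [f"B{i}" for i in range(1, 12)])
print(band_info)
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` to get a table like the `band_info` one the notebook built before.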

scottyhq commented 4 years ago

Thanks for the detailed issue, @TomAugspurger. There have been a lot of changes to sat-search (>0.3) and intake-stac (>0.3) in the last couple of months, with a new version of intake-stac released just yesterday. Long story short, the notebook needs some updating once those new versions are in the environment.

ZihengSun commented 3 years ago

Hi @scottyhq and @TomAugspurger, I was just running the landsat8 notebook on a newly installed Dask cluster on Azure K8S. I used the exact same version of satsearch (==0.2.3), but it still does not run. Here are the details of the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-47751b4b061a> in <module>
     10                'landsat:tier=T1'] 
     11 
---> 12 results = Search.search(collection='landsat-8-l1', 
     13                         bbox=bbox,
     14                         datetime=timeRange,

/srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py in search(cls, **kwargs)
     61             del kwargs['sort']
     62             kwargs['sort'] = sorts
---> 63         return Search(**kwargs)
     64 
     65     def found(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py in __init__(self, **kwargs)
     26         """ Initialize a Search object with parameters """
     27         self.kwargs = kwargs
---> 28         for k in self.kwargs:
     29             if k == 'datetime':
     30                 self.kwargs['time'] = self.kwargs['datetime']

RuntimeError: dictionary keys changed during iteration

Any idea?

TomAugspurger commented 3 years ago

Mmm, I'm not sure. That looks like a bug in satsearch. I believe that development focus is shifting from satsearch to https://github.com/stac-utils/pystac-api-client, but I'm not sure how mature pystac-api-client is yet.

ZihengSun commented 3 years ago

I think this would be a simple fix by changing line 28 to:

for k in list(self.kwargs):

But I cannot access and edit this file /srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py as I am using dummy user on dask jupyterlab.
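To illustrate the proposed fix in isolation, here is a small self-contained sketch. `normalize_search_kwargs` is a hypothetical helper that mirrors the loop shown in the traceback; it is not satsearch's actual code.

```python
def normalize_search_kwargs(kwargs):
    """Rename 'datetime' to 'time' without mutating a dict while iterating it."""
    normalized = dict(kwargs)  # work on a copy so the caller's dict is untouched
    # list(...) snapshots the keys first, so renaming entries inside the loop
    # is safe. Iterating over the dict directly while renaming a key is what
    # raises "RuntimeError: dictionary keys changed during iteration" on the
    # Python 3.8 interpreter shown in the traceback.
    for key in list(normalized):
        if key == "datetime":
            normalized["time"] = normalized.pop("datetime")
    return normalized


params = normalize_search_kwargs(
    {"collection": "landsat-8-l1", "datetime": "2019-01-01/2019-12-31"}
)
print(params)
```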

scottyhq commented 3 years ago

@TomAugspurger @ZihengSun, yes, taking a step back, this example really needs to be updated to use a different L8 dataset. See https://github.com/pangeo-data/landsat-8-tutorial-gallery/issues/8 for some alternatives, including accessing Harmonized Landsat Sentinel-2 (HLS) via NASA's CMR STAC endpoint.

It would be awesome if all public datasets in AWS, Azure, Google had up-to-date STAC metadata and search endpoints, but that is still very much a work in progress...

RichardScottOZ commented 3 years ago

Yes, it's all a work in progress, the magic Stack of STACs.

ricardobarroslourenco commented 2 years ago

Hi everyone! I was trying to reproduce this notebook, and in the third cell the search throws an error that seems to be related to the search call:

gaierror                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    159             conn = connection.create_connection(
--> 160                 (self._dns_host, self.port), self.timeout, **extra_kw
    161             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     60 
---> 61     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     62         af, socktype, proto, canonname, sa = res

/srv/conda/envs/notebook/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    751     addrlist = []
--> 752     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    753         af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    676                 headers=headers,
--> 677                 chunked=chunked,
    678             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    977         if not getattr(conn, "sock", None):  # AppEngine might not have .sock
--> 978             conn.connect()
    979 

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
    308         # Add certificate verification
--> 309         conn = self._new_conn()
    310         hostname = self.host

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    171             raise NewConnectionError(
--> 172                 self, "Failed to establish a new connection: %s" % e
    173             )

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    726             retries = retries.increment(
--> 727                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    728             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    438         if new_retry.is_exhausted():
--> 439             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    440 

MaxRetryError: HTTPSConnectionPool(host='earth-search-legacy.aws.element84.com', port=443): Max retries exceeded with url: /stac/search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input> in <module>
     17 )
     18 
---> 19 print('%s items' % results.found())
     20 items = results.items()
     21 items.save('subset.geojson')

/srv/conda/envs/notebook/lib/python3.7/site-packages/satsearch/search.py in found(self)
     73         }
     74         kwargs.update(self.kwargs)
---> 75         results = self.query(**kwargs)
     76         return results['meta']['found']
     77 

/srv/conda/envs/notebook/lib/python3.7/site-packages/satsearch/search.py in query(cls, url, **kwargs)
     80         """ Get request """
     81         logger.debug('Query URL: %s, Body: %s' % (url, json.dumps(kwargs)))
---> 82         response = requests.post(url, data=json.dumps(kwargs))
     83         # API error
     84         if response.status_code != 200:

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/api.py in post(url, data, json, **kwargs)
    117     """
    118 
--> 119     return request('post', url, data=data, json=json, **kwargs)
    120 
    121 

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     59     # cases, and look like a memory leak in others.
     60     with sessions.Session() as session:
---> 61         return session.request(method=method, url=url, **kwargs)
     62 
     63 

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    528         }
    529         send_kwargs.update(settings)
--> 530         resp = self.send(prep, **send_kwargs)
    531 
    532         return resp

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    641 
    642         # Send the request
--> 643         r = adapter.send(request, **kwargs)
    644 
    645         # Total elapsed time of the request (approximately)

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    514                 raise SSLError(e, request=request)
    515 
--> 516             raise ConnectionError(e, request=request)
    517 
    518         except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='earth-search-legacy.aws.element84.com', port=443): Max retries exceeded with url: /stac/search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))

To me, it seems to be an issue with the endpoint being called. Can anyone give a hint on how to solve this? I have already installed satsearch from its GitHub repo, but the error persists.

TomAugspurger commented 2 years ago

I'd recommend trying pystac-client (and updating the notebook if that works). Something like

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")

Docs are at https://pystac-client.readthedocs.io/.
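A rough sketch of how the rest of the old `Search.search(...)` call might map onto pystac-client. The helper names, the bounding box, and the date range are illustrative assumptions, and the collection ID is simply the one from the old notebook, so it may no longer exist in the catalog.

```python
API_URL = "https://earth-search.aws.element84.com/v0"


def build_search_params(bbox, start, end, collection="landsat-8-l1", max_items=100):
    """Assemble keyword arguments for pystac_client's Client.search().

    The default collection ID comes from the old notebook and is an
    assumption; check the catalog's current collection IDs before using it.
    """
    return {
        "collections": [collection],
        "bbox": list(bbox),            # [west, south, east, north]
        "datetime": f"{start}/{end}",  # start/end interval as a single string
        "max_items": max_items,        # cap on the number of items fetched
    }


def run_search(params, api_url=API_URL):
    """Open the STAC API and run the search (needs pystac-client and network)."""
    import pystac_client  # imported lazily so build_search_params works without it

    catalog = pystac_client.Client.open(api_url)
    return catalog.search(**params)


params = build_search_params(
    bbox=(-124.71, 45.47, -116.78, 48.93),  # illustrative bounding box only
    start="2019-01-01",
    end="2019-12-31",
)
print(params)
# search = run_search(params)  # uncomment when pystac-client and network are available
```

How the items are pulled out of the returned search object depends on the pystac-client version, so check the docs for the installed release.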

ricardobarroslourenco commented 2 years ago

Thanks @TomAugspurger ! I will try, and if it works, I will fix and submit a PR, ok?

rabernat commented 2 years ago

I was hoping to use this in a demo and found the same issue described in https://github.com/pangeo-data/landsat-8-tutorial-gallery/issues/6#issuecomment-1043267878. I was able to run

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")

but could not figure out how to refactor the rest of the search in cell 3 to use the new API.

This gallery is a very important demonstration of Pangeo's capabilities in geospatial analysis. Let's get it working again!

ricardobarroslourenco commented 2 years ago

I got dragged into lab affairs these last few days, but I will go over this later this week.

@rabernat the refactoring would be due to the method change basically, or are there new "errors"?

TomAugspurger commented 2 years ago

I'll take a look quick.

TomAugspurger commented 2 years ago

Gotta move on, but I have a start at https://gist.github.com/6051aa1705dc6797beccc9ac6e321ef3.

The STAC items in AWS changed their IDs and structure
The pystac-client API differs a bit from sat-search
The returned STAC items don't seem to work with whatever intake-stac expects in order to do its stuff. I halfway switched over to using stackstac instead, but the output structure is at odds with what the hvplot code was expecting (which is where I have to leave it for now).

I'll pick things up later if I have a chance.

rsignell-usgs commented 2 years ago

The Landsat example here may also be of interest, it uses pystac-client and odc-stac: https://github.com/Element84/geo-notebooks/blob/main/notebooks/odc-landsat.ipynb

Here's the rendered version

TomAugspurger commented 2 years ago

Hmm, now I'm having issues with accessing the data (e.g. the link at https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/047/027/LC08_L1TP_047027_20210630_20210708_01_T1/LC08_L1TP_047027_20210630_20210708_01_T1_thumb_large.jpg is giving a 40x error). Did that bucket recently change to requester-pays? For some reason I thought it had been requester pays for a while, but now I'm not sure.

I could change the source to the Planetary Computer's landsat collection in Azure, but would want an OK from @scottyhq before doing that. Or I could put that in a separate notebook.

scottyhq commented 2 years ago

Thanks for taking time to try and fix the now-dated example @TomAugspurger

> Hmm, now I'm having issues with accessing the data

Yes, I think that s3://landsat-pds has finally been retired! See https://github.com/pydata/xarray/issues/6363#issuecomment-1068741201

There are now at least 4 options for cloud-hosted Landsat data (for better or worse)!

| version | cloud | cloud region | authentication |
| --- | --- | --- | --- |
| NASA HLS v2 | AWS | us-west-2 | NASA URS |
| USGS collection 2 | AWS | us-west-2 | requester-pays |
| Landsat collection 2 | Azure | West Europe | SAS token |
| Landsat collection 1 | Google | US multi-region | public |

@rabernat @TomAugspurger Happy to merge a PR for an updated notebook... But in my mind Pangeo Gallery is primarily to illustrate 1. Large-scale examples of moving compute to the data and 2. Actually execute large examples as big integration tests and see when data or software problems come up (as they have here!). So my reluctance to continue maintaining this example with AWS datasets is due to the following:

Does binderbot still work since the AWS binderhub requires github login?
How to automatically execute code requiring credentials (required for NASA URS or requester pays)?

martibosch commented 2 years ago

Hello,

I have also been browsing a bunch of online materials to get started with intake for landsat data and have not yet been able to reproduce a single notebook.

I have signed up for Microsoft's Planetary Computer (since, correct me if I am mistaken, it seems to be more aligned with open source principles than Google's Earth Engine), and in the meantime, I have been trying to use the AWS datasets with an AWS account. However, when I try to convert to dask/xarray:

import boto3
import intake  # open_stac_item_collection is provided by the intake-stac plugin
import pystac_client
import rasterio as rio
from rasterio import session

# Requester-pays session for the USGS Landsat bucket
aws_session = session.AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)
stac_uri = "https://landsatlook.usgs.gov/stac-server"
collections = ["landsat-c2l1"]

client = pystac_client.Client.open(stac_uri)

results = client.search(
    collections=collections,
    bbox=...,
    datetime=...,
)
items = results.get_all_items()
catalog = intake.open_stac_item_collection(items)
with rio.Env(aws_session):
    # Open the "blue" asset of the first item as a dask-backed xarray object
    ds = catalog[list(catalog)[0]]["blue"].to_dask()

I obtain:

  KeyError                                  Traceback (most recent call last)
  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
      198 try:
  --> 199     file = self._cache[self._key]
      200 except KeyError:

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/lru_cache.py:53, in LRUCache.__getitem__(self, key)
       52 with self._lock:
  ---> 53     value = self._cache[key]
       54     self._cache.move_to_end(key)

  KeyError: [<function open at 0x7fe6b9d5ab00>, ('https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF',), 'r', ()]

  During handling of the above exception, another exception occurred:

  CPLE_AppDefinedError                      Traceback (most recent call last)
  File rasterio/_base.pyx:302, in rasterio._base.DatasetBase.__init__()

  File rasterio/_base.pyx:213, in rasterio._base.open_dataset()

  File rasterio/_err.pyx:217, in rasterio._err.exc_wrap_pointer()

  CPLE_AppDefinedError: Line 49: </head> doesn't have matching <head>.

  During handling of the above exception, another exception occurred:

  RasterioIOError                           Traceback (most recent call last)
  Input In [33], in <cell line: 1>()
        1 with rio.Env(aws_session):
  ----> 2     ds = catalog[list(catalog)[0]]["blue"].to_dask()

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
       67 def to_dask(self):
       68     """Return xarray object where variables are dask arrays"""
  ---> 69     return self.read_chunked()

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
       42 def read_chunked(self):
       43     """Return xarray object (which will have chunks)"""
  ---> 44     self._load_metadata()
       45     return self._ds

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake/source/base.py:236, in DataSourceBase._load_metadata(self)
      234 """load metadata only if needed"""
      235 if self._schema is None:
  --> 236     self._schema = self._get_schema()
      237     self.dtype = self._schema.dtype
      238     self.shape = self._schema.shape

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
       99 self.urlpath, *_ = self._get_cache(self.urlpath)
      101 if self._ds is None:
  --> 102     self._open_dataset()
      104     ds2 = xr.Dataset({'raster': self._ds})
      105     metadata = {
      106         'dims': dict(ds2.dims),
      107         'data_vars': {k: list(ds2[k].coords)
     (...)
      110         'array': 'raster'
      111     }

  File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
       88     self._ds = self._open_files(files)
       89 else:
  ---> 90     self._ds = xr.open_rasterio(files, chunks=self.chunks,
       91                                 **self._kwargs)

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/rasterio_.py:302, in open_rasterio(filename, parse_coordinates, chunks, cache, lock, **kwargs)
      293     lock = RASTERIO_LOCK
      295 manager = CachingFileManager(
      296     rasterio.open,
      297     filename,
     (...)
      300     kwargs=kwargs,
      301 )
  --> 302 riods = manager.acquire()
      303 if vrt_params is not None:
      304     riods = WarpedVRT(riods, **vrt_params)

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:181, in CachingFileManager.acquire(self, needs_lock)
      166 def acquire(self, needs_lock=True):
      167     """Acquire a file object from the manager.
      168 
      169     A new file is only opened if it has expired from the
     (...)
      179         An open file object, as returned by ``opener(*args, **kwargs)``.
      180     """
  --> 181     file, _ = self._acquire_with_cache_info(needs_lock)
      182     return file

  File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:205, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
      203     kwargs = kwargs.copy()
      204     kwargs["mode"] = self._mode
  --> 205 file = self._opener(*self._args, **kwargs)
      206 if self._mode == "w":
      207     # ensure file doesn't get overridden when opened again
      208     self._mode = "a"

  File ~/path/to/conda/env/lib/python3.10/site-packages/rasterio/env.py:442, in ensure_env_with_credentials.<locals>.wrapper(*args, **kwds)
      439     session = DummySession()
      441 with env_ctor(session=session):
  --> 442     return f(*args, **kwds)

  File ~/path/to/conda/env/lib/python3.10/site-packages/rasterio/__init__.py:277, in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
      274 path = _parse_path(raw_dataset_path)
      276 if mode == "r":
  --> 277     dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
      278 elif mode == "r+":
      279     dataset = get_writer_for_path(path, driver=driver)(
      280         path, mode, driver=driver, sharing=sharing, **kwargs
      281     )

  File rasterio/_base.pyx:304, in rasterio._base.DatasetBase.__init__()

  RasterioIOError: Line 49: </head> doesn't have matching <head>.

Since the error seems to come from rasterio, I tried:

import boto3
import rasterio as rio
from rasterio import session

aws_session = session.AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)

uri = "https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF"

with rio.Env(aws_session):
    with rio.open(uri) as src:
        print(src.profile)

but obtain the same error. However, if I change the https uri for its s3 version, i.e., "s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF", everything works.

Also note that I tried setting the CURL_CA_BUNDLE environment variable to "/etc/ssl/certs/ca-certificates.crt" as suggested by @mdmaas (https://www.matecdev.com/posts/landsat-sentinel-aws-s3-python.html) but it did not work either.

Is this normal? How can I make the to_dask method work for Earth on AWS?

For info: rasterio version is 1.3.0, intake version is 0.6.5, and pystac_client version is 0.4.0.

Thank you in advance. Best, Martí

scottyhq commented 2 years ago

> RasterioIOError: Line 49: </head> doesn't have matching <head>.

This suggests that an HTML form is being read rather than the actual file. I ran into this issue recently and found a solution here: https://gis.stackexchange.com/questions/430026/gdalinfo-authenticate-for-remote-file. Because the HTTP links require authentication, it seems best to stick with the s3:// URLs, as you've discovered.

Here is a more up-to-date example I'd suggest following for landsatlook.usgs.gov on AWS https://github.com/pangeo-data/cog-best-practices/issues/12#issuecomment-1180902929
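As a stopgap, the https-to-s3 switch described above can be sketched as a small URL rewrite. `to_s3_url` is a hypothetical helper; the host, path prefix, and bucket name are taken from the two URLs quoted earlier in this thread, and the mapping is assumed to hold only for URLs of that shape.

```python
from urllib.parse import urlparse

HTTPS_HOST = "landsatlook.usgs.gov"
HTTPS_PREFIX = "/data/"
S3_BUCKET = "usgs-landsat"


def to_s3_url(https_url):
    """Map a landsatlook.usgs.gov data URL to its s3://usgs-landsat equivalent.

    URLs that do not match the expected host and path prefix are returned
    unchanged.
    """
    parsed = urlparse(https_url)
    if parsed.netloc != HTTPS_HOST or not parsed.path.startswith(HTTPS_PREFIX):
        return https_url
    key = parsed.path[len(HTTPS_PREFIX):]  # object key inside the bucket
    return f"s3://{S3_BUCKET}/{key}"


https_url = (
    "https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/"
    "2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/"
    "LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF"
)
s3_url = to_s3_url(https_url)
print(s3_url)
```

The resulting `s3://` URL can then be opened with `rio.open(...)` inside the same `rio.Env(aws_session)` block used earlier, with `requester_pays=True` on the session.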

martibosch commented 2 years ago

Thanks @scottyhq for your response. I have managed to stream landsat data from AWS into xarray thanks to the example that you sent. If I am understanding the situation correctly, this means that intake should be updated to read s3 urls rather than https, right?
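Until something like that lands, one possible workaround is to rewrite the asset hrefs before handing the items to intake. The sketch below works on plain item dicts and assumes that each asset may carry an `alternate` -> `s3` -> `href` entry; that structure is an assumption about the USGS STAC metadata and should be checked against a real item. `prefer_s3_hrefs` and the sample item are hypothetical.

```python
import copy


def prefer_s3_hrefs(item):
    """Return a copy of a STAC item dict whose asset hrefs point at S3.

    Assets without an 'alternate' -> 's3' -> 'href' entry are left unchanged.
    """
    updated = copy.deepcopy(item)  # leave the caller's item untouched
    for asset in updated.get("assets", {}).values():
        s3_href = asset.get("alternate", {}).get("s3", {}).get("href")
        if s3_href:
            asset["href"] = s3_href
    return updated


# Hand-written stand-in for a STAC item; real items have many more fields.
item = {
    "id": "example-item",
    "assets": {
        "blue": {
            "href": "https://landsatlook.usgs.gov/data/example_B2.TIF",
            "alternate": {"s3": {"href": "s3://usgs-landsat/example_B2.TIF"}},
        },
        "thumbnail": {"href": "https://landsatlook.usgs.gov/data/example_thumb.jpeg"},
    },
}
patched = prefer_s3_hrefs(item)
print(patched["assets"]["blue"]["href"])
```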

pangeo-data / landsat-8-tutorial-gallery

Issues with latest satsearch #6