TomAugspurger opened this issue 4 years ago
Thanks for the detailed issue @TomAugspurger. There have been a lot of changes to sat-search (>0.3) and intake-stac (>0.3) in the last couple of months, with a new version of intake-stac released just yesterday. Long story short, the notebook needs some updating once those new versions are in the environment.
Hi @scottyhq and @TomAugspurger, I was just running the landsat8 notebook on a newly installed Dask cluster on Azure K8S. I used the exact same version of satsearch (==0.2.3), but still cannot get it to run. Here are the details of the error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-10-47751b4b061a> in <module>
10 'landsat:tier=T1']
11
---> 12 results = Search.search(collection='landsat-8-l1',
13 bbox=bbox,
14 datetime=timeRange,
/srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py in search(cls, **kwargs)
61 del kwargs['sort']
62 kwargs['sort'] = sorts
---> 63 return Search(**kwargs)
64
65 def found(self):
/srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py in __init__(self, **kwargs)
26 """ Initialize a Search object with parameters """
27 self.kwargs = kwargs
---> 28 for k in self.kwargs:
29 if k == 'datetime':
30 self.kwargs['time'] = self.kwargs['datetime']
RuntimeError: dictionary keys changed during iteration
Any idea?
Mmm, I'm not sure. That looks like a bug in satsearch. I believe that development focus is shifting from satsearch to pystac-api-client (https://github.com/stac-utils/pystac-api-client), but I'm not sure how mature pystac-api-client is yet.
I think this would be a simple fix by changing line 28 to:
for k in list(self.kwargs):
But I cannot access and edit the file /srv/conda/envs/notebook/lib/python3.8/site-packages/satsearch/search.py, as I am using a dummy user on the Dask JupyterLab deployment.
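The failure is reproducible in plain Python, independent of satsearch. The condensed sketch below mimics what search.py effectively does (assuming it also deletes the old key shortly after the lines shown in the traceback, which the "keys changed" rather than "changed size" message suggests) and shows why snapshotting the keys with list() fixes it on Python 3.8+:

```python
# Condensed version of the satsearch loop: replacing the 'datetime' key with
# 'time' while iterating over the same dict.
kwargs = {"datetime": "2019-01-01/2019-12-31", "bbox": [0, 0, 1, 1]}
try:
    for k in kwargs:
        if k == "datetime":
            kwargs["time"] = kwargs["datetime"]
            del kwargs["datetime"]  # keys change, but the size does not
except RuntimeError as err:
    print(err)  # dictionary keys changed during iteration (Python 3.8+)

# Iterating over a snapshot of the keys sidesteps the check entirely:
kwargs = {"datetime": "2019-01-01/2019-12-31", "bbox": [0, 0, 1, 1]}
for k in list(kwargs):
    if k == "datetime":
        kwargs["time"] = kwargs["datetime"]
        del kwargs["datetime"]
print(sorted(kwargs))  # ['bbox', 'time']
```

Since `list(kwargs)` is evaluated once before the loop starts, later mutations of the dict can no longer invalidate the iterator.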
@TomAugspurger @ZihengSun, yes, taking a step back, this example really needs to be updated to use a different Landsat 8 dataset. See https://github.com/pangeo-data/landsat-8-tutorial-gallery/issues/8 for some alternatives, including accessing Harmonized Landsat Sentinel-2 (HLS) via NASA's CMR STAC endpoint.
It would be awesome if all public datasets in AWS, Azure, Google had up-to-date STAC metadata and search endpoints, but that is still very much a work in progress...
Yes, it's all a work in progress: the magic Stack of STACs.
Hi everyone! I was trying to reproduce this notebook, and the third cell throws an error that seems to be related to the search call:
gaierror Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    159             conn = connection.create_connection(
--> 160                 (self._dns_host, self.port), self.timeout, **extra_kw
    161             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     60
---> 61     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     62         af, socktype, proto, canonname, sa = res

/srv/conda/envs/notebook/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    751     addrlist = []
--> 752     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    753         af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

NewConnectionError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    676                 headers=headers,
--> 677                 chunked=chunked,
    678             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    977         if not getattr(conn, "sock", None):  # AppEngine might not have .sock
--> 978             conn.connect()
    979

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
    308         # Add certificate verification
--> 309         conn = self._new_conn()
    310         hostname = self.host

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    171             raise NewConnectionError(
--> 172                 self, "Failed to establish a new connection: %s" % e
    173             )

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                 retries=self.max_retries,
--> 449                 timeout=timeout
    450             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    726             retries = retries.increment(
--> 727                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    728             )

/srv/conda/envs/notebook/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    438         if new_retry.is_exhausted():
--> 439             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    440

MaxRetryError: HTTPSConnectionPool(host='earth-search-legacy.aws.element84.com', port=443): Max retries exceeded with url: /stac/search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
<ipython-input> in <module>
     17 )
     18
---> 19 print('%s items' % results.found())
     20 items = results.items()
     21 items.save('subset.geojson')

/srv/conda/envs/notebook/lib/python3.7/site-packages/satsearch/search.py in found(self)
     73         }
     74         kwargs.update(self.kwargs)
---> 75         results = self.query(**kwargs)
     76         return results['meta']['found']
     77

/srv/conda/envs/notebook/lib/python3.7/site-packages/satsearch/search.py in query(cls, url, **kwargs)
     80         """ Get request """
     81         logger.debug('Query URL: %s, Body: %s' % (url, json.dumps(kwargs)))
---> 82         response = requests.post(url, data=json.dumps(kwargs))
     83         # API error
     84         if response.status_code != 200:

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/api.py in post(url, data, json, **kwargs)
    117     """
    118
--> 119     return request('post', url, data=data, json=json, **kwargs)
    120
    121

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     59     # cases, and look like a memory leak in others.
     60     with sessions.Session() as session:
---> 61         return session.request(method=method, url=url, **kwargs)
     62
     63

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    528         }
    529         send_kwargs.update(settings)
--> 530         resp = self.send(prep, **send_kwargs)
    531
    532         return resp

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    641
    642         # Send the request
--> 643         r = adapter.send(request, **kwargs)
    644
    645         # Total elapsed time of the request (approximately)

/srv/conda/envs/notebook/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    514                 raise SSLError(e, request=request)
    515
--> 516             raise ConnectionError(e, request=request)
    517
    518         except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='earth-search-legacy.aws.element84.com', port=443): Max retries exceeded with url: /stac/search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffae1f56f10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
To me, it seems to be an issue with the endpoint being called. Can anyone give a hint on how to solve this? I have already installed satsearch from its GitHub repo, but the problem persists.
I'd recommend trying pystac-client (and updating the notebook if that works). Something like
import pystac_client

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
Docs are at https://pystac-client.readthedocs.io/.
Thanks @TomAugspurger ! I will try, and if it works, I will fix and submit a PR, ok?
I was hoping to use this in a demo and found the same issue described in https://github.com/pangeo-data/landsat-8-tutorial-gallery/issues/6#issuecomment-1043267878. I was able to run
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
but could not figure out how to refactor the rest of the search in cell 3 to use the new API.
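For the refactor itself, something along these lines might work with pystac-client (a sketch only; the bbox, date range, and cloud-cover filter below are illustrative placeholders, not the notebook's original values):

```python
# Search parameters gathered into one dict so they can be inspected or reused.
search_params = dict(
    collections=["landsat-8-l1"],           # collection id from the old notebook
    bbox=[-124.71, 45.47, -116.78, 48.93],  # [min lon, min lat, max lon, max lat] (illustrative)
    datetime="2019-01-01/2019-12-31",       # ISO-8601 interval replaces the old timeRange
    query={"eo:cloud_cover": {"lt": 10}},   # optional property filter (illustrative)
)

# The networked part is wrapped defensively so the sketch stays importable
# offline; with a live connection it prints the number of matching items.
try:
    import pystac_client

    catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
    items = list(catalog.search(**search_params).items())
    print("%s items" % len(items))
except Exception:
    pass
```

Unlike satsearch's `Search.search(...)`, the endpoint URL is bound once at `Client.open(...)` and the filters go to `Client.search(...)`.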
This gallery is a very important demonstration of Pangeo's capabilities in geospatial analysis. Let's get it working again!
I got dragged into lab affairs these last days, but will go over this later this week.
@rabernat is the refactoring basically needed because of the API change, or are there new "errors"?
I'll take a quick look.
Gotta move on, but I have a start at https://gist.github.com/6051aa1705dc6797beccc9ac6e321ef3.
I'll pick things up later if I have a chance.
The Landsat example here may also be of interest; it uses pystac-client and odc-stac: https://github.com/Element84/geo-notebooks/blob/main/notebooks/odc-landsat.ipynb. Here's the rendered version.
Hmm, now I'm having issues with accessing the data (e.g. the link at https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/047/027/LC08_L1TP_047027_20210630_20210708_01_T1/LC08_L1TP_047027_20210630_20210708_01_T1_thumb_large.jpg is giving a 40x error). Did that bucket recently change to requester-pays? For some reason I thought it had been requester pays for a while, but now I'm not sure.
I could change the source to the Planetary Computer's landsat collection in Azure, but would want an OK from @scottyhq before doing that. Or I could put that in a separate notebook.
Thanks for taking the time to try to fix the now-dated example @TomAugspurger
> Hmm, now I'm having issues with accessing the data
Yes, I think that s3://landsat-pds has finally been retired! See https://github.com/pydata/xarray/issues/6363#issuecomment-1068741201
There are now at least 4 options for cloud-hosted Landsat data (for better or worse)!
| version | cloud | cloud region | authentication |
|---|---|---|---|
| NASA HLS v2 | AWS | us-west-2 | NASA URS |
| USGS collection 2 | AWS | us-west-2 | requester-pays |
| Landsat collection 2 | Azure | West Europe | SAS token |
| Landsat collection 1 | Google | US multi-region | public |
@rabernat @TomAugspurger Happy to merge a PR for an updated notebook... But in my mind, Pangeo Gallery exists primarily to 1) illustrate large-scale examples of moving compute to the data, and 2) actually execute large examples as big integration tests, to see when data or software problems come up (as they have here!). So my reluctance to continue maintaining this example with AWS datasets is due to the following:
Hello,
I have also been browsing a bunch of online materials to get started with intake for landsat data and have not yet been able to reproduce a single notebook.
I have signed up for Microsoft's Planetary Computer (since, correct me if I am mistaken, it seems to be more aligned with open-source principles than Google's Earth Engine), and in the meantime I have been trying to use the AWS datasets with an AWS account. However, when I try to convert to dask/xarray:
import boto3
import intake
import pystac_client
import rasterio as rio
from rasterio import session

aws_session = session.AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)

stac_uri = "https://landsatlook.usgs.gov/stac-server"
collections = ["landsat-c2l1"]

client = pystac_client.Client.open(stac_uri)
results = client.search(
    collections=collections,
    bbox=...,
    datetime=...)
items = results.get_all_items()
catalog = intake.open_stac_item_collection(items)

with rio.Env(aws_session):
    ds = catalog[list(catalog)[0]]["blue"].to_dask()
I obtain:
KeyError Traceback (most recent call last)
File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
198 try:
--> 199 file = self._cache[self._key]
200 except KeyError:
File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/lru_cache.py:53, in LRUCache.__getitem__(self, key)
52 with self._lock:
---> 53 value = self._cache[key]
54 self._cache.move_to_end(key)
KeyError: [<function open at 0x7fe6b9d5ab00>, ('https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF',), 'r', ()]
During handling of the above exception, another exception occurred:
CPLE_AppDefinedError Traceback (most recent call last)
File rasterio/_base.pyx:302, in rasterio._base.DatasetBase.__init__()
File rasterio/_base.pyx:213, in rasterio._base.open_dataset()
File rasterio/_err.pyx:217, in rasterio._err.exc_wrap_pointer()
CPLE_AppDefinedError: Line 49: </head> doesn't have matching <head>.
During handling of the above exception, another exception occurred:
RasterioIOError Traceback (most recent call last)
Input In [33], in <cell line: 1>()
1 with rio.Env(aws_session):
----> 2 ds = catalog[list(catalog)[0]]["blue"].to_dask()
File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
67 def to_dask(self):
68 """Return xarray object where variables are dask arrays"""
---> 69 return self.read_chunked()
File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
42 def read_chunked(self):
43 """Return xarray object (which will have chunks)"""
---> 44 self._load_metadata()
45 return self._ds
File ~/path/to/conda/env/lib/python3.10/site-packages/intake/source/base.py:236, in DataSourceBase._load_metadata(self)
234 """load metadata only if needed"""
235 if self._schema is None:
--> 236 self._schema = self._get_schema()
237 self.dtype = self._schema.dtype
238 self.shape = self._schema.shape
File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/raster.py:102, in RasterIOSource._get_schema(self)
99 self.urlpath, *_ = self._get_cache(self.urlpath)
101 if self._ds is None:
--> 102 self._open_dataset()
104 ds2 = xr.Dataset({'raster': self._ds})
105 metadata = {
106 'dims': dict(ds2.dims),
107 'data_vars': {k: list(ds2[k].coords)
(...)
110 'array': 'raster'
111 }
File ~/path/to/conda/env/lib/python3.10/site-packages/intake_xarray/raster.py:90, in RasterIOSource._open_dataset(self)
88 self._ds = self._open_files(files)
89 else:
---> 90 self._ds = xr.open_rasterio(files, chunks=self.chunks,
91 **self._kwargs)
File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/rasterio_.py:302, in open_rasterio(filename, parse_coordinates, chunks, cache, lock, **kwargs)
293 lock = RASTERIO_LOCK
295 manager = CachingFileManager(
296 rasterio.open,
297 filename,
(...)
300 kwargs=kwargs,
301 )
--> 302 riods = manager.acquire()
303 if vrt_params is not None:
304 riods = WarpedVRT(riods, **vrt_params)
File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:181, in CachingFileManager.acquire(self, needs_lock)
166 def acquire(self, needs_lock=True):
167 """Acquire a file object from the manager.
168
169 A new file is only opened if it has expired from the
(...)
179 An open file object, as returned by ``opener(*args, **kwargs)``.
180 """
--> 181 file, _ = self._acquire_with_cache_info(needs_lock)
182 return file
File ~/path/to/conda/env/lib/python3.10/site-packages/xarray/backends/file_manager.py:205, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
203 kwargs = kwargs.copy()
204 kwargs["mode"] = self._mode
--> 205 file = self._opener(*self._args, **kwargs)
206 if self._mode == "w":
207 # ensure file doesn't get overridden when opened again
208 self._mode = "a"
File ~/path/to/conda/env/lib/python3.10/site-packages/rasterio/env.py:442, in ensure_env_with_credentials.<locals>.wrapper(*args, **kwds)
439 session = DummySession()
441 with env_ctor(session=session):
--> 442 return f(*args, **kwds)
File ~/path/to/conda/env/lib/python3.10/site-packages/rasterio/__init__.py:277, in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
274 path = _parse_path(raw_dataset_path)
276 if mode == "r":
--> 277 dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
278 elif mode == "r+":
279 dataset = get_writer_for_path(path, driver=driver)(
280 path, mode, driver=driver, sharing=sharing, **kwargs
281 )
File rasterio/_base.pyx:304, in rasterio._base.DatasetBase.__init__()
RasterioIOError: Line 49: </head> doesn't have matching <head>.
Since the error seems to come from rasterio, I tried:
import boto3
import rasterio as rio
from rasterio import session

aws_session = session.AWSSession(boto3.Session(profile_name="aws"), requester_pays=True)
uri = "https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF"
with rio.Env(aws_session):
    with rio.open(uri) as src:
        print(src.profile)
but I obtain the same error. However, if I change the https URI to its s3 version, i.e. "s3://usgs-landsat/collection02/level-1/standard/oli-tirs/2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF", everything works.
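That https-to-s3 swap can be captured in a small helper. This is a hypothetical convenience function based only on the two URLs quoted in this thread (it simply exchanges the shared prefixes), not anything provided by USGS or rasterio:

```python
def landsatlook_to_s3(url: str) -> str:
    """Translate a landsatlook.usgs.gov HTTPS asset URL to its s3://usgs-landsat
    counterpart by swapping the URL prefix. Hypothetical helper; it assumes the
    path after the prefix is identical in both schemes, as in this thread."""
    prefix = "https://landsatlook.usgs.gov/data/"
    if not url.startswith(prefix):
        raise ValueError("not a landsatlook data URL: %s" % url)
    return "s3://usgs-landsat/" + url[len(prefix):]


band2 = landsatlook_to_s3(
    "https://landsatlook.usgs.gov/data/collection02/level-1/standard/oli-tirs/"
    "2020/165/062/LC08_L1TP_165062_20201231_20210308_02_T1/"
    "LC08_L1TP_165062_20201231_20210308_02_T1_B2.TIF"
)
print(band2)  # s3://usgs-landsat/collection02/level-1/standard/oli-tirs/...
```

The resulting s3:// URL can then be opened with `rio.open(...)` inside the requester-pays `rio.Env(aws_session)` context shown above.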
Also note that I tried setting the `CURL_CA_BUNDLE` environment variable to `/etc/ssl/certs/ca-certificates.crt`, as suggested by @mdmaas (https://www.matecdev.com/posts/landsat-sentinel-aws-s3-python.html), but it did not work either.
Is this normal? How can I make the `to_dask` method work for Earth on AWS?
For info: rasterio version is 1.3.0, intake version is 0.6.5, and pystac_client version is 0.4.0.
Thank you in advance. Best, Martí
> RasterioIOError: Line 49: </head> doesn't have matching <head>.
This is suggestive of reading an HTML page rather than the actual file. I ran into this issue recently and found a solution here: https://gis.stackexchange.com/questions/430026/gdalinfo-authenticate-for-remote-file. Given the authentication required for HTTP links, it seems best to stick with s3:// URLs, as you've discovered.
Here is a more up-to-date example I'd suggest following for landsatlook.usgs.gov on AWS: https://github.com/pangeo-data/cog-best-practices/issues/12#issuecomment-1180902929
Thanks @scottyhq for your response. I have managed to stream Landsat data from AWS into xarray thanks to the example you sent. If I am understanding the situation correctly, this means that intake should be updated to read s3 URLs rather than https, right?
A collection of issues with the latest versions of satsearch / intake-stac:

- https://earth-search.aws.element84.com/v0 is the recommended URL, and you pass it like `results = Search.search(url=api_url, collection='landsat-8-l1', ...)`.
- 34675 items. So repeat that for each one?
- The hardcoded `scenid` isn't present in the `catalog`.
- `ValueError` from https://github.com/intake/intake-stac/issues/64.