intake / intake-stac

Intake interface to STAC data catalogs
https://intake-stac.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License

Iterating through catalog items is awkward and slow #66

Open scottyhq opened 3 years ago

scottyhq commented 3 years ago

A common need is getting URLs from item assets within a catalog, which involves iterating over hundreds of items. Here is a quick example:

import satsearch
import intake

bbox = [35.48, -3.24, 35.58, -3.14] # (min lon, min lat, max lon, max lat)
dates = '2010-07-01/2020-08-15'

URL='https://earth-search.aws.element84.com/v0'
results = satsearch.Search.search(url=URL,
                                  collections=['sentinel-s2-l2a-cogs'], # note collection='sentinel-s2-l2a-cogs' doesn't work
                                  datetime=dates,
                                  bbox=bbox,    
                                  sortby=['+properties.datetime'])
print('%s items' % results.found())
itemCollection = results.items()
#489 items

Initializing the catalog is fast!

%%time 
catalog = intake.open_stac_item_collection(itemCollection)
#CPU times: user 3.69 ms, sys: 0 ns, total: 3.69 ms
#Wall time: 3.7 ms

Iterating through items is slow, and I'm a bit confused by the syntax too. I find myself wanting to use an integer index to get the first item in a catalog (first_item = catalog[0]), or to simplify the code block below to hrefs = [item.band.metadata.href for item in catalog]. Currently, iterating through a catalog returns item IDs as strings.

%%time 
band = 'visual'
hrefs = [catalog[item][band].metadata['href'] for item in catalog]
#CPU times: user 4.6 s, sys: 1.23 ms, total: 4.6 s
#Wall time: 4.61 s

As for speed, it only takes microseconds to iterate through the underlying JSON via sat-stac:

%%time 
band = 'visual'
hrefs = [i.assets[band]['href'] for i in catalog._stac_obj]
#CPU times: user 684 µs, sys: 0 ns, total: 684 µs
#Wall time: 689 µs

@martindurant any suggestions here? I'm a bit perplexed about where the code lives to handle list(catalog) or for item in catalog: ...

martindurant commented 3 years ago

If you do a profile, I assume that all the time is being spent in instantiating the Source objects, which involves (I think) talking to the server - but you don't actually need this just to look at the hrefs. You should be able to iterate the entries dict (without instantiating) with

catalog._get_entries()

which is the method used in internal methods such as walk, to get all the entries. Each entry has a describe() method that gets you everything including the metadata and should need no processing (or access ._metadata directly).
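
A toy illustration of this distinction, not intake's actual implementation: `ToyEntry` and `ToyCatalog` below are hypothetical stand-ins, and the hrefs are made up. In this sketch, `catalog[name]` pays a per-item instantiation cost, while `_get_entries()` hands back the cheap entry objects whose `describe()` output already contains the metadata.

```python
class ToyEntry:
    """Hypothetical stand-in for a catalog entry: cheap to create and describe."""

    instantiations = 0  # counts how often an entry was turned into a source

    def __init__(self, name, metadata):
        self.name = name
        self._metadata = metadata

    def describe(self):
        # Returns plain metadata; no source object is built.
        return {"name": self.name, "metadata": self._metadata}

    def get(self):
        # Stand-in for the expensive step of building a source object.
        ToyEntry.instantiations += 1
        return self


class ToyCatalog:
    """Hypothetical stand-in for a catalog that instantiates on item access."""

    def __init__(self, entries):
        self._entries = entries

    def __iter__(self):
        return iter(self._entries)  # yields names, like iterating a dict

    def __getitem__(self, name):
        return self._entries[name].get()  # pays the instantiation cost

    def _get_entries(self):
        return self._entries  # no instantiation


catalog = ToyCatalog(
    {f"item-{i}": ToyEntry(f"item-{i}", {"href": f"s3://bucket/item-{i}.tif"}) for i in range(3)}
)

# Slow path: every lookup instantiates an entry.
slow = [catalog[name].describe()["metadata"]["href"] for name in catalog]
slow_count = ToyEntry.instantiations

# Fast path: read metadata straight from the entries.
fast = [entry.describe()["metadata"]["href"] for entry in catalog._get_entries().values()]
fast_count = ToyEntry.instantiations - slow_count
```

Both paths produce the same list of hrefs; only the first one triggers the per-item instantiation.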

TomAugspurger commented 3 years ago

To better understand what's going on, I added some logging to all our _load methods and compared performance with pystac.

```diff
diff --git a/intake_stac/catalog.py b/intake_stac/catalog.py
index 145c590..47da8f6 100644
--- a/intake_stac/catalog.py
+++ b/intake_stac/catalog.py
@@ -1,3 +1,4 @@
+import logging
 import os.path
 import warnings
 
@@ -7,6 +8,8 @@ from intake.catalog.local import LocalCatalogEntry
 from pkg_resources import get_distribution
 
 __version__ = get_distribution('intake_stac').version
 
+logger = logging.getLogger(__name__)
+
 # STAC catalog asset 'type' determines intake driver:
 # https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#media-types
@@ -116,6 +119,7 @@ class StacCatalog(AbstractStacCatalog):
         Load the STAC Catalog.
         """
         for subcatalog in self._stac_obj.get_children():
+            logger.info("Loading subcatalog %s from %s", subcatalog, self)
             if isinstance(subcatalog, pystac.Collection):
                 # Collection subclasses Catalog, so check it first
                 driver = StacCollection
@@ -131,6 +135,7 @@ class StacCatalog(AbstractStacCatalog):
                 )
 
         for item in self._stac_obj.get_items():
+            logger.info("Loading item %s from %s", item, self)
             self._entries[item.id] = LocalCatalogEntry(
                 name=item.id,
                 description='',
@@ -182,6 +187,7 @@ class StacItemCollection(AbstractStacCatalog):
         if not self._stac_obj.ext.implements('single-file-stac'):
             raise ValueError("StacItemCollection requires 'single-file-stac' extension")
         for feature in self._stac_obj.ext['single-file-stac'].features:
+            logger.info("Loading feature %s from %s", feature, self)
             self._entries[feature.id] = LocalCatalogEntry(
                 name=feature.id,
                 description='',
@@ -232,6 +238,7 @@ class StacItem(AbstractStacCatalog):
         Load the STAC Item.
         """
         for key, value in self._stac_obj.assets.items():
+            logger.info("Loading asset %s from %s", key, self)
             self._entries[key] = StacAsset(key, value)
 
     def _get_metadata(self, **kwargs):
```

Here's pystac loading an asset from a fairly nested catalog:

In [1]: import pystac

In [2]: %time pcat = pystac.read_file("https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json")
CPU times: user 11.9 ms, sys: 1.02 ms, total: 12.9 ms
Wall time: 317 ms

In [3]: %time pcat.get_child("CBERS4A").get_child("CBERS4A-WPM").get_child("CBERS4A WPM 209").get_child("CBERS4A WPM 209/139").get_item("CBERS_4A_WPM_20200730_209_139_L4").get_assets()['B0']
   ...:
CPU times: user 47.2 ms, sys: 11.8 ms, total: 58.9 ms
Wall time: 1.67 s
Out[3]: <Asset href=s3://cbers-pds/CBERS4A/WPM/209/139/CBERS_4A_WPM_20200730_209_139_L4/CBERS_4A_WPM_20200730_209_139_L4_BAND0.tif>

And here's intake-stac

In [4]: import intake, coloredlogs; coloredlogs.install()

In [5]: %time cat = intake.open_stac_catalog("https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json")
2021-06-24 10:21:37 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4> from <Intake catalog: CBERS>
2021-06-24 10:21:37 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4A> from <Intake catalog: CBERS>
CPU times: user 153 ms, sys: 9.79 ms, total: 163 ms
Wall time: 966 ms

In [6]: %time cat['CBERS4A']["CBERS4A-WPM"]["CBERS4A WPM 209"]["CBERS4A WPM 209/139"]["CBERS_4A_WPM_20200730_209_139_L4"]["B0"]
2021-06-24 10:21:43 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Collection id=CBERS4A-WPM> from <Intake catalog: CBERS4A>
2021-06-24 10:21:44 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Collection id=CBERS4A-MUX> from <Intake catalog: CBERS4A>
2021-06-24 10:21:44 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Collection id=CBERS4A-WFI> from <Intake catalog: CBERS4A>
2021-06-24 10:21:45 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4A WPM 209> from <Intake catalog: CBERS4A-WPM>
2021-06-24 10:21:46 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading subcatalog <Catalog id=CBERS4A WPM 209/139> from <Intake catalog: CBERS4A WPM 209>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading item <Item id=CBERS_4A_WPM_20200730_209_139_L4> from <Intake catalog: CBERS4A WPM 209/139>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset thumbnail from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset metadata from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B0 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B1 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B2 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B3 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
2021-06-24 10:21:47 DESKTOP-D37TN6N intake_stac.catalog[5256] INFO Loading asset B4 from <Intake catalog: CBERS_4A_WPM_20200730_209_139_L4>
CPU times: user 757 ms, sys: 999 ms, total: 1.76 s
Wall time: 4.2 s

I think that we make an HTTP request (both intake-stac and pystac) each time we load a sub-catalog / collection. IIUC though, intake-stac will load all the sub-catalogs / collections (e.g. the first log lines mention both CBERS4 and CBERS4A, which are siblings, but we only use CBERS4A). So maybe that's the first place to look: delay loading children until necessary?

martindurant commented 3 years ago

We have been going through a similar discussion at intake-sdmk (lazy dict linked), where it would be infeasible to get all the metadata of all the possible catalogs and items when creating the hierarchy.

TomAugspurger commented 3 years ago

Thanks. In this case things might be easier. I think we always know the keys for subcatalogs / sub-collections:

In [18]: pcat.to_dict()
Out[18]:
{'type': 'Catalog',
 'id': 'CBERS',
 'stac_version': '1.0.0',
 'description': "Catalogs from CBERS 4/4A missions' imagery on AWS",
 'links': [{'rel': <RelType.SELF: 'self'>,
   'href': 'https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json',
   'type': <MediaType.JSON: 'application/json'>},
  {'rel': <RelType.ROOT: 'root'>,
   'href': './catalog.json',
   'type': <MediaType.JSON: 'application/json'>},
  {'rel': 'child', 'href': './CBERS4/catalog.json'},
  {'rel': 'child', 'href': './CBERS4A/catalog.json'}],
 'stac_extensions': [],
 'title': 'CBERS on AWS'}

In [19]: pcat.get_child_links()
Out[19]:
[<Link rel=child target=<Catalog id=CBERS4>>,
 <Link rel=child target=<Catalog id=CBERS4A>>]

So we can know the keys of sub collection / catalogs statically, without additional HTTP requests. Accessing that key would require loading the catalog (another HTTP request). I think this also applies to child items in a Catalog, just using get_item_links(). And presumably it works for assets too.
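
As a rough sketch of that idea, the child locations can be read from the already-loaded catalog document alone. The dict below is trimmed from the `to_dict()` output above, with the enum values written as plain strings; `child_hrefs` is a hypothetical helper, not part of pystac or intake-stac.

```python
from urllib.parse import urljoin

# Trimmed from the `pcat.to_dict()` output above.
catalog_doc = {
    "type": "Catalog",
    "id": "CBERS",
    "links": [
        {"rel": "self", "href": "https://cbers-stac-1-0-0.s3.amazonaws.com/catalog.json"},
        {"rel": "root", "href": "./catalog.json"},
        {"rel": "child", "href": "./CBERS4/catalog.json"},
        {"rel": "child", "href": "./CBERS4A/catalog.json"},
    ],
}


def child_hrefs(doc):
    """Return absolute hrefs of all child links, using only the loaded document."""
    links = doc["links"]
    # Resolve relative hrefs against the catalog's own "self" link, if present.
    base = next((link["href"] for link in links if link["rel"] == "self"), "")
    return [urljoin(base, link["href"]) for link in links if link["rel"] == "child"]


# No network access happens here; these are the URLs that would be fetched on demand.
children = child_hrefs(catalog_doc)
```

One caveat with this sketch: the raw links carry only hrefs. The ids shown in the `get_child_links()` output appear to come from link targets that pystac had already resolved, so whether the ids are available without a request may depend on the catalog.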

I'm unsure how dynamic catalogs complicate this.

TomAugspurger commented 3 years ago

Here's a proposed solution: loading a Catalog / Collection will eagerly store the child links, available on stac_obj.get_child_links(). I believe this is always cheap to do, just iterating over a list and some comparisons. We'll store these as some kind of dict:

self._links: Dict[str, StacObject] = {link.target.id: link.target for link in stac_obj.get_child_links()}

Then when users look up an item, we'll check for the key in links. If it's not there, then raise. If it is there, then we'll wrap the link in the appropriate container (intake.catalog.StacCatalog, etc.). Presumably we'd have a custom ._entries mutable mapping that caches things.
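
A minimal sketch of such a caching mapping, under stated assumptions: `LazyEntries` and `wrap` are hypothetical names, the links are plain strings rather than pystac objects, and the mapping is read-only for brevity (the proposal above mentions a mutable one).

```python
from collections.abc import Mapping


class LazyEntries(Mapping):
    """Hypothetical mapping: keys known up front, values built on first access."""

    def __init__(self, links, wrap):
        self._links = dict(links)  # key -> cheap link, stored eagerly
        self._wrap = wrap          # callable that builds the expensive container
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._links:
            raise KeyError(key)
        if key not in self._cache:
            self._cache[key] = self._wrap(self._links[key])  # pay the cost once
        return self._cache[key]

    def __contains__(self, key):
        # Mapping's default __contains__ calls __getitem__, which would load the value.
        return key in self._links

    def __iter__(self):
        return iter(self._links)  # iterating never triggers loading

    def __len__(self):
        return len(self._links)


loaded = []  # records which links were actually wrapped


def wrap(link):
    # Stand-in for wrapping a link in the appropriate catalog container.
    loaded.append(link)
    return {"wrapped": link}


entries = LazyEntries(
    {"CBERS4": "./CBERS4/catalog.json", "CBERS4A": "./CBERS4A/catalog.json"}, wrap
)
keys = list(entries)         # no loading
has_sibling = "CBERS4" in entries  # no loading
first = entries["CBERS4A"]   # loads and caches
second = entries["CBERS4A"]  # served from the cache
```

With this shape, listing keys and membership checks stay cheap, and only the child that is actually accessed gets loaded.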

My two unknowns right now:

  1. How does this interact, if at all, with LocalCatalogEntry?
  2. get_children() calls StacObject.resolve_link(root=self.get_root_link()). When, if ever, does resolve_link need to be called? I'm not sure what it does, but it seems to be relatively slow, so ideally we do it at getitem time (and cache the result).

martindurant commented 3 years ago

How does this interact, if at all, with LocalCatalogEntry?

Normally, a getitem returns one of these as its value, which is a primitive form of laziness (entries should be cheap to create, while data source instances can have a fixed cost). In your case, you might still make an entry if there is any scope for parameters. But since you have your laziness up front in the dict-like, you could maybe skip the entry altogether.