Cirrus geospatial pipeline

Hey all, @scottyhq mentioned this Forge thing a few weeks ago, and @rsignell-usgs suggested I post here. We recently open-sourced Cirrus, an AWS pipeline for processing geospatial data. On the surface it sounds similar to Pangeo Forge, but after some reading they seem to be pretty different things, although there might be some synergistic efforts we could discuss.

What is Cirrus? Cirrus is a mostly serverless AWS architecture that makes use of DynamoDB, Lambda, SNS, SQS, Step Functions, and Batch. It is meant for scaling up and for doing both historical and ongoing real-time processing of geospatial data via STAC metadata and assets.

See the repo and docs on architecture and usage.

What can you use Cirrus for? Cirrus has served as the backend processing architecture for a few projects. With it you can:

Fetch metadata from another source, transform to STAC and publish it
Fetch STAC metadata from an API or an S3 bucket (or S3 Inventory) and perform processing on assets, such as copying, converting to COG, or generating preview images and thumbnails
Count how many jobs are in a state of PROCESSING, COMPLETED, or FAILED by datetime or collection via an API
Monitor and track jobs through the system, with links to the original input and to the complete Step Function execution logs for all tasks within a workflow
Rerun jobs based on last state, datetime, or collection
Publish STAC metadata via SNS to be consumed by something like STAC-server
Maintain data provenance when transforming data by including "derived_from" and "copied_from" links back to the source STAC metadata
Chain together Cirrus workflows by providing default processes for collections. Cirrus can consume its own published data and assign a new workflow based on the Collection the Item belongs to, e.g., L0->L1->L2
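
To make the provenance and chaining items above a bit more concrete, here is a minimal sketch in plain Python. It is not Cirrus code: the `DEFAULT_WORKFLOWS` mapping, the collection and workflow names, and both helper functions are hypothetical and only illustrate the idea of adding a "derived_from" link and picking the next workflow from an Item's Collection.

```python
from copy import deepcopy

# Hypothetical mapping from a Collection ID to the workflow that should
# process new Items in that Collection (all names are illustrative only).
DEFAULT_WORKFLOWS = {
    "example-l0": "l0-to-l1",
    "example-l1": "l1-to-l2",
}


def derive_item(source_item, source_href, new_collection):
    """Copy a STAC Item into a new Collection and link back to its source."""
    item = deepcopy(source_item)
    item["collection"] = new_collection
    # Drop the source's "self" link, since the copy lives somewhere else.
    links = [link for link in item.get("links", []) if link.get("rel") != "self"]
    # Record provenance with a "derived_from" link to the source Item.
    links.append(
        {"rel": "derived_from", "href": source_href, "type": "application/geo+json"}
    )
    item["links"] = links
    return item


def next_workflow(item):
    """Return the default workflow for the Item's Collection, or None."""
    return DEFAULT_WORKFLOWS.get(item.get("collection"))


# Usage: an L0 Item is routed to "l0-to-l1"; its derived L1 Item to "l1-to-l2".
l0_item = {"type": "Feature", "id": "scene-1", "collection": "example-l0", "links": []}
l1_item = derive_item(l0_item, "https://example.com/example-l0/scene-1.json", "example-l1")
```

An Item whose Collection has no entry in the mapping simply gets no follow-on workflow, which is what ends a chain in this sketch.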

Sentinel-2 COG Public Dataset: I've used Cirrus to create the new Sentinel-2 COGs. This started off for just Africa, but we are currently more than halfway through processing the entire global archive from the original JP2K format, about 6 million scenes. This past weekend I generated nearly 2 million converted Sentinel scenes using Cirrus. Each scene is 17 COGs, so that's about 34 million file conversions in a couple of days. All failures and successes are tracked, so that errors can be identified and dealt with, and subsequent runs don't reprocess completed data.
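
The "don't reprocess completed data" part is the piece that matters most at this scale, so here is a rough sketch of the idea. This is not how Cirrus is implemented: the `StateStore` class is a hypothetical in-memory stand-in for a key-value state table, and `run_job` is an illustrative helper. Only the PROCESSING, COMPLETED, and FAILED state names come from the description above.

```python
from enum import Enum


class State(str, Enum):
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class StateStore:
    """Hypothetical in-memory stand-in for a job-state table."""

    def __init__(self):
        self._states = {}

    def get(self, job_id):
        return self._states.get(job_id)

    def set(self, job_id, state):
        self._states[job_id] = state

    def count(self, state):
        return sum(1 for s in self._states.values() if s == state)


def run_job(store, job_id, task):
    """Run task(job_id) unless it already COMPLETED; return True if it ran."""
    if store.get(job_id) == State.COMPLETED:
        return False  # skipped, so a rerun never redoes finished work
    store.set(job_id, State.PROCESSING)
    try:
        task(job_id)
    except Exception:
        # Record the failure so it can be found and rerun later.
        store.set(job_id, State.FAILED)
        return True
    store.set(job_id, State.COMPLETED)
    return True
```

With something like this, rerunning the whole batch only touches jobs that are new or FAILED, and `store.count(State.FAILED)` gives a quick view of what still needs attention.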

First I indexed the existing AWS Public Datasets: 20 million sentinel-s2-l1c and sentinel-s2-l2a scenes in JP2K, by converting the Sentinel metadata (tileInfo.json) into STAC Items and indexing those. Then, using that STAC API, I fetched the L2A scenes and converted them to COGs, linking back to the L2A record for provenance, and published/indexed those back in Earth-Search (the sentinel-s2-l2a-cogs collection). All of these steps (publishing, converting, copying) were done with Cirrus.

The COG archive is available through Earth-Search: https://earth-search.aws.element84.com/v0
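
If you want to poke at the archive, a query could look roughly like the sketch below. It assumes the API exposes the standard STAC API `/search` endpoint under the URL above and accepts a JSON POST body; the helper names, bounding box, and date range are made up for illustration, and only the collection name comes from this post.

```python
import json
import urllib.request

# Root URL from this post plus "/search", assumed to be the STAC search endpoint.
SEARCH_URL = "https://earth-search.aws.element84.com/v0/search"


def build_search(collection, bbox, datetime_range, limit=10):
    """Build a STAC API search body (bbox is [west, south, east, north])."""
    return {
        "collections": [collection],
        "bbox": list(bbox),
        "datetime": datetime_range,
        "limit": limit,
    }


def search(body, url=SEARCH_URL, timeout=30):
    """POST the search body and return the matching Items (GeoJSON features)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["features"]


# Example body: COG scenes over an arbitrary box during June 2020.
body = build_search(
    "sentinel-s2-l2a-cogs",
    [36.5, -1.5, 37.5, -0.5],
    "2020-06-01T00:00:00Z/2020-06-30T23:59:59Z",
)
```

Calling `search(body)` would then return a list of Items whose assets point at the COGs, provided the endpoint is reachable and behaves as assumed here.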

How does this relate to Forge? My understanding so far is that Forge is about developing a flexible system where users can define new processes, share those processes, and control processing through an API. In Cirrus, a user can create a task (either Lambda or Batch) and workflows using that task easily enough, but there is no API for controlling processing, and Cirrus doesn't provide plug-ins or anything similar. To deploy Cirrus with additions, you fork the repo and make the additions there. I don't have a detailed roadmap yet, but one of the ideas is to split out the various Lambdas, tasks, and core pieces so that a user could more easily assemble workflows from building blocks they could get from multiple places, which seems more like what Forge is about.

Thoughts? What are the possible touch points here?
