Open matthewhanson opened 4 years ago
Hi @matthewhanson and thanks for taking the time to share! Sorry no one got back to you for so long.
We just published a new documentation site that clarifies how Pangeo Forge will work: https://pangeo-forge.readthedocs.io/en/latest/index.html
I'll look into Cirrus and consider the questions you raised.
Hey all, @scottyhq mentioned Pangeo Forge a few weeks ago, and @rsignell-usgs suggested I post here. We recently open-sourced Cirrus, an AWS pipeline for processing geospatial data. On the surface it sounds similar to Pangeo Forge, but after some reading they seem like pretty different things, although there might be some synergistic efforts we could discuss.
What is Cirrus? Cirrus is a mostly serverless AWS architecture that makes use of DynamoDB, Lambda, SNS, SQS, Step Functions, and Batch. Cirrus is meant for scaling up, and doing both historical and ongoing real-time processing of geospatial data via STAC metadata and assets.
See the repo and docs on architecture and usage.
What can you use Cirrus for? Cirrus has served as the backend processing architecture for a few projects. With it you can:
Sentinel-2 COG Public Dataset: I've used Cirrus to create the new Sentinel-2 COGs. This started off for just Africa, but we are currently more than halfway through processing the entire global archive from the original JP2K format, about 6 million scenes. This past weekend I generated nearly 2 million converted Sentinel scenes using Cirrus. Each scene is 17 COGs, so that's 34 million file conversions in a couple of days. All failures and successes are tracked, so that errors can be identified and dealt with, and subsequent runs don't reprocess completed data.
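To illustrate the success/failure tracking described above, here is a minimal sketch of the idea, not Cirrus's actual implementation. The names `State`, `run_batch`, and `state_db` are hypothetical, and a plain dict stands in for whatever state store a real deployment would use (the post mentions DynamoDB among the AWS services involved).

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List


class State(Enum):
    """Hypothetical per-scene processing states."""
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


def run_batch(
    scene_ids: Iterable[str],
    convert: Callable[[str], None],
    state_db: Dict[str, State],
) -> List[str]:
    """Convert each scene unless it is already recorded as completed.

    Returns the IDs that were attempted in this run. `state_db` is an
    in-memory stand-in for a persistent state table (an assumption).
    """
    attempted = []
    for scene_id in scene_ids:
        # Skip work that a previous run already finished.
        if state_db.get(scene_id) is State.COMPLETED:
            continue
        try:
            convert(scene_id)
        except Exception:
            # Record the failure so it can be inspected and retried later.
            state_db[scene_id] = State.FAILED
        else:
            state_db[scene_id] = State.COMPLETED
        attempted.append(scene_id)
    return attempted
```

Running `run_batch` twice over the same scene list only retries the scenes that failed or were never attempted, which is the property that makes large reruns cheap.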
First I indexed the existing AWS Public Datasets: 20 million sentinel-s2-l1c and sentinel-s2-l2a scenes in JP2K, by converting the Sentinel metadata (tileInfo.json) into STAC Items and indexing those. Then, using that STAC API, I fetched the L2A scenes and converted them to COGs, linking back to the L2A record for provenance, and published/indexed those back in Earth-Search (the sentinel-s2-l2a-cogs collection). All of these steps (publishing, converting, copying) were done with Cirrus. The COG archive is available through Earth-Search: https://earth-search.aws.element84.com/v0
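As a rough sketch of the provenance step, the function below builds a STAC Item for the converted COGs that links back to the source L2A Item. This is an illustration under assumptions, not the code Cirrus runs: `make_cog_item` and its arguments are hypothetical, the example hrefs are placeholders, and using the STAC `derived_from` link relation is my guess at how such a back-link could be expressed.

```python
import copy
from typing import Any, Dict

# Standard STAC media type for Cloud-Optimized GeoTIFF assets.
COG_MEDIA_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"


def make_cog_item(
    source_item: Dict[str, Any],
    cog_hrefs: Dict[str, str],
    collection: str = "sentinel-s2-l2a-cogs",
) -> Dict[str, Any]:
    """Return a new STAC Item dict for COG assets derived from `source_item`.

    Hypothetical helper: geometry, datetime, and properties are copied
    from the source; assets and links are replaced.
    """
    item = copy.deepcopy(source_item)
    item["collection"] = collection
    item["assets"] = {
        name: {"href": href, "type": COG_MEDIA_TYPE}
        for name, href in cog_hrefs.items()
    }
    # Find the source Item's own URL so the new Item can point back to it.
    source_self = next(
        (l["href"] for l in source_item.get("links", []) if l.get("rel") == "self"),
        None,
    )
    item["links"] = []
    if source_self is not None:
        item["links"].append(
            {"rel": "derived_from", "href": source_self, "type": "application/json"}
        )
    return item
```

The source Item is left unmodified, so the same L2A record can be used again if a conversion has to be rerun.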
How's this relate to Forge? My understanding so far is that Forge is about developing a flexible system where users can define new processes, share those processes, and control processing through an API. In Cirrus, a user can create a task (either Lambda or Batch) and workflows using that task easily enough, but there is no API for controlling processing, and Cirrus doesn't provide plug-ins or anything similar. To deploy Cirrus with additions, you fork the repo and make the additions there. I haven't gotten a detailed roadmap yet, but one of the ideas was to split out the various Lambdas, tasks, and core pieces so that a user could more easily assemble workflows from building blocks they could get from multiple places, which seems more like what Forge is about.
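To make the "building blocks" idea concrete, here is one way it could look in miniature. This is purely a hypothetical sketch of the roadmap idea, not anything that exists in Cirrus or Pangeo Forge: the `TASKS` registry, the `register` decorator, `run_workflow`, and the task names are all made up for illustration.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]
Task = Callable[[Payload], Payload]

# Hypothetical registry: tasks from different packages could add themselves here.
TASKS: Dict[str, Task] = {}


def register(name: str) -> Callable[[Task], Task]:
    """Decorator that makes a task available to workflows under `name`."""
    def decorator(fn: Task) -> Task:
        TASKS[name] = fn
        return fn
    return decorator


def run_workflow(task_names: List[str], payload: Payload) -> Payload:
    """Run the named tasks in order, passing each output to the next task."""
    for name in task_names:
        payload = TASKS[name](payload)
    return payload


@register("convert-to-cog")
def convert_to_cog(payload: Payload) -> Payload:
    # Placeholder for a real conversion step.
    return {**payload, "format": "COG"}


@register("publish")
def publish(payload: Payload) -> Payload:
    # Placeholder for indexing the result in a STAC API.
    return {**payload, "published": True}
```

With something like this, a workflow becomes a list of task names, e.g. `run_workflow(["convert-to-cog", "publish"], {"id": "scene-1"})`, and adding a task no longer requires forking the core repo.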
Thoughts? What are the possible touch points here?