opengeospatial / ogcapi-geodatacubes

Large scale processing (Synchronous, Asynchronous, On-demand) #9

Open jerstlouis opened 3 months ago

jerstlouis commented 3 months ago

From Testbed 19 GDC ER Critical Feedback:

  1. Supporting both synchronous and asynchronous processing is good, as it accommodates both prototype development and large-scale, scalable processing.

In OGC API - Processes - Part 3: Workflows, we define the "Collection Input" and "Collection Output" requirement classes, which I strongly recommend be part of the "Core Processing" profile (see the ER Section 4.1 Profiles proposal by Ecere). They allow clients to efficiently access data resulting from processing exactly the same way as a regular static preprocessed GeoDataCube, and they are fully suitable for large-scale processing as an alternative to asynchronous "batch" processing, while being much easier to manage: they completely avoid the need for job management, estimation, etc., instead relying on clients requesting small bits at a time as needed, which also makes it easier to prioritize clients. All of this is explained in detail in Section 6.2 Design Goals of Processes - Part 3. At least 3 participants in Testbed 19 (Ecere, Compusult, Wuhan University) experimented with and successfully implemented Collection Input, Collection Output, or both.
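For illustration, here is a minimal sketch (in Python, using `requests`) of what executing a process this way could look like. The endpoint, process name and collection name are hypothetical; the `"collection"` input and the `response=collection` negotiation follow the Part 3 draft, but details may vary across implementations:

```python
import requests

server = "https://example.com/ogcapi"  # hypothetical endpoint

# Execution request: the input references an existing collection by URI
# ("Collection Input"); nothing needs to be pre-computed before execution.
execution_request = {
    "process": f"{server}/processes/NDVI",  # hypothetical process
    "inputs": {
        "data": {"collection": f"{server}/collections/sentinel2-l2a"}
    }
}

# Asking for response=collection ("Collection Output") returns a collection
# description instead of kicking off a batch job; data is then generated on
# demand as the client accesses the collection through OGC API building
# blocks such as Coverages or Tiles.
r = requests.post(
    f"{server}/processes/NDVI/execution",
    params={"response": "collection"},
    json=execution_request,
)
virtual_collection = r.json()  # looks like any other collection description
print([link["href"] for link in virtual_collection["links"]])
```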

As a concept, "Collection Input" is similar to the openEO load_collection process, but the collection does not need to have been produced before you can reference and use it; everything happens on the fly. Similarly, "Collection Output" is somewhat similar to the STAC output, but rather than relying on STAC items to access output assets (which means accessing whole assets unless a cloud-optimized format like COG or Zarr is used), it relies on OGC APIs such as OGC API - Coverages. A request to the API (which can be a subset) generates only what is required, pulling whatever it needs from the rest of the workflow and triggering processing as needed.
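Continuing the hypothetical sketch above, a client could then pull a small spatiotemporal subset of that virtual collection through OGC API - Coverages, and only that subset gets computed. The collection id, axis names and output media type below are assumptions:

```python
import requests

server = "https://example.com/ogcapi"   # hypothetical endpoint
collection = "ndvi-virtual"             # hypothetical virtual collection id

# Subsetting follows the OGC API - Coverages draft: only the requested
# area and time range need to be processed by the workflow.
r = requests.get(
    f"{server}/collections/{collection}/coverage",
    params={
        "subset": "Lat(45:46),Lon(-76:-75)",
        "datetime": "2024-06-01/2024-06-30",
    },
    headers={"Accept": "image/tiff; application=geotiff"},  # e.g. GeoTIFF
)
with open("ndvi_subset.tif", "wb") as f:
    f.write(r.content)
```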