Workflow examples and use cases ...

opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

Other

45 stars 45 forks source link

Workflow examples and use cases ... #279

Open pvretano opened 2 years ago

pvretano commented 2 years ago

Following on from issue #278, the purpose of this issue is to capture examples of workflows from the various approaches (OpenEO, OAPIP Part 3, etc.), compare them and see where there is commonality and where there are differences. The goal is to converge on some conformance classes for Part 3.

Be specific with the examples, provide code if you can and try to make then not-too-long! ;)

fmigneault commented 2 years ago

Using CWL, the Workflow process WorkflowStageCopyImages is deployed using this definition: https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/WorkflowStageCopyImages/deploy.json

It encapsulate 2 chained processes (steps), defined using following deployments respectively: https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/DockerStageImages/deploy.json https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/DockerCopyImages/deploy.json

All 3 processes embed the CWL definition in their executionUnit[0].unit field.

Execution uses the following payload: https://github.com/crim-ca/weaver/blob/master/tests/functional/application-packages/WorkflowStageCopyImages/execute.json

On submitted execution, the workflow will run the process chain, first "generating an image" from the input string, and the second process will do a simple pass-through of the file contents.

The chaining logic is all defined by CWL. Because of in/out entries under steps in the workflow, it is possible to connect, parallelize, aggregate, etc. I/O however we want, without any need for reprocessing if there is duplicating of data sources for intermediate steps.

OGC API - Processes itself has no actual knowledge of single Process vs Workflow chaining. The implementer can decide to parse the CWL and execute it as they see fit. From the external user point of view, atomic and workflow processes are distinguishable in terms of inputs/outputs. If need be, intermediate processes can be executed by themselves as well.

Side Notes

Processes use deployment, but they could well be pre-deployed or builtin in the application. This is irrelevant for Part 3.
Sample workflow uses (eg. "run": "DockerStageImages") to refer to the chained processes. This can be replaced by a full-URL to dispatch executions on distinct OGC API - Processes instances if need be. In this case, it assumes "same instance" with "run": "{processId}".
Definitions use the "old" OGC schemas where inputs/outputs were defined as lists of objects rather than the current <id>:definition mapping. This is not an issue, I just haven't converted the samples because our implementation supports both variants.

jerstlouis commented 2 years ago

Examples and use cases for OGC API - Processes - Part 3: Workflows & Chaining

(apologies for a complete failure at trying to make it not-too-long)

Scenario 1: Land Cover Classification (collection input / remote collections / collection output)

Say we have a server providing a vast collections of sentinel-2 data to which new scenes captured by the satellites get added continuously. That data is hypothetically available from a hypothetical OGC API implementation deployed at https://esa.int/ogcapi/collections/sentinel2:level2A and supports a number of OGC API specifications, including Coverages, (coverage) Tiles, EDR and DGGS.

Say we have another server providing MODIS data at https://usgs.gov/ogcapi/collections/modis which has a lower spatial resolution, but higher temporal resolution.

Research center A has developed and trained a Machine Learning model able to classify land cover from MODIS and sentinel-2 data and published it as a Process in an OGC API - Processes implementation with support for Part 3 at https://research-alpha.org/ogcapi/processes/landcover. The process has some degree of flexibility which allows to tweak the results of the classification.

Research center B wants to experiment with land cover classification, and they use their favorite OGC API client and first discovers the existence of the Land Cover classification process searching for "land cover" keywords from a central OGC catalog of trusted OGC API deployments of certified implementations. The client fetches the process description, and from it it can see what types of inputs are expected. Inputs are qualified with a geodata class which allows the client to easily discover implementations able to supply the data it needs. From the same central OGC catalog, it discovers the MODIS and sentinel-2 data sources as perfect fits, and automatically generates a workflow execution request that looks like this (despite its simplicity, the user still does not need to see it):

{
   "process" : "https://research-alpha.org/ogcapi/processes/landcover",
   "inputs" : {
      "modis_data" : { "collection" : "https://usgs.gov/ogcapi/collections/modis" },
      "sentinel2_data" : { "collection" : "https://esa.int/ogcapi/collections/sentinel2:level2A" }
   }
}

Happy to first try the defaults, the researcher clicks OK. By default the process generates a land cover classification for one specific year. This results in POSTing the execution request to the process execution end-point at https://research-alpha.org/ogcapi/processes/landcover/execution with a response=collection query parameter indicating that the client wishes to use the Workflows & Chaining collection output conformance class / execution mode. The response will be a collection description document including details such as the spatiotemporal extent, links to the access mechanisms and media types supported for the result.

When receiving the request, the first thing the Workflows implementation on research-alpha.org will do is to validate those collection URLs as safe and retrieve their collection description to verify that they are proper inputs. This includes parsing information about the spatiotemporal extent of the collections as well as data access mechanism (e.g. Coverages, Tiles, DGGS...) and supported media types. The server recognizes the inputs as valid (e.g. it sees that their geodata class is a match) and plans on using OGC API - Tiles to retrieve data from the server, since both data inputs advertise support for coverage tiles. Confident that it can acommmodate the workflow being registered, the server responds to the request by generating a collection description document where the spatiotemporal extent spans the intersection of both inputs (e.g. 2016..last year for the whole Earth). The document also declares that the results can be requested either via OGC API - Coverages (with discrete categories), as OGC API - Features, or as OGC API - Tiles (either as coverage tiles or vector tiles).

The client works best with vector tiles (as it uses Vulkan or WebGL to render them client-side), and supports Mapbox Vector Tiles which is one of the media types declared as supported in the response. The response included a link to tilesets of the results of the workflow execution request as Mapbox Vector Tiles. The client selects a tileset using the GNOSISGlobalGrid TileMatrixSet which is suitable for EPSG:4326 / CRS:84 for the whole world (including polar regions). That tileset includes a templated link to trigger processing of a particular resolution and area and request the result for a specific tile: https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee/tiles/GNOSISGlobalGrid/{tileMatrix}/{tileRow}/{tileCol}.mvt.

The client now requests tiles for the current visualization scale and extent currently displayed on its virtual globe, by replacing the parameter variables with tile matrices, rows and colums. Since the collection also advertised a temporal extent with a yearly resolution and support for OGC API - Tiles datetime conformance class, the client also specified that it is interested in last year with an additional datetime="2021-01-01T00:0:00Z" query parameter.

The research-alpha.org receives the requests and starts distributing the work. First it needs to acquire the data from the source collections. It sends request to retrieve MODIS and sentinel-2 data tiles.

The sentinel-2 server supports a "filter" query parameter allowing to filter data by cloud cover at both the scene metadata as well as the cell data values level to create a cloud-free mosaic of multiple scenes, e.g. "filter=scene.cloud_cover < 50 AND cell.cloud_cover < 15". It also supports returning a flattened GeoTIFF when requesting a temporal interval and a "sortby" parameter to order with the least cloud cover will be preserved (on top): "sortby=cell.cloud_cover(desc)".

The trained model requires imagery from different times during the year, so it uses the datetime query parameter to request from the sentinel-2 monthly interval images, with the least amount of cloud possible.

For the MODIS data, the server supports requesting tiles for a whole month with daily values (preserving the temporal dimension).

The internal landcover executable behind the process takes as input 12 netCDF coverages of MODIS (with daily values) and 12 monthly GeoTIFF cloud-free sentinel-2 imagery with raw band values. It supports generating a classified discrete coverage or a multi-polygon feature collection as a result, with one feature per land cover category. It is invoked in parallel for each tile, and maybe further accelerated using GPUs with several cores.

As soon as all the necessary input data is available to process one tile, the prediction for that tile is executed (using the model which persists in shared memory as long as it has been used recently). As soon as the prediction is complete for a tile, the result is returned.

Due to the parallel nature of the requests/processing, the small pieces of data being requested and processed, the use of GPU acceleration, and the use of efficient and well optimized technology, the client starts receiving the result tiles within 1 or 2 seconds. The client immediately starts displaying the results with a default style sheet and caches the resulting tiles.

Now the user starts zooming in on an area of interest. The lower resolution tiles are still displayed on the globe while waiting for more refined results to come in (requested for a more detailed zoom level / a smaller scale denominator). Soon those show up on the client display and the user starts seeing interesting classification results. If the user zooms back out, the lower-resolution / larger area results are still cached, so the user does not see a black screen.

The user notices that a classification looks off for a particular land cover category. The user goes back in the execution request / workflow editor and tweaks an input parameter that should correct the situation. The client POSTs a new execution request as a result, which results in a new collection response and a new link to generate tiles of the results. The client invalidates the currently cache tiles which no longer reflect this updated workflow. The server validates the workflow immediately because it still has active connections to the input collections used and does not need to validate them again. The new response comes back quickly and the client can display the result again, which looks good.

The landcover process server had cached responses from the previous MODIS and sentinel-2, so it does not need to go back to make those requests again. It simply needs to re-run the prediction model with the new parameters.

The user explores areas of interest at different resolutions of interest and results keep coming in quickly. The user is satisfied with the results and now select a large area to export at a detailed scale. A lot of the results equired for this operation have already been cached during the exploration phase by the client and / or the landcover server. The "batch process" finishes quickly. The user is very happy with OGC API - Processs workflows after having succeeded producing a land cover map in 15 minutes from discovery to the resulting map.

We demonstrated a similar scenario in MOAW project using sentinel-2 data from EuroDataCube / SentinelHub. See JSON process description.

Scenario 2: Custom map rendering (remote process / nested process)

As a slight twist to Scenario 1, the user wishes to render a map server-side using an their own server (but it could just as easily be on any server implementing a maps rendering process) instead of rendering it client-side.

The server has a RenderMap process that takes in a list of layers as input. The result of the process is available either using OGC API - Maps or as map tiles using OGC API - Tiles, in a variety of CRSes and TileMatrixSets.

The discovery process and selection of processes and input is very similar as in Scenario 1, except this time the RenderMap process will be the one to which the client will be POSTing the execution request. The landcover process will become nested process, its output being an input into the RenderMap process, and could be rendered on top of a sentinel-2 mosaic:

{
   "process" : "https://research-beta.org/ogcapi/processes/RenderMap",
   {
      "inputs" : {
         "layers" : [
            {
               "collection" : "https://esa.int/ogcapi/collections/sentinel2:level2A",
               "ogcapiParameters" : {
                  "filter" : "scene.cloud_cover < 50 and cell.cloud_cover < 15",
                  "sortby": "cell.cloud_cover(desc)"
               }
            },
            {
               "process" : "https://research-alpha.org/ogcapi/processes/landcover",
               "inputs" : {
                  "modis_data" : { "collection" : "https://usgs.gov/ogcapi/collections/modis" },
                  "sentinel2_data" : { "collection" : "https://esa.int/ogcapi/collections/sentinel2:level2A" }
               }
            }
         ]
      }
   }
}

The RenderMap process may also take in other input parameters, e.g. a style definition.

In a similar manner to Scenario 1, the client will receive a collection description document, this time links to map tilesets and to a map available for the results. The client decides to trigger the processing and request results using OGC API - Maps, and build a request specifying a World Mercator (EPSG:3395) CRS, a bounding box, a date & time, and a width for the result (height is automatically calculated from the normal aspect ratio):

https://research-beta.org/ogcapi/internal-workflows/b357-c0ffee/map.png?crs=EPSG:3395&bbox=-80,40,-70,45&bbox-crs=OGC:CRS84&datetime=2021-01-01T00:0:00Z&width=8192.

Although the client is requesting a WorldMercator map, the RenderMap process implementation might still leverage vector tiles using the GNOSISGlobalGrid tile matrix set, and thus submit multiple requests to the landcover process server, acting in the same way as the client-side renderer in scenario 1.

See JSON process description for our implementation of such a process.

Scenario 3: Publishing the results of a workflow (virtual collections)

The researcher may now want to publish the map as a dedicated and persistent OGC API collection. Through some "collections" transaction mechanism, the client may POST the workflow definition using a dedicated media type for Processes execution requests, and with proper authentication, to e.g. to /collections to create the collection.

The server can execute the processing based on requests received for that collection, but would also cache results to optimize processing, bandwidth, memory and disk resources.

The collection description may also link to the workflow source, making it easy to reproduce and adapt to similar and derived uses.

As new data gets added to the source collections, caches expire and the virtual collection is always up to date. Rather than the providers having to continously run batch processes using up a lot of resources for areas / resolution of interest that will be mostly out of date before any client is interested in the data, they can instead prioritize resources on the latest requests and on most important (e.g. disaster response). The server can also prioritize resources to pre-empted request that will follow the current request patterns when it has free cycles. Such pre-emption could offset the latency in workflows with a larger number of hops.

This can also be done in the backend without users of the API being aware, but offering these explicit capabilities facilitate the reproducibility and re-use.

Scenario 4: Backend workflow and EVI expression (nested process / deploy workflow)

For this scenario, let's assume the landcover process is itself a workflow that leverages other processes. It could e.g. have been deployed using Processes - Part 2: Deploy, Replace, Undeploy by POSTing it to /processes using a dedicated media type for execution requests (implementations can potentially automatically determine inputs and their schemas by parsing the nested processes that are used as well as their inputs, and analyzing the "input" properties defined in the workflow, so uploading a process description is not absolutely necessary).

In addition to the raw sentinel-2 band, the classification algorithm might for example utilize a pre-computed vegetation index, and specify the filtering logic discussed earlier.

landcover process workflow _inputs: modis_data, sentinel2data {datetime} refers to OGC API datetime parameter used when triggering processing _coverageprocessor creates a new coverage based on bands expressions randomForestPredict runs a random forest classification prediction based on a pre-trained model and input coverages

{
   "process" : "https://research-alpha.org/ogcapi/processes/randomForestPredict",
   "inputs" : {
      "trainedModel" : "https://research-alpha.org/ogcapi/models/sentinel2ModisLandCover",
      "data" :
      [
          { "$ref" : "#/components/monthlyInput", "{month}" :  1 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  2 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  3 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  4 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  5 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  6 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  7 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  8 },
          { "$ref" : "#/components/monthlyInput", "{month}" :  9 },
          { "$ref" : "#/components/monthlyInput", "{month}" : 10 },
          { "$ref" : "#/components/monthlyInput", "{month}" : 11 },
          { "$ref" : "#/components/monthlyInput", "{month}" : 12 }
      ]
   },
   "components" :
   {
      "modis":
      {
         "input" : "modis_data",
         "format": { "mediaType": "application/netcdf" },
         "ogcapiParameters" : {
            "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" }
         }
      },
      "sentinel2":
      {
         "input" : "sentinel2_data",
         "format": { "mediaType": "image/tiff; application=geotiff" },
         "ogcapiParameters" : {
            "filter" : "scene.cloud_cover < 50 and cell.cloud_cover < 15",
            "sortby": "cell.cloud_cover(desc)",
            "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" }
         }
      },
      "monthlyInput":
      {
         { "$ref" : "#/components/modis" },
         { "$ref" : "#/components/sentinel2" },
         {
            "process" : "https://research-alpha.org/ogcapi/processes/coverage_processor",
            "inputs" : {
               "data" : { "$ref" : "#/components/sentinel2" },
               "fields" : { "evi" : "2.5 * (B08 - B04) / (1 + B08 + 6 * B04 + -7.5 * B02)" }
            }
         }
      }
   }
}

Our implementation of the RFClassify process works in a similar way, but up until now it has been implemented as a single process integrating Python and Scitkit-learn. This example introduces new capabilities that would make it easier to implement this as a workflow:

Re-usable components to avoid duplicating inputs / nested processes
OGC API parameters allowing to be more specific on how an OGC API collection would be accessed (they would only apply when an OGC API collection is used as an input, and when those parameters are defined for the specific OGC API access mechanism negotiated)
The use of e.g. {datetime} to refer to aspects of how the processing is triggered by the OGC API data access mechanisms

Scenario 5: Point cloud gridifier (landing page output)

In this scenario, a collection supporting point cloud requests (e.g. as .las using OGC API - Tiles) is provided as an input, and the process generates two outputs for it by gridifying the point cloud: an ortho-rectified imagery and a DSM. In order to have access to both outputs, the client uses a response=landingPage instead of a collection description.

{
  "process" : "https://example.com/ogcapi/processes/PCGridify",
  "inputs" : {
     "data" : { "collection" : "https://example.com/ogcapi/collections/bigPointCloud" },
     "fillDistance" : 100,
     "classes" : [ "ground", "highVegetation" ]
  }
}

The response is an OGC API landing page, with two collections available (one for the ortho imagery and one for the DSM). A client wishing to nest this workflow and use one specific output can specify which to include using the usual "outputs" property:

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",
  "inputs" : {
     "dataset" : { "collection" : "https://example.com/ogcapi/collections/osm:roads" },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",
        "inputs" : {
           "data" : { "collection" : "https://example.com/ogcapi/collections/bigPointCloud" },
           "fillDistance" : 100,
           "classes" : [ "roads" ]
        },
        "outputs" : { "dsm" : { } }
     },
     "preference" : "shortest",
     "mode" : "pedestrian",
     "waypoints" : { "value" : {
        "type" : "MultiPoint",
        "coordinates" : [
           [ -71.20290940, 46.81266578 ],
           [ -71.20735275, 46.80701663 ]
        ]
        }
     }
   }
}

An interesting extension of this use case is to generate the point cloud from a photogrametry process using a collection of oblique imagery at one end, and to use use the process in a another workflow doing classification / segmentation and conversion into mesh, that a client can trigger by requesting 3D content using OGC API - GeoVolumes.

See JSON process descriptions for the Point cloud gridifier and the Routing engine in our implementation of such processes.

Scenario 6: Fewer round-trips (immediate access)

As a way to reduce the number of round-trips, the ability to submit workflows to other end-points has been considered. e.g. in Scenario 1, the client could submit the execution request to /processes/landcover/tiles/GNOSISGlobalGrid instead of to /processes/landcover/execution to immediately receive a vector tileset of the result (which will already contain the resulting tiles templated URL), instead of having to follow links from the returned collection description -> list of vector tilesets -> tileset. This is also useful in demonstrating Workflows in action by posting a workflow execution request directly to a .../map or .../map/tiles/{tileMatrixSetId}/{tileMatrix}/{tileRow}/{tileCol} and receiving (e.g. in Swagger UI).

For a live example of this capability, try POSTing the following execution request to the following end-points:

https://maps.ecere.com/ogcapi/processes/RenderMap/map
https://maps.ecere.com/ogcapi/processes/RenderMap/map/tiles/GNOSISGlobalGrid
https://maps.ecere.com/ogcapi/processes/RenderMap/map/GNOSISGlobalGrid/0/0/0

{
  "process": "https://maps.ecere.com/ogcapi/processes/RenderMap",
  "inputs": {
    "layers": [
      { "collection": "https://maps.ecere.com/ogcapi/collections/SRTM_ViewFinderPanorama" }
    ]
  }
}

More examples (in Annex B) and additional details in draft MOAW discussion paper currently @ https://maps.ecere.com/moaw/DiscussionPaper-Draft3.pdf.

m-mohr commented 2 years ago

(Sorry, I forgot to work on examples and was just reminded once Peter opened the issue. As such my contribution is rather short and a bit incomplete for now.)

It seems there are multiple different base "use cases":

Simply retrieve some data
Process data by providing "low-level" processing instructions (e.g. band math + temporal mean + linear stretching)
Process data just by providing some "high-level" processing instructions (e.g. a landcover process as described above)
Provide a complete processing environment + processing instructions (e.g. a Docker container, I think you all call them application packages?)

All this may also include:

A. Publishing results B. Downloading/Accessing results C. Interchange results across back-ends

In openEO, the focus is on 3 while it seems the previous posts here are focussing more on the parts. This is all not mutually exclusive though. So how can you achieve the use cases above in openEO:

Use Case 1 (data retrieval)

You need to send a load_collection + save_result to the back-end and store the data in a format you wish to get.

Depending on the execution mode you may get different results:

A. You can publish the data using web services, e.g. WMTS, using openEO's "secondary web service" API. B. You can download a single file using synchronous processing or create a STAC catalog with your requested data using batch processing. C. Similarly, you'd create a batch job and then you could load the result_from another back-end (using load_result). This can be automated in code, but doesn't happen automagically yet.

Use Case 2 (low-level processing instructions)

That's the main goal of openEO and that's where it probably shines most. A substantial amount of work has lead to a list of pre-defined processes that can be used for data cube operations, math etc. See https://processes.openeo.org for a list of processes. These can easily be chained (in a process graph) to a "high-level" process, we call them user-defined processes.

The EVI example mentioned above looks like this in "visual mode" (child process graphs not shown):

(Please note the code below is auto-generated from the Editor that is used for the visual mode above. As such the code may not be exactly what an experienced user would write.)

This is the corresponding code from Python:

# Loading the data; The order of the specified bands is important for the following reduce operation.
dc = connection.load_collection(collection_id = "COPERNICUS/S2", spatial_extent = {"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35}, temporal_extent = ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"], bands = ["B8", "B4", "B2"])

# Compute the EVI
B02 = dc.band("B02")
B04 = dc.band("B04")
B08 = dc.band("B08")
evi = (2.5 * (B8 - B4)) / ((B8 + 6.0 * B4 - 7.5 * B2) + 1.0)

# Compute a minimum time composite by reducing the temporal dimension
mintime = evi.reduce_dimension(reducer = "min", dimension = "t")

def fn1(x, context = None):
    datacube2 = process("linear_scale_range", x = x, inputMin = -1, inputMax = 1, outputMax = 255)
    return datacube2

# Stretch range from -1 / 1 to 0 / 255 for PNG visualization.
datacube1 = mintime.apply(process = fn1)
save = datacube1.save_result(format = "GTIFF")

# The process can be executed synchronously (see below), as batch job or as web service now
result = connection.execute(save)

This is the corresponding code in R:

p = processes()

# Loading the data; The order of the specified bands is important for the following reduce operation.
dc = p$load_collection(id = "COPERNICUS/S2", spatial_extent = list("west" = 16.06, "south" = 48.06, "east" = 16.65, "north" = 48.35), temporal_extent = list("2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"), bands = list("B8", "B4", "B2"))

# Compute the EVI
evi_ <- function(x, context) {
  b2 <- x[1]
  b4 <- x[2]
  b8 <- x[3]
  return((2.5 * (b8 - b4)) / ((b8 + 6 * b4 - 7.5 * b2) + 1))
}

# reduce_dimension bands with the defined formula
evi <- p$reduce_dimension(data = dc, reducer = evi_, dimension = "bands")

mintime = function(data, context = NULL) {
    return(p$min(data = data))
}
# Compute a minimum time composite by reducing the temporal dimension
mintime = p$reduce_dimension(data = evi, reducer = mintime, dimension = "t")

fn1 = function(x, context = NULL) {
    datacube2 = p$linear_scale_range(x = x, inputMin = -1, inputMax = 1, outputMax = 255)
    return(datacube2)
}
# Stretch range from -1 / 1 to 0 / 255 for PNG visualization.
datacube1 = p$apply(data = mintime, process = fn1)
save = p$save_result(data = datacube1, format = "GTIFF")

# The process can be executed synchronously (see below), as batch job or as web service now
result = compute_result(graph = save)

This is the corresponding code in JS:

let builder = await connection.buildProcess();

// Loading the data; The order of the specified bands is important for the following reduce operation.
let dc = builder.load_collection("COPERNICUS/S2", {"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35}, ["2018-01-01T00:00:00Z", "2018-01-31T23:59:59Z"], ["B8", "B4", "B2"]);

// Compute the EVI.
let evi = builder.reduce_dimension(dc, new Formula("2.5*(($B8-$B4)/(1+$B8+6*$B4+(-7.5)*$B2))"), "bands");

let mintime = function(data, context = null) {
    let min = this.min(data);
    return min;
}
// Compute a minimum time composite by reducing the temporal dimension
let mintime = builder.reduce_dimension(evi, mintime, "t");

// Stretch range from -1 / 1 to 0 / 255 for PNG visualization.
let datacube1 = builder.apply(mintime, new Formula("linear_scale_range(x, -1, 1, 0, 255)"));
let save = builder.save_result(datacube1, "GTIFF");

// The process can be executed synchronously (see below), as batch job or as web service now
let result = await connection.computeResult(save);

And this is how it looks like in JSON as process (graph):

{
  "process_graph": {
    "1": {
      "process_id": "apply",
      "arguments": {
        "data": {
          "from_node": "mintime"
        },
        "process": {
          "process_graph": {
            "2": {
              "process_id": "linear_scale_range",
              "arguments": {
                "x": {
                  "from_parameter": "x"
                },
                "inputMin": -1,
                "inputMax": 1,
                "outputMax": 255
              },
              "result": true
            }
          }
        }
      },
      "description": "Stretch range from -1 / 1 to 0 / 255 for PNG visualization."
    },
    "dc": {
      "process_id": "load_collection",
      "arguments": {
        "id": "COPERNICUS/S2",
        "spatial_extent": {
          "west": 16.06,
          "south": 48.06,
          "east": 16.65,
          "north": 48.35
        },
        "temporal_extent": [
          "2018-01-01T00:00:00Z",
          "2018-01-31T23:59:59Z"
        ],
        "bands": [
          "B8",
          "B4",
          "B2"
        ]
      },
      "description": "Loading the data; The order of the specified bands is important for the following reduce operation."
    },
    "evi": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "dc"
        },
        "reducer": {
          "process_graph": {
            "nir": {
              "process_id": "array_element",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "index": 0
              }
            },
            "sub": {
              "process_id": "subtract",
              "arguments": {
                "x": {
                  "from_node": "nir"
                },
                "y": {
                  "from_node": "red"
                }
              }
            },
            "div": {
              "process_id": "divide",
              "arguments": {
                "x": {
                  "from_node": "sub"
                },
                "y": {
                  "from_node": "sum"
                }
              }
            },
            "p3": {
              "process_id": "multiply",
              "arguments": {
                "x": 2.5,
                "y": {
                  "from_node": "div"
                }
              },
              "result": true
            },
            "sum": {
              "process_id": "sum",
              "arguments": {
                "data": [
                  1,
                  {
                    "from_node": "nir"
                  },
                  {
                    "from_node": "p1"
                  },
                  {
                    "from_node": "p2"
                  }
                ]
              }
            },
            "red": {
              "process_id": "array_element",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "index": 1
              }
            },
            "p1": {
              "process_id": "multiply",
              "arguments": {
                "x": 6,
                "y": {
                  "from_node": "red"
                }
              }
            },
            "blue": {
              "process_id": "array_element",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                },
                "index": 2
              }
            },
            "p2": {
              "process_id": "multiply",
              "arguments": {
                "x": -7.5,
                "y": {
                  "from_node": "blue"
                }
              }
            }
          }
        },
        "dimension": "bands"
      },
      "description": "Compute the EVI. Formula: 2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE)"
    },
    "mintime": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {
          "from_node": "evi"
        },
        "reducer": {
          "process_graph": {
            "min": {
              "process_id": "min",
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "result": true
            }
          }
        },
        "dimension": "t"
      },
      "description": "Compute a minimum time composite by reducing the temporal dimension"
    },
    "save": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "1"
        },
        "format": "GTIFF"
      },
      "result": true
    }
  }
}

For details about our data cubes and related processes: https://openeo.org/documentation/1.0/datacubes.html For details about common smaller "use cases" see the openEO Cookbook: https://openeo.org/documentation/1.0/cookbook/

Use Case 3 (high-level processing instructions)

Any process that you define you can also store as a high-level process that others can execute and re-use. So the EVI process above could simply be stored and then be executed with a single process call. Then your process ist as simple as:

Which in the three programming languages looks as such:

# Python
datacube = connection.datacube_from_process("evi")
result = connection.execute(datacube)

# R
p = processes()
result = compute_result(graph = p$evi())

// JavaScript
let builder = await connection.buildProcess();
let result = await connection.computeResult(builder.evi());

and in JSON:

{
  "id": "evi",
  "process_graph": {
    "1": {
      "process_id": "evi",
      "arguments": {},
      "result": true
    }
  }
}

This is simplified though, you'd probably want to defined parameters (e.g. collection id or extents) and pass them later.

Use Case 4 (processing environments)

We only cater partially for this. Right now, back-ends can provide certain pre-configured environments to run user code (so-called UDFs). This is currently implemented for Python and R and the environments usually differ by the software and libraries installed. Then you would send your code using run_udf as part of an openEO process graph.

We could extend the openEO API relatively easily in a way that user could push their own environments to the servers, but ultimately this was never the goal of openEO and as such could be covered by another standard.

What I haven't captured yet

Execution of a UDF
Parametrization of a user-defined process
Creation of batch jobs and secondary web services
Loading from external sources
probably more?

m-mohr commented 2 years ago

Sorry, I had the meeting in my calendar for 16:00 CET for whatever reason and thus only heard like the last minutes of the call. Did you conclude on something? Otherwise, happy to join the next telco again.

pvretano commented 2 years ago

@m-mohr nope ... no conclusions yet. @jerstlouis and @fmigneault presented their examples so it would be good if at the next meeting you could present your examples. One outcome of today's meeting was that @fmigneault will try to cast one of @jerstlouis examples in CWL. There will also be a recording of today's meeting available if you want to listen to the meeting. @bpross-52n can you post the recording somewhere when it is available?

fmigneault commented 2 years ago

Following is the conversion exercise example Scenario 5 : RoutingEngine provided by @jerstlouis

The first process is PCGridify. It takes all inputs that where in the nested process from Scenario 5 example, and produces a DSM file from the input point cloud.

{
    "processDescription": {
        "id": "PCGridify",
        "version": "0.0.1",
        "inputs": {
            "data": {
                "title": "Feature Collection of Point Cloud to gridify",
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "fillDistance": {
                "schema": {
                    "type": "integer"
                }
            },
            "classes": {
                "schema": {
                    "type": "array",
                    "items": "string"
                }
            },
        }
        "outputs": {
            "dsm": {
                "schema": {
                    "type": "object",
                    "additionalProperties": {}
                }
            }
        }
    },
    "executionUnit": [
        {
            "unit": {
                "cwlVersion": "v1.0",
                "class": "CommandLineTool",
                "baseCommand": ["PCGridify"],
                "arguments": ["-t", "$(runtime.outdir)"],
                "requirements": {
                    "DockerRequirement": {
                        "dockerPull": "example/PCGridify"
                    }
                },
                "inputs": {
                    "data": {
                        "type": "File",
                        "format": "iana:application/json",
                        "inputBinding": {
                            "position": 1
                        }
                    },
                    "fillDistance": {
                        "type": "float",
                        "inputBinding": {
                            "position": 2
                        }
                    },
                    "fillDistance": {
                        "type": "array",
                        "items": "string",
                        "inputBinding": {
                            "position": 3
                        }
                    }
                },
                "outputs": {
                    "dsm": {
                        "type": "File",
                        "outputBinding": {
                            "glob": "*.dsm"
                        }
                    }
                },
                "$namespaces": {
                    "iana": "https://www.iana.org/assignments/media-types/"
                }
            }
        }
    ],
    "deploymentProfileName": "http://www.opengis.net/profiles/eoc/dockerizedApplication"
}

The second process is RouteProcessor. It takes the OpenStreetMap feature collection and "some preprocessed DSM" to generate the estimated route in a plain text file.

{
    "processDescription": {
        "id": "RouteProcessor",
        "version": "0.0.1",
        "inputs": {
            "dataset": {
                "title": "Collection of osm:roads"
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "elevationModel": {
                "title": "DSM file reference",
                "schema": {
                    "type": "string",
                    "format": "url"
                }
            },
            "preference": {
                "schema": {
                    "type": "string"
                }
            },
            "mode": {
                "schema": {
                    "type": "string"
                }
            },
            "waypoints": {
                "schema": {
                    "type": "object",
                    "required": [
                        "type", 
                        "coordinates"
                    ],
                    "properties": [
                        "type": {
                            "type": "string"
                        }
                        "coordinates": {
                            "type": "array",
                            "items": {
                                "type": "array",
                                "items": "float"
                            }
                        }
                    ]
                }
            }
        }
        "outputs": {
            "route": {
                "format": {
                    "mediaType": "text/plain"
                }
                "schema": {
                    "type": "string",
                    "format": "url"
                }
            }
        }
    },
    "executionUnit": [
        {
            "unit": {
                "cwlVersion": "v1.0",
                "class": "CommandLineTool",
                "baseCommand": ["RoutingEngine"],
                "arguments": ["-t", "$(runtime.outdir)"],
                "requirements": {
                    "DockerRequirement": {
                        "dockerPull": "example/RoutingEngine"
                    }
                },
                "inputs": {
                    "dataset": {
                        "type": "File",
                        "format": "iana:application/json",
                        "inputBinding": {
                            "position": 1
                        }
                    },
                    "elevationModel": {
                        "type": "File",
                        "inputBinding": {
                            "position": 2
                        }
                    },
                    "waypoints": {
                        "doc": "Feature Collection",
                        "type": "File",
                        "format": "iana:application/json",
                        "inputBinding": {
                            "position": 3
                        }
                    },
                    "preference": {
                        "type": "string",
                        "inputBinding": {
                            "prefix": "-P"
                        }
                    },
                    "mode": {
                        "type": "string",
                        "inputBinding": {
                            "prefix": "-M"
                        }
                    }
                },
                "outputs": {
                    "route": {
                        "type": "File",
                        "format": "iana:text/plain",
                        "outputBinding": {
                            "glob": "*.txt"
                        }
                    }
                },
                "$namespaces": {
                    "iana": "https://www.iana.org/assignments/media-types/"
                }
            }
        }
    ],
    "deploymentProfileName": "http://www.opengis.net/profiles/eoc/dockerizedApplication"
}

Finally, the RoutingEngine Workflow is defined as follows. It only takes the point cloud and the routing data as input. All other intermediate parameters are "hidden away" for the external user using predefined {"default": <value>} for this workflow implementation.

In the steps section, I applied different names for Workflow level vs Application level inputs to better illustrate how the chaining relationship of I/O is accomplished by CWL.

{
    "processDescription": {
        "id": "RoutingEngine",
        "version": "0.0.1",
        "inputs": {
            "point_cloud": {
                "title": "Feature Collection of Point Cloud to gridify",
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "roads_data": {
                "tite": "Collection of osm:roads",
                "schema": {
                    "type": "object",
                    "properties": {
                        "collection": {
                            "type": "string",
                            "format": "url"
                        }
                    }
                }
            },
            "routing_mode": {
                "schema": {
                    "type": "string"
                }
            }
        }
        "outputs": {
            "estimated_route": {
                "format": {
                    "mediaType": "text/plain"
                }
                "schema": {
                    "type": "string",
                    "format": "url"
                }
            }
        }
    },
    "executionUnit": [
        {
            "unit": {
                "cwlVersion": "v1.0",
                "class": "Workflow",
                "inputs": {
                    "point_cloud": {
                        "doc": "Point cloud that will be gridified",
                        "type": "File"
                    }
                    "roads_data": {
                        "doc": "Feature collection of osm:roads",
                        "type": "File"
                    },
                    "routing_mode": {
                        "schema": {
                            "type": "string", 
                            "enum": [
                                "pedestrian",
                                "car"
                            ]
                        }
                    }
                },
                "outputs": {
                    "estimated_route": {
                        "type": "File",
                        "outputSource": "routing/route"
                    }
                },
                "steps": {
                    "gridify": {
                        "run": "PCGridify",
                        "in": {
                            "data": "point_cloud",
                            "classes": { "default": [ "roads" ] },
                            "fillDistance": { "default": 100 }
                        },
                        "out": [
                            "dsm"
                        ]
                    },
                    "routing": {
                        "run": "RouteProcessor",
                        "in": {
                            "dataset": "roads_data",
                            "elevationModel": "gridify/dsm",
                            "preference": { "default": "shortest"},
                            "mode": "routing_mode"
                        },
                        "out": [
                            "route"
                        ]
                    }
                }
            }
        }
    ],
    "deploymentProfileName": "http://www.opengis.net/profiles/eoc/workflow"
}

I would like to had that these examples are extremely verbose on purpose only to demonstrate the complete capabilities of chaining potential. There are no ambiguity on how to chain elements whatsoever, no matter the amount of processes and I/O implied in the complete workflow.

At least half of all those definitions could be automatically generated as we can see that there are a lot of repetition between CWL's type definitions and schema of I/O from OGC API - Processes definitions. The https://github.com/crim-ca/weaver implementation actually allows inferring OGC API - Processes I/O definitions from CWL using those similarities, and I mostly never need to provide any explicit I/O for the OGC API - Processes portion of the payloads.

p3dr0 commented 2 years ago

As mentioned during the telco here is the example fan-out application design pattern using the CWL ScatterFeatureRequirement requirement.

Scatter Crop Application Example

Example of a CWL that scatters the processing from an array of input values.

cwlVersion: v1.0
$graph:
- class: Workflow
  label: Sentinel-2 product crop
  doc: This application crops bands from a Sentinel-2 product
  id: s2-cropper

  requirements:
  - class: ScatterFeatureRequirement

  inputs:
    product:
      type: Directory
      label: Sentinel-2 input
      doc: Sentinel-2 Level-1C or Level-2A input reference
    bands:
      type: string[]
      label: Sentinel-2 bands
      doc: Sentinel-2 list of bands to crop
    bbox:
      type: string
      label: bounding box
      doc: Area of interest expressed as a bounding box
    proj:
      type: string
      label: EPSG code
      doc: Projection EPSG code for the bounding box
      default: "EPSG:4326"

  outputs:
    results:
      outputSource:
      - node_crop/cropped_tif
      type: Directory[]

  steps:

    node_crop:

      run: "#crop-cl"

      in:
        product: product
        band: bands
        bbox: bbox
        epsg: proj

      out:
        - cropped_tif

      scatter: band
      scatterMethod: dotproduct

- class: CommandLineTool

  id: crop-cl

  requirements:
    DockerRequirement:
      dockerPull: docker.io/terradue/crop-container

  baseCommand: crop
  arguments: []

  inputs:
    product:
      type: Directory
      inputBinding:
        position: 1
    band:
      type: string
      inputBinding:
        position: 2
    bbox:
      type: string
      inputBinding:
        position: 3
    epsg:
      type: string
      inputBinding:
        position: 4

  outputs:
    cropped_tif:
      outputBinding:
        glob: .
      type: Directory

$namespaces:
  s: https://schema.org/
s:softwareVersion: 1.0.0
schemas:
- http://schema.org/version/9.0/schemaorg-current-http.rdf

Composite two-step Workflow Example

This section extends the previous example with an Application Package that is a two-step workflow that crops (using scatter over the bands) and creates a composite image.

cwlVersion: v1.0
$graph:
- class: Workflow
  label: Sentinel-2 RGB composite
  doc: This application generates a Sentinel-2 RGB composite over an area of interest
  id: s2-compositer
  requirements:
  - class: ScatterFeatureRequirement
  - class: InlineJavascriptRequirement
  - class: MultipleInputFeatureRequirement
  inputs:
    product:
      type: Directory
      label: Sentinel-2 input
      doc: Sentinel-2 Level-1C or Level-2A input reference
    red:
      type: string
      label: red channel
      doc: Sentinel-2 band for red channel
    green:
      type: string
      label: green channel
      doc: Sentinel-2 band for green channel
    blue:
      type: string
      label: blue channel
      doc: Sentinel-2 band for blue channel
    bbox:
      type: string
      label: bounding box
      doc: Area of interest expressed as a bounding bbox
    proj:
      type: string
      label: EPSG code
      doc: Projection EPSG code for the bounding box coordinates
      default: "EPSG:4326"
  outputs:
    results:
      outputSource:
      - node_composite/rgb_composite
      type: Directory
  steps:
    node_crop:
      run: "#crop-cl"
      in:
        product: product
        band: [red, green, blue]
        bbox: bbox
        epsg: proj
      out:
        - cropped_tif
      scatter: band
      scatterMethod: dotproduct
    node_composite:
      run: "#composite-cl"
      in:
        tifs:
          source:  node_crop/cropped_tif
        lineage: product
      out:
        - rgb_composite

- class: CommandLineTool
  id: crop-cl
  requirements:
    DockerRequirement:
      dockerPull: docker.io/terradue/crop-container
  baseCommand: crop
  arguments: []
  inputs:
    product:
      type: Directory
      inputBinding:
        position: 1
    band:
      type: string
      inputBinding:
        position: 2
    bbox:
      type: string
      inputBinding:
        position: 3
    epsg:
      type: string
      inputBinding:
        position: 4
  outputs:
    cropped_tif:
      outputBinding:
        glob: '*.tif'
      type: File

- class: CommandLineTool
  id: composite-cl
  requirements:
    DockerRequirement:
      dockerPull: docker.io/terradue/composite-container
    InlineJavascriptRequirement: {}
  baseCommand: composite
  arguments:
  - $( inputs.tifs[0].path )
  - $( inputs.tifs[1].path )
  - $( inputs.tifs[2].path )
  inputs:
    tifs:
      type: File[]
    lineage:
      type: Directory
      inputBinding:
        position: 4
  outputs:
    rgb_composite:
      outputBinding:
        glob: .
      type: Directory

$namespaces:
  s: https://schema.org/
s:softwareVersion: 1.0.0
schemas:
- http://schema.org/version/9.0/schemaorg-current-http.rdf

Please check OGC 20-089 section 8.5 Application Pattern and 8.6. Extended Workflows for more information about these examples.

ghobona commented 2 years ago

A 2008 paper listing some workflow languages used in e-Science is at https://www.dcc.ac.uk/guidance/briefing-papers/standards-watch-papers/workflow-standards-e-science

Note that BPMN is also an ISO standard.

Some related engineering reports:

OGC Testbed-14: BPMN Workflow Engineering Report describes a testbed task that used BPMN.
OGC Testbed-14: WPS-T Engineering Report describes a testbed task that used BPMN.
OGC Testbed-13: Workflows ER describes a testbed task that used BPMN.
OWS-6 Geoprocessing Workflow Architecture Engineering Report describes a testbed task that used BPEL.

I'm not suggesting that OGC API - Processes - Part 3 should use BPMN instead of CWL. I am pointing out that there is a case for supporting multiple workflow languages, if possible.

jerstlouis commented 2 years ago

@fmigneault

The https://github.com/crim-ca/weaver implementation actually allows inferring OGC API - Processes I/O definitions from CWL using those similarities

This is related to what I was suggesting in Scenario 3 above:

implementations can potentially automatically determine inputs and their schemas by parsing the nested processes that are used as well as their inputs, and analyzing the "input" properties defined in the workflow, so uploading a process description is not absolutely necessary

I think it could be possible when creating a process from a workflow (through Part 2: Deploy, Replace, Undeploy) to infer the process description, and then potentially add additional metadata (e.g. a title, input descriptions that cannot be inferred, etc.). In this case, the media type of the payload would be a media type specific to the workflow language... (e.g. CWL, OpenEO, execution request extended with the capabilities I initially proposed for Part 3, i.e. OGC API collections and nested process execution request, and those identified by @ghobona ).

This could be done e.g. by separating out the processDescription from the executionUnit and (similar to OGC API - Styles where we first POST a style, and then add metadata to it, but the content-type of the stylesheet is exactly e.g. SLD/SE or MapboxGL Style) -- in this case the media type could be CWL directly. Another example of an execution unit media type could be a Jupyter notebook. This all works in the context of Part 2 - Deploy, Replace, Undeploy, but technically using different workflow languages / chaining could also potentially be supported directly at /execution, so that a new process does not need to first be "deployed" but could be executed ad-hoc as I suggested in Part 3.

fmigneault commented 2 years ago

@jerstlouis
I agree that workflows could technically be generated on the fly with direct POST on /execution, but I personally don't like this approach too much if the process description is also generated "just-in-time" from different combinations of execution media-type/workflows.

I think is would be a major pain point against OGC API - Processes interoperability because there would basically be no way to replicate executions since we are not even sure which process description gets executed. Each implementation could parse the contents in a completely different manner and generate different process descriptions. This is working against the purpose of the standard being developed in my opinion. The advantage of deployment, although it needs extra steps, is that at the very least we obtain some kind of standard description prior to execution that allows us validate if the process to run was parsed correctly.

Can you please elaborate more on the following part? I'm not sure I understand what you propose.

separating out the processDescription from the executionUnit [...] where we first POST a style, and then add metadata to it

Do you mean that there would be 1 "Workflow Engine" process without any specific inputs/outputs, and that each /execution request would need to submit the full CWL (or whichever else) as the executionUnit each time? What would be the point of the process description in this case, since the core element of the process cannot be known as it would be mostly generated from the submitted executionUnit? It feels like the "POSTing of style" is basically doing a process deployment.

jerstlouis commented 2 years ago

@fmigneault

I agree that workflows could technically be generated on the fly with direct POST on /execution, but I personally don't like this approach too much if the process description is also generated "just-in-time" from different combinations of execution media-type/workflows.

The original idea for these ad-hoc workflows in Part 3 is to allow clients to discover data and processes and immediately make use of these, without requiring special authentication privileges on any of the servers. In that context, I imagined that this would involve lower level processes already made available (and described) using OGC API - Processes (with support for Part 3, or support only Core and using an adapter like the one we developed). I am not sure how well this capability could extend to CWL or OpenEO as well, but was just throwing it out as a possibility because I think you had mentioned before that this could make sense.

The idea is not to replace deployment either... Processes or virtual persistent collections could still be created with those workflows, but the ad-hoc mechanism can provide a way to test and tweak the workflow before publishing it and making it widely available.

Do you mean that there would be 1 "Workflow Engine" process without any specific inputs/outputs, and that each /execution request would need to submit the full CWL (or whichever else) as the executionUnit each time?

In the current draft of Part 3, there is always a top-level process (the one closest to the client in the chain), and the execution request is POSTed to that process. The "process" property is only required for the nested processes, and actually this has resulted in confusion when specifying one (optional) top-level process but POSTing the workflow to the wrong process execution end-point.

There could be a "workflow engine" process as you suggest that requires the "process" key even for the top-level process, avoiding that potential confusion. This might also make more sense with CWL if there is not always a top-level OGC API - Process involved at the top of the workflow.

What would be the point of the process description in this case, since the core element of the process cannot be known as it would be mostly generated from the submitted executionUnit?

Sorry, I might have been adding confusion by mixing up two separate things:

a) (Part 2) Deploy, Replace, Undeploy (deploying a process, potentially using a "workflow" as a payload)
b) (Part 3) ad-hoc execution of workflows

In those "Styles" examples and providing executionUnit and processDescription details separately, I was suggesting this mainly for a), i.e. POSTing the executionUnit content directly to /processes (which allows to use different media types specific to its content) in order to deploy a new process without being forced to provide a processDescription (since it can mostly be inferred).

For b) ad-hoc workflows, in the context of Part 3 as originally proposed, it mainly means re-using processes and data collections already deployed, in a more complex high-level workflow. I imagined the same could potentially be done with CWL. Process descriptions are not involved here (except for any processes used internally, whose descriptions are useful to put together the workflow).

pvretano commented 2 years ago

Just a gentle reminder, Part 2 is not called "OGC API - Processes - Part 2: Deploy, Replace, Undeploy" (i.e. DRU) ... It is no longer called "Transactions".

jerstlouis commented 2 years ago

@fmigneault Thank you for adapting those examples in so much details!

To attempt to get the cross-walk going, some first comments of those 3 JSON snippets:

The first two JSON snippets with CWL unit class CommandLineTool allow to define the PCGridify and RouteProcessor processes and cover functionality which had not been considered for Part 3: Workflows and chaining, but which I think fits well within the current Application Package Best Practice and Part 2: Deploy, Replace, Undeploy. The execution unit portion of these is what I suggested could potentially be POSTed directly to /processes with a CWL media type if the process description can be inferred from the CWL directly.
The third JSON snippet with CWL unit class Workflow serves a similar purpose to what was proposed for Part 3: Workflows and chaining, e.g. if "run" can refer to local (or potentially remote) OGC API processes. Such a Workflow execution unit (the CWL directly) is what could potentially be POSTed to execute a workflow in an ad-hoc manner. Just like the execution request-based workflows, it also makes sense to deploy such workflows as a blackbox process taking inputs and generating outputs using Part 2: DRU.
I wonder to what extent the idea of leaving out some aspects of the execution could also be applied to CWL, to also support processing triggered by data access? e.g. allowing to process different areas and resolution of interests as they are requested by the end-user client, allowing data inputs to come in either from an OGC API collection (which leaves open a lot of flexibility on how to retrieve the data), from a direct URL to a file, or from an embedded payload (not requiring either specifically for the process), to leave open negotiation of which OGC API should be used to transfer data from a collection (e.g. to use vector tiles or Features), or which particular format at each hop of the workflow (when involving different servers)?

fmigneault commented 2 years ago

@jerstlouis

I see. Yes, I was confused about the additional process and POSTing aspect of the Workflows.

I agree with you, the examples I provide converting to CWL are the processes what would be dynamically generated if one wants to represent something POSTed as Part 3 on /execution using the Part 2 concepts, which can then be processed equivalently.

Indeed, the run can be a full URI where a remote process is called. This refers more to the ADES/EMS portion of previous testbeds though, since remote deployment of the process might be needed.

Regarding your 3rd point (https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1047426583), CWL allows some dynamic values to be resolved at runtime using inline JavaScript definitions. I believe this could be used for doing what you mention, but I would most probably define a separate process instead since that would make things clearer IMO (more details below - point 4).

Looking back at your examples, I understand a bit more the various use cases presented and I believe it is possible to consider all new proposed functionalities separately to better illustrate concerns.

1. Collection data type

e.g.: Inputs that have this kind of definition:

    "layers": [
      { "collection": "https://maps.ecere.com/ogcapi/collections/SRTM_ViewFinderPanorama" }
    ]

In my opinion, this should be a new type in itself, similar to bounding boxes. I don't think this should be part of Part 3 per se (or at least consider it as a separate feature). It could be something on its own that could work very well work in conjunction with Core or any extension. What executes this definition behind the scene could very well be some "CollectionFetcher" process that acts like an independent Process using either Part 2 or Part 3 methods, which ever the implementer feels is more appropriate.

I believe more details needs to be provided because there are some use cases where some "magic" happens such as when ogcapiParameters filter or sortby are provided. This is more than just crs as for bounding boxes inputs. I remember @pvretano also highlighting this processing ambiguity when he asked why not simply append those as query parameters after the URL. There is some additional logic that handles those parameters that are not replicable.

2. Nested processes (the main Part 3 feature)

e.g.: A definition as follows:

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",   <---------- optional (POST `/processes/RoutingEngine/execution`)
  "inputs" : {
     "dataset" : { "collection" : "https://example.com/ogcapi/collections/osm:roads" },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",  <--- required, dispatch to this process
        "inputs" : {
           "data" : { "collection" : "https://example.com/ogcapi/collections/bigPointCloud" },   
           "fillDistance" : 100,
           "classes" : [ "roads" ]
        },
        "outputs" : { "dsm" : { } }   <--- (*) this is what chains everything together
     },

(*)
The specification should make it clear that one and only one output ID is allowed there (can't pick many outputs and plug them into the parent single input this definition is nested under). Given that restriction, this definition seems sufficient IMO to generate the corresponding CWL-based processes in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1047047398

I also think there would be no need for custom-workflow/separate POSTing of executionUnit at execution time if this is what the major portion of Part 3 limits itself to.

3. Components schema to reuse definitions

This refers to provided Scenario 4. I find the reuse of definitions with #/components/` a very nice concept feature. I believe to make this work using an equivalent CWL approach, we need to improve/add some details.

The monthlyInput[2].process that refers to coverage_processor makes sense. It can be handled similarly to above point (2). Is it probably only missing the outputs definition to tell which output to connect to the parent process input. Following that, it would be possible to generate the full workflow automatically.

The other two items from Scenario 4 (modis and sentinel2) are too complicated. Most probably, @jerstlouis, you would be the only one to know how this is handled because there is nothing providing details how this is applied. Contrary to the last element under monthlyInput where a process reference is provided, those define more parameters and no way to know how inputs and outputs are connected between each other. Are they supposed to do a collection call as in point (1)?

4. Expressions

I think the expressions {month} and {datetime} should be avoided for Part 3. (Maybe make that a Part 4 extension?) This is not something that is very obvious nor easy to implement, although it looks conceptually very convenient.

Firstly, datetime is picked from the request query parameter (how about other sources, how to tell?). I don't see why you wouldn't simply substitute the request query value directly in the body when submitting it (since you need to submit it anyway) to avoid the complicated parsing that would otherwise be required.

Second, "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" } shows a specific handling for datetime object. In this case, it works because datetime is assumed to be converted to a datetime object with year, month, etc. properties. What about other kind of handling though? Convert to float, int, split string, etc. There are too many use cases that explodes the scope of Part 3.

For example, in CWL, it is possible to do similar substitutions, but to processes them, a full inline parsing using JavaScript is required, and there are constantly issues related to how parsing must be done for one case or another, for different kind of objects, whether they are nested under some field or another, how to link them all as variables, etc. An example of inline definition is "$(runtime.outdir)" in my examples, but it can get much more convoluted.

I don't think many developers would adopt Part 3 Workflow/Chaining capabilities (in itself relatively simple) when such a big implementation requirement must be supported as well. I think OGC API - Processes should keep it simple. I could see the {datetime} and {month} definition easily replaced by another nested process: http... reference that simply runs a CurrentDate process which returns that datetime as output.

5. New operations

This mostly revolves around Scenario 1 and Scenario 6. The addition of endpoints:

/processes/RenderMap/map/...
/processes/landcover/tiles/...`
etc.?

In my opinion, this also unreasonably increases the scope of Part 3 that should focus only on Workflows (ie: nesting/chaining processes). It feels like a separate feature (somewhat related to other OGC API - XXX standards) that adds extra capabilities to processes, but not themselves relevant for Workflows. It would possible to combine those features simultaneously, sure, but completely different parsing methodologies are needed, which would warrant a different extension.

This is also the portion I find no way to easily convert to a CWL equivalent dynamically, simply because there is no detail about it. It also seem to be the only use case where POSTing of different kinds of Workflows/Media-Type at distinct endpoints is required, which is what brought a lot of confusion in the first place.

jerstlouis commented 2 years ago

Thanks @fmigneault for the additional feedback. I will try to address everything you commented on, please let me know if I missed something.

First, I think what you are pointing out is that there is a range of functionality covered in those scenarios which makes sense to organize into different conformance classes. Which of these conformance classes make it into Part 3 remains to be agreed upon, and I think as @pvretano pointed out is one of the main point of this exercise (although perhaps that was specifically referring to conformance classes for different workflow languages).

Note that even if these conformance classes are regrouped in one Processes - Part 3 specification, an implementation could decide to implement any number of the conformance classes, and potentially none of them would be required. Therefore I suggest we focus first on the definition of these conformance classes, and worry later about how to regroup those conformance classes in one or more specification / document.

In my scenarios 1-6 above, the names in parentheses are the conformance classes that I had suggested previously. I presented these at the July 2021 OGC API - Processes Code Sprint and here is the summary from the key slide which might help put things in perspective:

Envisioned conformance classes:

CollectionOutput: Collection as output ?response=collection
LandingPageOutput: Dataset API Landing Page as output ?response=landingPage
CollectionInput: Collection as input { "collection" : "https://server.com/ogcapi/collections/someCollection" }
RemoteCollection: Remote collection as input { "collection" : "https://example.com/ogcapi/collections/someCollection" }
NestedProcess: Nested process as input { "process" : "https://server.com/ogcapi/processes/someProcess" }
RemoteProcess: Remote process as input { "process" : "https://example.com/ogcapi/processes/someProcess" }
ImmediateAccess: POST to /processes/{processId}/{accesstype} (Features, Tiles, Coverages, Maps)
DeployWorkflow: POST to /processes support for { "input" : "SomeParameter"}

Indeed, the run can be a full URI where a remote process is called. This refers more to the ADES/EMS portion of previous testbeds though, since remote deployment of the process might be needed.

In the conformance classes suggested for Part 3, this refers specifically to NestedProcess and RemoteProcess. With RemoteProcess, there would be no need to first deploy the process, whereas with NestedProcess, a process would need to be deployed first in order to use it in a workflow.

Collection type In my opinion, this should be a new type in itself, similar to bounding boxes. I don't think this should be part of Part 3 per se (or at least consider it as a separate feature).

This is specifically the CollectionInput conformance class. I agree that this bit alone is very useful by itself, but it is also what greatly simplifies the chaining, because it works hand in hand with the CollectionOutput conformance class. CollectionOutput allows accessing the output of a process as an OGC API collection. Any process that accepts a collection input is automatically able to use a nested process (whether local or remote) that can generate a collection output.

I fully agree that CollectionInput is useful by itself, in fact there was a perfect example in Testbed 17 - GeoDataCube where the 52 North team implemented support for a LANDSAT-8 Collection input in their Machine Learning classification process / pygeoapi deployment.

Whether this conformance class is added to OGC API - Processes - Part 1: Core 2.0 or OGC API - Processes - Part 3: Workflows and Chaining however does not really matter.

I believe more details needs to be provided because there are some use cases where some "magic" happens such as when ogcapiParameters filter or sortby are provided. This is more than just crs as for bounding boxes inputs.

In full agreement here, as these are details that need to be worked out with more experimentation. Using OGC API collections leaves a lot of flexibility, some of which might be useful to leave up to the hop end-points to negotiate between themselves, but a filter that further qualifies the collection is a good example of wanting to restrict the content of that collection directly within the workflow.

The datetime parameter use case in this scenario where daily datasets are used to generate a yearly dataset, but for which the process needs to first generate monthly coverages is another good example where the end-user query datetime (yearly) needs to become a monthly request to the MODIS and sentinel-2 requests.

The specification should make it clear that one and only one output ID is allowed there (can't pick many outputs and plug them into the parent single input this definition is nested under).

When a process generates a single output, there should be no need to specify the output (that is already the case in Part 1: Core).
I would be inclined not to completely rule out the possibility of a process accepting as "one" input "multiple" outputs (i.e., a dataset with multiple collections). An example of this might be a process taking in an OpenStreetMap PBF and generating a multi-collection dataset (e.g. roads, buildings, amenities...). The Process Description would need to describe the "input" as multiple feature collections somehow.... Conceptually, it could still be considered "one" input.
In most use cases, this restriction makes sense. But I feel like not having this restriction seems more of a communication / documentation issue in how this would normally be used, vs. a real benefit in preventing the possibility of one process output being a multi-collection dataset.
This is somewhat related to Scenario 5 and the LandingPageOutput conformance class (the ability to access the results of process as a multi-collection dataset / OGC API landing page).

I also think there would be no need for custom-workflow/separate POSTing of executionUnit at execution time if this is what the major portion of Part 3 limits itself to.

I am a bit confused by that comment. executionUnit is a concept of the Application Package Best Practice and related to Part 2: DRU, if I understand correctly. The proposed Part 3 NestedProcess conformance class defines the possibility to include nested process as part of submitting an execution request at /processes/{processId}/execution. The RemoteProcess conformance class allows those process to be on another server (without requiring to first deploy it to the server to which the execution request is submitted).

The monthlyInput[2].process that refers to coverage_processor makes sense. It can be handled similarly to above point (2). Is it probably only missing the outputs definition to tell which output to connect to the parent process input.

The process description for _coverageprocessor in this case would define a single output (the resulting coverage), therefore it is not necessary to specify it (as in the published Processes - Part 1: Core).

Contrary to the last element under monthlyInput where a process reference is provided, those define more parameters and no way to know how inputs and outputs are connected between each other. Are they supposed to do a collection call as in point (1)?

I think the confusion here is caused by the use of the { "input" : {parameterNameHere} } defined in the DeployWorkflow conformance class. This Scenario 4 workflow is intended to be deployed as a process rather than being submitted as an execution request (similar to your 3rd JSON snippet with CWL Workflow unit class), and therefore must be supplied inputs: modis_data and sentinel2_data. Those defined inputs would be replaced by the collections supplied in the Scenario 1 example, which presumably invokes the process that Scenario 4 defines using a workflow. These input : are equivalent to the "inputs" in the CWL Workflow unit class, except that they are inferred from wherever they are used rather than being explicitly listed. i.e. if { "input" : "modis_data" } is used in two places in the workflow, that is the same "modis_data" input to the process being defined by the workflow.

Potentially, those inputs could also be supplied as embedded data to the process created by the workflow, and in that case the ogcapiParameters and format would not be meaningful/used -- the filtering and proper format would have had to be done prior to submitting the data as input to the landcover process defined by the Scenario 4 workflow.

I think the expressions {month} and {datetime} should be avoided for Part 3. (Maybe make that a Part 4 extension?) This is not something that is very obvious nor easy to implement, although it looks conceptually very convenient.

I agree this is more complicated and I just came up with those while trying to put this Scenario 4 example together. In more typical use cases, the {bbox} or {datetime} from the OGC API collection data requests would just flow through to the nested processes / collection inputs. But in this case, we only needed the "year" portion of the datetime, and wanted to re-use the same modis / sentinel2 / monthlyInput components but changing the {month}.

Some of this capability to reference how the OGC API collection data request were made I think would make sense to include as part of the CollectionOutput conformance class (e.g. {datetime} and {bbox}).

The capability to use e.g. {month} (i.e. an arbitrary {templateVariable}) together with the $ref might make sense as part of a ReusableComponents conformance class.

Second, "datetime" : { "year" : { "{datetime}.year" }, "month" : "{month}" } shows a specific handling for datetime object.

You are right this is a specific capability here to be able to specify a monthly request using only the year that was provided by the OGC API request triggering the processing, but specifying a different month. What I wished for while writing this was functions to build an ISO8601 string as the OGC API datetime parameter would expect that JSON does not have, so there is some assumption that this works somehow.

This Scenario 4 example is testing new grounds in terms of the capabilities of execution request-based workflow definitions as explored so far, but despite a few things to iron out I feel it manages to very concisely and clearly express slightly more complex / practical workflows.

I could see the {datetime} and {month} definition easily replaced by another nested process: http... reference that simply runs a CurrentDate process which returns that datetime as output.

I would welcome suggestions in how to better express it. I actually considered whether I needed to define new processes but this was the best balance I could manage Sunday night in terms clarity / conciseness / least-hackish / ease of implementation. These aspects are definitely still Work in Progress :) The idea here is that {datetime} referred to the OGC API data access that triggered the process (CollectionOutput), whereas the ogcapiParameters.dateTime refers to the datetime that will be passed to the input collections form which data is being requested (CollectionInput). I'm not sure how a separate process could help with this, unless you mean using a process to do the job that a JSON function could have done? (which is what I also thought of, but discarded as likely worst off in terms of clarity, conciseness, least-hackish AND ease of implementation).

New operations / This mostly revolves around Scenario 1 and Scenario 6.

I have to clarify here that Scenario 6 is completely different from Scenario 1 in this regard. Scenario 1 and 2 use the OGC API data access capabilities (e.g. Tiles and Maps) on the collection generated by the process, where making a request to the collection using these access mechanisms triggers processing. This is what the CollectionOutput conformance class defines, and as discussed above is a key thing to make it easy to chain nested processes when processes support both CollectionInput and CollectionOutput.

In my opinion, this also unreasonably increases the scope of Part 3 that should focus only on Workflows (ie: nesting/chaining processes).

If we are talking about CollectionOutput, I think it does fit well within Workflows and chaining because it provides an easy way to connect the output of any process as an input to any other process, and it enables the use of Tiles and DGGS zones as enablers for parallelism, distributedness, and real-time "just what you need right now" with hot workflows working on small pieces at a time, rather than batched processing ("wait a long time / use up a lot of resources, and what you get in the end might actually not be what you wanted, might never end up being used, or might be outdated by the time it is used").

If we are talking about ImmediateAccess which is covered by Scenario 6, it is a much less essential capability, but as I explained it is quite useful for demonstration purposes (e.g. to demonstrate a PNG response of a workflow directly in SwaggerUI as a single operation), and to some extent to provide fewer server round-trips (e.g. submitting a workflow and getting a templated Tiles URI in a single step).

It also seem to be the only use case where POSTing of different kinds of Workflows/Media-Type at distinct endpoints is required, which is what brought a lot of confusion in the first place.

Seems like there is still some confusion about POSTing workflows and media types, so I will try to clear this out :)

Scenario 6 / ImmediateAccess proposes the ability to POST workflows to different resource types for demonstration and shortcuts purposes -- much less essential capability which could be defined separately and that I don't mind if we forget about it completely while we discuss all the much more important conformance classes.
In conjunction with Part 2: DRU, a workflow could be POSTed to /processes with different media types (e.g. execution request, CWL...) to deploy a new process
A workflow / execution request could potentially be POSTed to /collections to create a new persistent virtual collection from a workflow / process execution
A workflow / execution request can be POSTed to /processes/{processId}/execution (as in OGC API - Processes - Part 1: Core) to execute it. Different media types could potentially allow using the Workflow CWL unit class here.
Also with /execution, the CollectionOutput and LandingPageOutput proposed for Part 3 introduce a new "execution mode" (instead of sync or async) whereas the immediate response is a collection description or a landing page, and the processing only gets triggered when requesting data from the resulting collection(s).

This is also the portion I find no way to easily convert to a CWL equivalent dynamically, simply because there is no detail about it.

Leaving Scenario 6 aside, and focusing on the CollectionOutput capability (e.g. Scenario 1) whereas making an OGC API data request triggers process execution to generate the data for that response, would there be something equivalent? I don't think there are many details missing other than those in respective OGC API specifications (e.g. Tiles, Maps, Coverages...). The data access OGC APIs specify how to request data from an OGC API collection, and an implementation of Part 3: CollectionOutput is able to feed the data for that access mechanism to return when it is requested from that virtual collection.

One other nice thing about CollectionOutput is that it makes support for visualizing the output from workflows in visualization clients much easier (than e.g. Processes - Part 1: Core) with doing very little specifically to implement process / workflow execution. This capability is e.g. implemented in the GDAL OGC API driver (and thus available in QGIS as well). It was also easily implemented in clients by participants in Testbed 17 / GeoDataCube.

Thanks!

jerstlouis commented 2 years ago

@bpross-52n @pvretano Please add a workflow/chaining label! ;)

mr-c commented 2 years ago

A 2008 paper listing some workflow languages used in e-Science is at https://www.dcc.ac.uk/guidance/briefing-papers/standards-watch-papers/workflow-standards-e-science

FYI: A modern list (that is continually being updated) with over 300 workflow systems/languages/frameworks known to be used for data analysis: https://s.apache.org/existing-workflow-systems

There is another list at https://workflows.community/systems that just started. This younger list aims to be a better classified subset of the big list: only the systems that are still being maintained.

fmigneault commented 2 years ago

@jerstlouis

Envisioned conformance classes:

Nice. I missed the presentation about those.

DeployWorkflow: POST to /processes support for { "input" : "SomeParameter"}

We must be careful not to overlap with Part 2 here. This is the same method/endpoint to deploy the complete process.

[...] it is also what greatly simplifies the chaining, because it works hand in hand with the CollectionOutput conformance class. CollectionOutput allows accessing the output of a process as an OGC API collection. Any process that accepts a collection input is automatically able to use a nested process (whether local or remote) that can generate a collection output.

This made me think that we must consider some parameter in the payload that will tell the nested process to return the output this way. Maybe for example "outputs" : { "dsm" : { } } could be replaced by "outputs" : { "dsm" : { "response": "collection" } }. Otherwise it is assumed CollectionOutput is returned, which is not the default for all currently existing implementations.

When a process generates a single output, there should be no need to specify the output (that is already the case in Part 1: Core).

I agree that could be allowed if the default was to return CollectionOutput, but since processes are not expected to do so by default (from Core), I think the proposed { "response": "raw|document|collection" } addition above would always be required.

I would be inclined not to completely rule out the possibility of a process accepting as "one" input "multiple" outputs (i.e., a dataset with multiple collections). An example of this might be a process taking in an OpenStreetMap PBF and generating a multi-collection dataset (e.g. roads, buildings, amenities...). The Process Description would need to describe the "input" as multiple feature collections somehow.... Conceptually, it could still be considered "one" input. In most use cases, this restriction makes sense. But I feel like not having this restriction seems more of a communication / documentation issue in how this would normally be used, vs. a real benefit in preventing the possibility of one process output being a multi-collection dataset.

I agree. By "multiple outputs", I specifically refer to the variable {outputID} that forms the key in the output mapping. If under that key, an array of collections is returned, this is perfectly fine if the parent input that receives it accepts maxOccurs>1. The reason why I think it should be restricted, is allowing multiple {outputID} at the same time implies there must be a way to concatenate all the outputs together to pass it to the input. Because of the large quantity of different output types, formats and representations, this is not trivial.

I'm not sure how a separate process could help with this, unless you mean using a process to do the job that a JSON function could have done? (which is what I also thought of, but discarded as likely worst off in terms of clarity, conciseness, least-hackish AND ease of implementation).

I think that a process (lets call it DatetimeParser) that receives as input (datetime) value, and returns as output (parsedDatetime) a JSON formed as { "year" : { "{datetime}.year" }, "month" : "{month}" } would do the trick. The parent process that nests DatetimeParser for one of its input would simply chain the returned JSON as the input value. Here, the { "parsedDatetime" : { "response": "raw" } } could be used to highlight that the value is passed as data rather than a document or a collection.

If we are talking about CollectionOutput, I think it does fit well within Workflows and chaining because it provides an easy way to connect the output of any process as an input to any other process

For this (Collection[Inputs|Outputs] working in hand with Workflows), I totally agree. It is the /tiles, /map, etc. new features (ImmediateAccess) that IMO are out of scope for Part 3 workflow chaining. It is again highlighted by your clarification regarding POSTing workflows and media types.

Leaving Scenario 6 aside, and focusing on the CollectionOutput capability (e.g. Scenario 1) whereas making an OGC API data request triggers process execution to generate the data for that response, would there be something equivalent?

I think this is possible to map to CWL definitions dynamically if only Collection[Input|Output] are used. I think I would resolve parsing of collection input using a CollectionHandler process that takes the collection URL and any other additional parameters as JSON. That process would be in charge to call the relevant OGC API operation to retrieve the collection, and return it as output. All existing Processes/Workflows from Part 2 could then dynamically generate a sub-workflow by inserting this CollectionHandler when { "collection": ... } is specified as execution input. In the same manner, Tiles, Maps, Coverages, etc. handlers would be distinct CWL parser/handlers. I prefer to have many sub-processes in a large Workflow chain that accomplish very small tasks to convert data in various manners, rather then having OGC API itself have to embed custom handling for each new input/data variation.

I think it is important to keep extensions separate for their relevant capabilities, although they can work together afterwards. This is because, realistically, implementers that try to conform to any new Part should try to implement most of it rather than handpick conformance classes under it. Otherwise, there is no point to have a Part in the first place.

jerstlouis commented 2 years ago

@fmigneault

We must be careful not to overlap with Part 2 here. This is the same method/endpoint to deploy the complete process.

The POST operation to /processes is defined by part 2. The DeployWorkflow conformance class would define that a workflow is a valid payload for Part 2, and that { "input" : {someparameter} } is how to define an input to a workflow deployed as a process.

This made me think that we must consider some parameter in the payload that will tell the nested process to return the output this way.

Well the idea here is that the end-points of any particular hop of that workflow would be the ones deciding whether CollectionOutput is used or not, based on conformance support. It is not required that they do so, e.g. if the Processes server does not support CollectionOutput, Processes - Core could be used and requests could be made using sync or async execution mode -- there is no assumption that one or the other is used.

I agree that could be allowed if the default was to return CollectionOutput, but since processes are not expected to do so by default (from Core), I think the proposed { "response": "raw|document|collection" } addition above would always be required.

Not necessary as I just pointed out, and raw vs. document is gone with https://github.com/opengeospatial/ogcapi-processes/pull/272 (2.0?).

It is the /tiles, /map, etc. new features (ImmediateAccess) that IMO are out of scope for Part 3 workflow chaining. It is again highlighted by your clarification regarding POSTing workflows and media types.

To make things super clear:

CollectionOutput allows to request a ?response=collection which will return a collection description with links to access mechanisms, and then potentially request tiles to trigger results (if Tiles is supported), e.g. https://research-alpha.org/ogcapi/internal-workflows/600d-c0ffee/tiles/GNOSISGlobalGrid/{tileMatrix}/{tileRow}/{tileCol}.mvt in Scenario 1.

ImmediateAccess allows to both POST the workflow and request a tile at the same time, or to POST a workflow and get a tileset right away, as in Scenario 6. e.g. POST workflow to https://maps.ecere.com/ogcapi/processes/RenderMap/map , https://maps.ecere.com/ogcapi/processes/RenderMap/map/tiles/GNOSISGlobalGrid or https://maps.ecere.com/ogcapi/processes/RenderMap/map/GNOSISGlobalGrid/0/0/0 .

ImmediateAccess is a nice to have for demonstration and skipping HTTP roundtrips. I don't mind if it doesn't end up in Part 3. CollectionOutput is a key capabiliy proposed for Part 3.

I prefer to have many sub-processes in a large Workflow chain that accomplish very small tasks to convert data in various manners, rather then having OGC API itself have to embed custom handling for each new input/data variation.

Well one of the important ideas with the CollectionInput / CollectionOutput conformance classes is to leave flexibility to make workflows as generic and re-usable as possible with different OGC API implementations. For example, one might re-use the exact same workflow with different servers or data sources but in practice some will end-up exchanging data using DGGS, others with Tiles, others with Coverages; or another will negotiate netCDF, while another will negotiate Zarr, or GRIB. And the workflow does not need to change at all to accommodate all of these.

It also leaves the workflow itself really reflecting exactly what the user is trying to do: apply this process to these data sources, feed its input to this other process, and all the exchange and communication details are left out of the workflow definition for negotiation by the hops.

Of course any implemention of this is free to convert this in the back-end to smaller tasks and sub-processes invocations internally.

I think it is important to keep extensions separate for their relevant capabilities, although they can work together afterwards. This is because, realistically, implementers that try to conform to any new Part should try to implement most of it rather than handpick conformance classes under it. Otherwise, there is no point to have a Part in the first place.

I think there are different opinions about this throughout OGC. With the building blocks approach, I believe that the fundamental granularity that matters for implementation is the conformance classes, whereas the parts are just a necessary organization of the conformance clases into specification documents for publication and other practical reasons. Taking OGC API - Tiles - Part 1: Core as an example, there is definitely no expectation that any implementation will implement all of its conformance classes. So I disagree that handpicking conformance classes to implement is a bad thing, just like handpicking which OGC API / parts one implements in an OGC API implementation is not a bad thing.

More importantly, I think the modularity of OGC API building blocks makes it easy to start by implementing one or more conformance class, and gradually add support for additional ones based on practical needs and resources available.

jerstlouis commented 2 years ago

I was thinking that we could define a WellKnownProcess that allows executing command line tools with an execution request workflow, similar to the approach used in your example using CWL to define base processes @fmigneault :

Scenario 7

This would be POSTed to /processes (Part 2: DRU) to create the PCGridify process in Scenario 5

{
   "process" : "http://example.com/ogcapi/processes/ExecuteCommand",
   "inputs" : {
      "command" : "PCGridify",
      "requirements" : {
         "docker" : { "pull": "example/PCGridify" }
      },
      "stdin" : { "input" : "data", "format": { "mediaType": "application/vnd.las" } },
      "arguments" : [
         "-fillDistance",
         { "input" : "fillDistance", "schema" : { "type" : "number" } },
         "-classes",
         { "input" : "classes", "schema" : { "type" : "array", "items" : { "type" : "string" } } },
         "-orthoOutput",
         "outFile1"
      ]
   },
   "outputs" :
   {
      "stdout" : {
         "output" : "dsm",
         "format": { "mediaType": "image/tiff; application=geotiff" }
      },
      "outFile1" : {
         "output" : "ortho",
         "format": { "mediaType": "image/tiff; application=geotiff" }
      }
   }
}

Realizing that we also probably need this { "output" : {outputName} } in the DeployWorkflow conformance class to support returning multiple outputs and name outputs from a workflow deployed as a process.

fmigneault commented 2 years ago

@jerstlouis If the CWL nomenclature is used, I think it would be better to simply embed it directly without modification (similar to executionUnit in my examples). Using https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824 representation, I don't see the advantage of placing everything under inputs/outputs with extra input/output sub-keys for each item to tell which are the "real" inputs of the process. It looks like an hybrid of the CWL and traditional process description, which will just make it harder to parse in Part 2.

jerstlouis commented 2 years ago

@fmigneault This is to allow the Part 3 execution request approach / DeployWorkflow to work as a Content-Type option for deploying workflows as a process with Part 2, will a WellKnown process that can execute a command line tool.

It is not a process description, but an execution request as currently defined in OGC API - Processes - Part 1: Core using the extensions defined in Part 3 DeployWorkflow conformance class ("input" and "output"). The process description for the resulting PCGridify process could be inferred from its inputs and outputs and generated automatically. There is nothing CWL in there except for the inspiration from your example and the docker pull requirements :).

One could still POST CWL instead of this execution request workflow of course to deploy a process, or an application package that bundles a process description + CWL in the executionUnit, as different supported Content-Types to deploy processes.

pvretano commented 2 years ago

@jerstlouis something seems wonky here! There should be no need for a "DeployWorkflow". Whether the execution unit of a process is a Docker container, or a Python script or a CWL workflow, that should not matter. All processes should be deployed the same way (i.e. POST to the /processes endpoint as described in Part 2). I am confused.

jerstlouis commented 2 years ago

@pvretano

Whether the execution unit of a process is a Docker container, or a Python script or a CWL workflow, that should not matter. All processes should be deployed the same way (i.e. POST to the /processes endpoint as described in Part 2).

In full agreement with that. We might need different media types or JSON profiles for OGC API - Processes execution requests, for CWL, and for application packages, for this.

What I call the Part 3 - DeployWorkflow conformance class is:

The definition of a particular content type that defines a workflow which can be used in conjunction with Part 2 to deploy a process (execution request which may include some of the extensions defined in the other Part 3 conformance classes).
Specifically this "input" and "output" capability allowing to return one or more output from the workflow, and define inputs to the workflow in those execution requests, for workflows which are not meant to be executed directly, but to be used to define a new process.

So it is those new properties to define inputs & outputs, plus a particular Content-Type for a Part 2 Deploy operation.

Will Part 2 have different conformance classes for different Content-Types? (e.g. like the different Tiles encodings conformance classes). There is already one for OGC Application Package.

If not this DeployWorkflow conformance class, which conformance class could define the capability to define "input" and "output" of the workflow itself for using the workflow as a process, rather than a ready-to-execute execution request? It could be potentially be an Execution Request Deployment Package in Part 2 instead.

NOTE: Different media types for CWL, execution request than for application package are in the context of NOT using the OGC Application Package conformance class defined in Part 2. It is also a possibility to include the execution request-style workflow (just like CWL) in the execution unit of an application package.

Personally I find that the process description is something that the server should generate, not be provided as part of a Part 2 deployment, because it includes information about how the processes implementation is able to execute things (e.g. sync/async execution mode), and it may be able to accept more e.g. formats than the executionUnit being provided, and because most of the process description can often be inferred from the executionUnit alone. Therefore I don't like the current application package approach very much, and would prefer directly providing the executionUnit as the payload to the POST.

fmigneault commented 2 years ago

@jerstlouis I am also confused. You are referring to an execution request but simultaneously saying that you POST on /processes? As @pvretano mentions, I don't think Part 3 should POST any differently than Part 2 does by already accommodating different kind of execution units. If anything, Part 3 should try to work in concert with Part 2, not redefine similar concepts on its own.

jerstlouis commented 2 years ago

@fmigneault Sorry for the confusion and not communicating clearly enough...

The workflows definitions initially proposed for Part 3 are defined by extending the OGC API - Processes - Part 1: Core execution request schema (and adding e.g. NestedProcesses, CollectionInput, etc.).

One of those additions proposed is the ability to define inputs and outputs to the workflow itself within that execution request, e.g. by using { "input" : "data" } in one or more inputs of a process being invoked within the workflow (data is now an input to that process that the workflow defines when it is deployed).

I agree with both of you that processes defined using Part 3 workflows should not be POSTed any differently than Part 2 (other than a distinct Content-Type, if not using the Application Package conformance class).

Definitely the intent to work in concert with Part 2 and not redefine similar concepts.

There are two ways that Part 2 could accommodate different types of processes being deployed:

The Application Package conformance classes supports multiple types of execution units
Since the Application Package is a distinct conformance class (and is a relatively recent addition to Part 2), I understood the intent to be that Part 2 could also support other payloads besides OGC Application Packages, e.g. directly POSTing a Python file, or a CWL workflow, or a Part 3 workflow, or a Jupyter notebook (each with a different Content-Type).

If any of this is still not clear, happy to jump on Skype with either or both of you and @pvretano :)

fmigneault commented 2 years ago

@jerstlouis Although you say it should not deploy differently, it seems that a whole new deployment mechanism is required by this proposition. This is working against an existing and well defined extension that has for main focus to handle process deployment.

This is what I meant previously regarding the importance of grouping stuff in Part 3 that seems to combine too many things. It feels like it somehow forces you to consider it as something on its own rather than working together with existing or future parts. The biggest issue I have with the scope of proposed Part 3 as it stands is that custom behaviours are added for every new problem that we encounter, which many look very easy to resolve with separate/dedicated processes, whether builtin or deployed using Part 2.

Following up on your other points:

Personally I find that the process description is something that the server should generate, not be provided as part of a Part 2 deployment, because it includes information about how the processes implementation is able to execute things (e.g. sync/async execution mode), and it may be able to accept more e.g. formats than the executionUnit being provided,

It is not because something is POSTed with a given payload that the server must absolutely abide to it to the letter. The server can extend or leave out things it wants to according to its own capabilities. The resulting process description is therefore generated by the server, but it allows the user deploying it to provide additional recommendations regarding metadata to better define the process.

because most of the process description can often be inferred from the executionUnit alone. Therefore I don't like the current application package approach very much, and would prefer directly providing the executionUnit as the payload to the POST.

I have use cases where I need to add process description details because it is not sufficient with the CWL itself to define everything. For example, CWL can define an array input that is translatable to array input in OGC API, but I can enforce more explicitly that this array input should have a dimension in range [2, 100] using the process description for the operation to work. Therefore, it is important to keep processDescription and executionUnit separate.

There are two ways that Part 2 could accommodate different types of processes being deployed: The Application Package conformance classes supports multiple types of execution units [...]

This is exactly what it was intended to do. The contents of executionUnit can be anything you want (see additionalProperties: true in ogcapppkg.yaml). There is nothing new to accommodate.

It could be valid to define a more explicit variant of executionUnit that corresponds to a workflow process itself (the DeployWorkflow you mention), similar to how ogcapppkg.yaml config provides a specific example, but it doesn't make sense to shuffle around all the payload fields already defined in processDescription and executionUnit for 1 new method while all others can work with the current structure.

jerstlouis commented 2 years ago

@fmigneault

Although you say it should not deploy differently, it seems that a whole new deployment mechanism is required by this proposition. This is working against an existing and well defined extension that has for main focus to handle process deployment.

That is not what I proposed at all. The Application Package is currently only one potential payload for Part 2 -- that is why it is a separate conformance class, and until recently it was not part of Part 2 but was a Best Practice. This could change if the consensus is that ALL processes deployment must be done using Application Packages, but in that case it should be moved to the main Deploy, Replace, Update conformance class of Part 2 (and that is not currently the case).

I have use cases where I need to add process description details because it is not sufficient with the CWL itself to define everything.

I understand that there is a use case for this. One approach is the Application Package which includes a process description + an execution unit. Another potential approach, is to first POST the payload directly (e.g. CWL workflow), and then add additional metadata by POSTing to a sub-resource of the created process (that is the approach I mentioned used in OGC API - Styles).

There is nothing new to accommodate. The contents of executionUnit can be anything you want (see additionalProperties: true in ogcapppkg.yaml). There is nothing new to accommodate.

I understand that, and yes it can work perfectly fine with the execution request syntax proposed for Part 3. But currently, it would also be valid to use a workflow definition (whether in CWL or in execution request style) directly as the POST payload for Part 2.

@pvretano I think there is persisting confusion about the nature of Application Packages within Part 2: Deploy, Replace, Update. Up until very recently, application packages were not even part of Part 2. They could become the mandatory way to deploy processes if that is the consensus, but currently they are not. There would be no point in Application Package being a separate conformance class of Part 2 if they are mandatory and the required Content-Type for the POST operation to /processes.

it doesn't make sense to shuffle around all the payload fields already defined in processDescription and executionUnit for 1 new method while all others can work with the current structure.

Those fields are not at all from Process Description, they are from the Execution Request. They both have inputs and outputs, so I understand that this can be confusing. I completely omitted the process description (because it can be inferred), and only provided the Part 3 Workflow execution request as the execution unit. I did not include anything from the Application Package's executionUnit field either, though the names of some inputs from the ExecuteCommand WellKnownProcess might be inspired by your examples.

m-mohr commented 2 years ago

Wow... and sorry, but I won't be able to follow these discussions. The amount of text you are writing is far from what I can cope with in the (little) amount of time that I have available for such discussions and tasks. Happy to present our use cases that I've posted above in the next meeting though.

fmigneault commented 2 years ago

@jerstlouis

That is not what I proposed at all. The Application Package is currently only one potential payload for Part 2 -- that is why it is a separate conformance class, and until recently it was not part of Part 2 but was a Best Practice. This could change if the consensus is that ALL processes deployment must be done using Application Packages, but in that case it should be moved to the main Deploy, Replace, Update conformance class of Part 2 (and that is not currently the case).

I beg to differ. Application Package is definitely referenced and mentioned multiple times in the requirements: https://github.com/opengeospatial/ogcapi-processes/blob/master/extensions/deploy_replace_undeploy/standard/sections/clause_6_deploy_replace_undeploy.adoc#adding-a-new-processes-to-the-api-deployprocess

It also has been part of the extension as far back as draft 3 (though named Transactions) https://github.com/opengeospatial/ogcapi-processes/tree/1.0-draft.3/extensions/transactions/openapi/schemas This dates way before the recent Best Practices document that made in more official, and before Parts concept were even introduced in this repository, including Part 3: Workflows that did not exist at all.

This could change if the consensus is that ALL processes deployment must be done using Application Packages

This is exactly my point. It is the consensus to allow any contents (due to different needs by different methodologies) that lead to Application Packages to be defined this way with versatile contents in the executionUnit part.

I think you are the only one confused by the nature of Application Packages that was defined this same way for a very long time.

jerstlouis commented 2 years ago

@fmigneault

Application Package is definitely referenced and mentioned multiple times in the requirements:

That first link you share says "(e.g. OGC Application Package)" -- OGC Application Package is one example of what can be the payload of that POST operation.

In the DRU conformance class, there are recommendations (but not requirements) for deploy:

If a process can be represented for the intended use as an OGC Application Package, implementations should consider supporting the OGC Application Package encoding for describing the process to be added to the API.

and replace:

If a process can be described for the intended use as an OGC Application Package, implementations should consider supporting the OGC Application Package encoding for describing the replacement process.

There is also an explicit permission:

A server may support any processes description encoding in the body of a HTTP POST operation.

Side note: I find it a bit confusing to call the payload of a deployment operation a process description. Even with an Application Package, the process description is just one part of the payload (the other being the execution unit). I like to think of it as the process itself being deployed, whatever that may entail (and as we discussed, the description can potentially be automatically generated by the server).

Regarding:

This dates way before the recent Best Practices document that made in more official, and before Parts concept were even introduced in this repository, including Part 3: Workflows that did not exist at all.

You are right, the Application Package Conformance class was there in that Draft 3 and all the way back to the first commit. The Parts concept was there though -- "Transactions" was Part 3 at the time before the reshuffle. I was under the impression that the Application Packages were originally detailed only in the best practice, apologies. It's possible that I remember an older draft from before the initial commit, or that I made this up entirely.

This is exactly my point. It is the consensus to allow any contents (due to different needs by different methodologies) that lead to Application Packages to be defined this way with versatile contents in the executionUnit part.

Still, my understanding is that OGC Application Package is just one potential Content-Type for Part 2: DRU. @pvretano Could you please confirm? If my understanding is not correct, several things in the specification needs to be changed to make this clearer, such as moving the OGC Application Package conformance class content to the main Deploy, Replace, Update conformance class.

I think you are the only one confused by the nature of Application Packages that was defined this same way for a very long time.

@fmigneault See the following text in the overview:

This extension does not mandate that a specific processes description language or vocabulary be used. However, in order to promote interoperability, this extension defines a conformance class, OGC Application Package, that defines a formal process description language encoded using JSON. A recommendation is made later in this specification that all implementations of this extension support the OGC Application Package.

One potential use case for posting a different Content-Type which I think the current apppkg might not handle well is to securely POST a large compressed binary file (e.g. including a docker image), rather than a link to a publicly available resource. I don't think making the apppkg the only possible way to go about creating a processes is a good idea, and I don't believe that is what the specification says right now, though we seem to be in stark disagreement about this :)

fmigneault commented 2 years ago

@jerstlouis That recommendation does also say:

If a process can be represented for the intended use as an OGC Application Package, implementations should consider supporting the OGC Application Package encoding for describing the process to be added to the API.

Given that it would be much easier to embedded the existing and known structure of a processDescription+CWL-like content in the existing ogcapppkg definition, rather then implement a new parsing methodology, I am indeed following the spec by recommending to reuse this approach.

If you were POSTing completely different metadata, I would agree with you that a new Content-Type would be appropriate, but the example in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824 is massively CWL-inspired. I think we can definitely avoid having multiple standards accomplishing the same thing without working against each other in this case.

jerstlouis commented 2 years ago

@fmigneault

That recommendation does also say:

It's good that we now agree that it is a recommendation and not a requirement to use apppkg, and the use of different Content-Type is possible! :)

I am indeed following the spec by recommending to reuse this approach.

An implementation that does not implement the recommendation would still be a perfectly conforming implementation. It's recommended to support an execution unit within an application package, but it would also be OK to additionally support posting it directly as the payload. That applies regardless of whether we are talking about a workflow definition in CWL, OpenEO or Part 3 execution request workflow. The recommendation is already there in the spec, so in this issue, we really should focus strictly on the proposed workflow definition notations (but consider the use cases of defining a process with Part 2 using them). It does not matter whether they are POSTed directly, or as part of an application package's execution unit, but I really wanted to dispel the misunderstandings or confusion that we had going for a while there.

Another Content-Type could make sense if the process description is not needed, or to define a new compressed payload format which e.g directly embeds the docker image or commandline tool, and includes a minimal set of generic metadata without needing to repeat the inputs and outputs when they can be inferred from the execution unit.

If a new process package format comes along and eventually gets more traction in communities, in a future version of the spec, the recommendation as to which Content-Type should be implemented for interoperability could potentially change without breaking anything.

rather than implement a new parsing methodology

I am not clear what you mean by new parsing methodology. Everything I suggested was JSON, so it can easily be parsed using any JSON parsing approach / library.

I have not proposed any new schema either, everything is based on the Process Execution schema defined by OGC API - Processes - Part 1: Core, with the extensions that I had initially suggested at the OGC API Hackathon in London in June 2019 and documented in opengeospatial/NamingAuthority#47, further developed at the January 2020 ESIP / OGC sprint, explored in the Modular OGC API Workflows project in collaboration with the SWG and collaborators, and being documented in the draft Part 3.

the example in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824 is massively CWL-inspired.

My inspiration for that example was simply that I had said here that defining a base process using a command-line tool (as opposed to an already deployed local or remote process) was not initially considered for Part 3 workflows using the execution request syntax. But then I realized that this could easily be done using a WellKnownProcess, and thought I would show how this could be done, as this issue is intended to be a cross-walk exercise between CWL / OpenEO and the approach I proposed for Part 3.

I really want to point out however that in putting this example together I only had to introduce a single new element, and that is the "output" : ... which I realized was missing for the Deploy(able)Workflow class envisioned before. It was also missing from Scenario 4, and it could be added there, but I figured that in the case of a workflow defining a process which does not explicitly defines any output, it could automatically inherit the outputs of the top process it invokes. Or we could enforce the need to explicitly use "output", in which case we would need add this to Scenario 4 as well.

I encoded the command and docker requirements as "inputs" to the ExecuteCommand process, but that is really the only thing that I mapped from your CWL example. Saying that it is massively CWL-inspired is a bit of a stretch. If you find that it looks so much like CWL, then it must be a case of convergent evolution and we must all be doing something right :) That must mean that we are making progresss towards the goal of this issue!

We may need to accept that there will not be one workflow notation to rule them all, but different options. I think that the number of possiblities mentioned above here and here hint that leaving flexibility is a good thing, as do most of the OGC API specifications in regards to resource representation / content type, so I certainly feel that is the right way to go.

I think we can definitely avoid having multiple standards accomplishing the same thing

I understand that multiple ways of doing the exact same thing sometimes become an obstacle to interoperability, but I think we clearly have a different perspective on this. You have an implementation of CWL, we have an implementation of Part 3 execution request workflows. How I see it is that those execution request workflows can easily express complex workflows simply by adding these extensions to the execution request already defined in Part 1.

At the same time, it can address many use cases (Scenarios 1-6) with what we feel is really a novel approach to processing workflows that is both greatly simplified and breaking new grounds in terms of being client-driven and data-access-triggered. So far, it is not clear whether similar scenarios can be as easily addressed with CWL, or at least there are quite different design goals, which could easily justify the two approaches.

We also spent a tremendous amount of efforts and resources on researching and implementing those flexible modular workflows over the past couple years. You and others have also invested a lot of efforts in the use of CWL. We both surely have biases towards one or the other.

I think that the two approaches can co-exist without being detrimental to interoperability. It also seems that there is a simple way to convert one to the other for scenarios that both can naturally handle well.

without working against each other in this case.

We (or CWL vs. execute request workflows specifications) should definitely not be working against each other -- that is certainly not what I'm trying to do here! The goal of this issue is to cross-walk the similarities and differences in capabilities, so I am just trying to do that and try to ensure we understand each other's use cases and examples as well as possible, and clarify any misunderstandings that we may have.

Thanks!!

fmigneault commented 2 years ago

I am not clear what you mean by new parsing methodology. Everything I suggested was JSON, so it can easily be parsed using any JSON parsing approach / library.

Regarding this, I was specifically referring to https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824 sample, where command, requirements, arguments, etc. are all completely derived from CWL. A POST request with this format would force an implementation that already supports CWL perfectly, to parse the payload and rewrite the corresponding CWL that could have been POSTed in the proper format in the first place.

I have not proposed any new schema either, everything is based on the Process Execution schema defined by OGC API - Processes - Part 1: Core

I agree the proposition matches the schema from Core, but the problem I perceive with this format is that it is not parsed only as Core. There is additional steps to be taken for the service to correctly understand the content as they were intended. For example, the command parameter that is passed is not actually an input named command with a given type, so it shouldn't be in inputs at all. The "input" named argument is not itself an input, but rather lists the actual inputs (ie: fillDistance and classes) that are identified with as sub-input field. To obtain the resulting process description from that deployment schema, extra parsing steps are required, and they are prone to cause problems for existing services that could misinterpret the contents.

massively CWL-inspired is a bit of a stretch

It definitely is inspired. I can see all the fields coming from CWL and listed in executionUnit of my examples (https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1047047398), but rewritten in a different structure.

I would much prefer that you POST a deployment that resembles a real process description (using another Content-Type), and let the parsing of this content generate the equivalent CWL, rather than POST the pseudo-CWL form https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824.

As example, taking the Execution body from Scenario 5 in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1046682392, and inferring what the Workflow process description might look like from it (not perfect, just to demonstrate), the following could directly be POSTed to /processes:

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",
  "inputs" : {
     "dataset" : { "formats": [ { "mediaType": "application/x-osm-roads" } ] },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",
        "inputs" : {
           "data" :  { "formats": [ { "mediaType": "application/x-point-cloud" } ] },
           "fillDistance" : { "schema": { "type": "integer" }},
           "classes" : { "schema": { "type": "array", "items": {"type": "string", "enum":  ["roads", "sidewalks"] }}},
        },
        "outputs" : { "dsm" : { "formats": [ { "mediaType": "application/x-dms" } ] } }
     },
     [...]
  }, 
  "outputs": { [...] }
}

Or we could enforce the need to explicitly use "output", in which case we would need add this to Scenario 4 as well.

For clarifying the intent in the process description for a user that doesn't necessarily has all the knowledge about each part of the workflow, or even for another user that was not the one that deployed the process (therefore cannot even be aware of what happens behind the scene), I think this would be preferable. I am not against the Deploy body to omit the information if it can be inferred for single-output processes, but the resulting process description should state the output explicitly to avoid ambiguity (basically, to validate the server understood the submitted deployment payload correctly, à la "what you see is what you get").

You have an implementation of CWL, we have an implementation of Part 3 execution request workflows. How I see it is that those execution request workflows can easily express complex workflows simply by adding these extensions to the execution request already defined in Part 1. [...] We may need to accept that there will not be one workflow notation to rule them all, but different options [...]

Just to be clear. I have nothing against Part 3 and workflows submitted by Execution request, nor am I trying to diminish the efforts in it. As I have previously said, there are many proposals I find very interesting and that can work on their own. Where I am against is when a Part 2 deployment is involved with request contents that strongly imitates CWL recommended by Best Practices, although not actually making use of it.

I want to make it easier for developers that decide to implement both Part 2 and Part 3 to be able to reuse the same definitions whenever possible, since there is potential to reuse the same Application Package with CWL schema for both purposes. That would make the API more concise and generally easier to understand by everyone, and saves the extra work to parse different deployment payloads. I, for that matter, would need very little extra work to support Part 3 as well if Deploy uses the current CWL approach, while example https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824 would cause a lot of problems and exceptions to be handled. So, just like you said "The goal of this issue is to cross-walk the similarities and differences in capabilities", I'm pointing out these similarities I observe and that we should take advantage of, while trying to pad the lacking points from the original proposition (e.g.: adding the output specification) to ensure they work well together. My whole intention in my comments is to build a bridge between Part 2 and Part 3, not to deepen the gap between them.

jerstlouis commented 2 years ago

@fmigneault Thank you for the clarifications.

Regarding this, I was specifically referring to https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824 sample, where command, requirements, arguments, etc. are all completely derived from CWL.

The intent of that example was to share that you actually could define a base process using the Part 3/Workflows sytax, not to ask that implementations support this syntax instead of CWL. For an implementation that does not already use CWL and does not plan on supporting CWL though, it might be an interesting potential option to define base processes.

For example, the command parameter that is passed is not actually an input named command with a given type, so it shouldn't be in inputs at all.

But they are for that ExecuteCommand process! Maybe that is where some of the confusion came from... I hypothesized an ExecuteCommand well-known process that would have a ProcessDescription like this:

{
  "id" : "ExecuteCommand",
  "title" : "Command execution process",
  "version" : "1.0.0",
  "jobControlOptions": [
    "sync-execute", "workflow-collection"
  ],
  "outputTransmission" : [ "value" ],
  "description" : "This process executes a command.",
  "links" : [ {
    "href" : "https://example.com/ogcapi/processes/ExecuteCommand/execution",
    "rel" : "http://www.opengis.net/def/rel/ogc/1.0/execute",
    "title" : "Execution endpoint"
  } ],
  "inputs" : {
    "command" :
    {
      "title" : "Command",
      "description" : "The command to execute",
      "minOccurs" : 1,
      "maxOccurs" : 1,
      "schema" : { "type" : "string" }
    },
    "stdin" :
    {
      "title" : "Standard input",
      "description" : "Data to feed to the command's standard input.",
      "minOccurs" : 0,
      "maxOccurs" : 1,
      "schema" : { "type" : "string", "contentMediaType" : "application/octet-stream" }
    },
    "arguments" :
    {
      "title" : "Command arguments",
      "description" : "Arguments to feed to the command.",
      "minOccurs" : 0,
      "maxOccurs" : 1,
      "schema" : {
         "type" : "array",
         "items" :
         {
            "oneOf" : [
               { "type" : "string" },
               { "type " : "integer" },
               { "type" : "number" },
               { "type" : "boolean" }
            ]
         }
      }
    },
    "requirements" :
    {
      "title" : "Command requirements",
      "description" : "Prerequesites for executing the command",
      "minOccurs" : 0,
      "maxOccurs" : 1,
      "schema" :
      {
         "type" : "object",
         "properties" :
         {
            "docker" :
            {
               "type" : "object",
               "properties":
               {
                  "pull" : { "type" : "string" }
               }
            }
         }
      }
    }
  },
  "outputs" : {
    "stdout" :
    {
      "title" : "Standard output",
      "description" : "Standard output from the command",
      "schema" : {
         "type": "string",
         "contentMediaType": "application/octet-stream"
      }
    },
    "outFile1" :
    {
      "title" : "Output file opengeospatial/NamingAuthority#1",
      "description" : "File for returning an additional output referred to as 'outFile1'",
      "schema" : { "type": "string", "contentMediaType": "application/octet-stream" }
    },
    "outFile2" :
    {
      "title" : "Output file opengeospatial/NamingAuthority#2",
      "description" : "File for returning an additional output referred to as 'outFile2'",
      "schema" : { "type": "string", "contentMediaType": "application/octet-stream" }
    },
    "outFile3" :
    {
      "title" : "Output file opengeospatial/NamingAuthority#3",
      "description" : "File for returning an additional output referred to as 'outFile3'",
      "schema" : { "type": "string", "contentMediaType": "application/octet-stream" }
    }
  }
}

The "input" named argument is not itself an input, but rather lists the actual inputs (ie: fillDistance and classes) that are identified with as sub-input field. (ie: fillDistance and classes) that are identified with as sub-input field

The mechanism I had suggested from the start in the Part 3 Deploy(ed)Workflow conformance class for defining inputs is to use { "input" : "{inputName}" anywhere in the definition of an execution request workflow intended to be deployed as a process. I also realized we should do the same with "output" in the outputs to pick specific outputs and/or rename them.

As example, taking the Execution body from Scenario 5 in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1046682392, and inferring what the Workflow process description might look like

You are suggesting to POST a process description to define processes based on workflows, but I am suggesting to POST an execution request workflow to define processes based on workflow (regardless of whether it is posted directly to /processes or as the executionUnit of an application package).

Scenario 4 was my original example of a workflow defined with the intention of being deployed as a process (defining and naming inputs). If I adapt Scenario 5 with the intent to deploy the workflow as a process, as in your new example (it originally did not leave anything undefined to be defined as inputs externally), it would look like:

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",
  "inputs" : {
     "dataset" : { "input" : "osmDataset" },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",
        "inputs" : {
           "data" : { "input" : "pointCloud" },
           "fillDistance" : { "input" : "fillDistance" },
           "classes" : { "input" : "pcClasses" }
        },
        "outputs" : { "dsm" : { } }
     },
     "preference" : "shortest",
     "mode" : "pedestrian",
     "waypoints" : { "value" : {
        "type" : "MultiPoint",
        "coordinates" : [
           [ -71.20290940, 46.81266578 ],
           [ -71.20735275, 46.80701663 ]
        ]
     } }
   }
}

A process description would be automatically generated from that execution request workflow, with inputs: osmDataset, pointCloud, fillDistance, pcClasses, and a Route Exchange Model output (implied from the top-level RoutingEngine process). All the types of the inputs can be inferred from where they are used by considering the process description of the RoutingEngine and PCGridify processes used in the workflows.

Where I am against is when a Part 2 deployment is involved with request contents that strongly imitates CWL recommended by Best Practices, although not actually making use of it.

That really is not where I was coming from. I just wanted to show how one could define a base process that executes a command using a Part 3 execution request workflow. Of course, your example was a large part of my inspiration and it used CWL, but executing commands with arguments and requiring a docker container to be pulled is not specific to CWL :).

For clarifying the intent in the process description for a user that doesn't necessarily has all the knowledge about each part of the workflow, or even for another user that was not the one that deployed the process (therefore cannot even be aware of what happens behind the scene), I think this would be preferable.

A key part of the design of Part 3 workflow execution requests, is that for any nested remote sub-process, a processing engine can simply POST the nested { "process" : ... } block to that "process"'s execution end-point and completely ignore the deeper content, if it wishes to keep things simple. It could decide otherwise if it has some EMS-like capabilities, e.g. if it sees that the same thing is posted to different processes, it could try to orchestrate things more efficiently. Each process is required to provide a full process description, so it should be very easy to infer everything from a workflow execution request, combined with the descriptions of those processes used by the workflow.

fmigneault commented 2 years ago

@jerstlouis

For an implementation that does not already use CWL and does not plan on supporting CWL though, it might be an interesting potential option to define base processes.

Certainly, it could be one approach to use your own and custom schema/standard if your implementation wanted to do so (OGC API doesn't limit you), but I am still advising against it. I think it is not wise for OGC API - Processses to suggest users to apply this new potential option at all, because that custom implementation would always be drastically lacking documentation compared to existing solutions. I'm not sure if you have looked at CWL specification, but the sheer amount of schema and documentation needed to "only" define how to run an app and a workflow is quite daunting and surprising. OGC API - Processses is much better off leaving that portion to a dedicated specification (CWL or whichever other of your choosing), as it did with generic executionUnit, and concentrate only on the Processes description portion. I also think that proposing an half baked custom definition only for the purpose of Part 3 would go against previously mentioned recommendation to "consider supporting the OGC Application Package encoding for describing the process" if possible. Proposing a custom schema in Part 3 instead of using the one available from Part 2 creates this situation where this standard is not even making use of its own recommendations...

Maybe I'm not properly seeing the logic in your ExecuteCommand example, or at least it doesn't align with existing Deploy concepts. When a process is traditionally deployed with Part 2, you provide the inputs/outputs details on how to directly submit and retrieve data for execution of that process, and which are the same inputs/outputs returned for the process description.

Using the above generic ExecuteCommand proposal, since the inputs/outputs are what define the "internal command itself" rather than what the "internal command gets called with", you need to have nested inputs (as specified in the field you named arguments). This causes those nested inputs to be harder to define properly, since all of them are bundled under the same schema. Considering that each input in OGC API - Processes can be submitted in many different ways (raw values, bbox, object of custom schema, href, and now collection as well being proposed), this aggregated multi-input multi-schema definition becomes rapidly extremely convoluted.

Maybe you are used to working with OGC API - Processes and all those data/schema-type combinations have become very easy to understand for you, but I can tell you that even with current definitions of Core and Part 2, I often have to explain many of those subtleties to new users. I'm trying to define an easier API to employ by end users, not yet again another variant that will confuse people. Therefore, I insist on reusing Application Package format for this kind of definition as it is already exposed by Best Practices and Part 2.

[...] but I am suggesting to POST an execution request workflow to define processes based on workflow (regardless of whether it is posted directly to /processes or as the executionUnit of an application package).

I don't think it is wise to mix concepts. RESTful creation with POST /processes naturally suggests it is creating a process, not its execution, especially since POST /processes/{processId}/execution also exists. POSTing an execution request on the deployment endpoint will just cause confusion, as this whole thread demonstrated.

jerstlouis commented 2 years ago

@fmigneault

to use your own and custom schema/standard

I am not sure exactly which custom schema/standard you are referring to, but I don't believe that accurately reflects the amount of efforts and collaboration that went into the Part 3 Workflows definition based on execution requests.

OGC API - Processses is much better off leaving that portion to a dedicated specification (CWL or whichever other of your choosing)

Part 3 is (at least so far) a specification dedicated to defining workflows by extending the Part 1 execution request.

The goal of Part 3 is to specify a notation to define workflow, with a focus on integrating easily with the execution and access of data from local and remote OGC API - Processes and Collections. From the very start, one of the objectives was to enable the use of a workflow definition to define new processes (with a greater focus on leveraging local or remote processes already deployed).

the sheer amount of schema and documentation needed to "only" define how to run an app and a workflow is quite daunting and surprising.

That would be a good reason to consider the simple approach based on OGC API - Processes - Part 1: Core execution requests with the few additions proposed in Part 3.

and concentrate only on the Processes description portion

Part 1: Core already covers the process description, and as I was explaining the description is a key enabler that makes the workflow as execution requests work.

a half baked custom definition

:\ I admit that the workflow definition specification and documentation require improvements, the more complex scenarios need to be more battle-tested, and it would be good to have more implementations (though there are already multiple ones). I do realize it has not been used nearly as much as CWL.

purpose of Part 3 would go against previously mentioned recommendation to "consider supporting the OGC Application Package encoding for describing the process" if possible. Proposing a custom schema in Part 3 instead of using the one available from Part 2 creates this situation where this standard is not even making use of its own recommendations...

It sounds like we are still mixing up workflow definition (execution unit) with the deploy payload. The workflow definition is intended to be a valid execution unit for an application package (the one of my choosing). An implementation could support as a payload to POST /processes any or all of:

An application package containing a CWL execution unit along with a process description
A CWL execution unit from which one can automatically infer the process description
An application package containing a Part3 execution request workflow along with a process description
A Part 3 execution request workflow from which (and from the description of the processes it refers to) one can automatically infer the process description

It does not go against the recommendation: the Part 3 workflow definition based on an execution request is a valid executionUnit and the recommendation to support it inside an application package still stands.

It does not matter whether we are talking about base processes invoking commands, or higher level workflows invoking base local or remote processes... Just like CWL could do both, the Part 3 execution request workflows can also express both.

Maybe I'm not properly seeing the logic in your ExecuteCommand example

That might be the case... I can keep trying to explain things as clearly and simply as possible until it's crystal clear ;) If anything, it should be helpful in making the specification clearer and unambiguous! It is actually quite simple, but it might be out of the box -- I think it might be easier to understand for someone without prior experience with another approach.

When a process is traditionally deployed with Part 2, you provide the inputs/outputs details on how to directly submit and retrieve data for execution of that process, and which are the same inputs/outputs returned for the process description.

Yes, you provide that in the ProcessDescription. But you mentioned that your implementation, which also takes in CWL in the executionUnit, can actually infer most of that information from the CWL executionUnit, right? So think of the Part 3 - execution unit workflow as the equivalent of that CWL -- it is the execution unit, not the process description!

Using the above generic ExecuteCommand proposal, since the inputs/outputs are what define the "internal command itself" rather than what the "internal command gets called with", you need to have nested inputs (as specified in the field you named arguments).

The base process is not a key use case for Part 3 -- it was not in the 6 scenarios I presented above, and I had not given thought to it until this cross-walk exercise (so far we implement our processes internally). But I think what you are mentioning here applies to the Part 3 workflows regardless of whether it defines a base process (e.g. command line process with a well-known ExecuteCommand built-in process), or a workflow involving processes already deployed, and possibly even just a regular process execution? In a sense it is already a characteristic of the Part 1 description & execution request?

This causes those nested inputs to be harder to define properly, since all of them are bundled under the same schema.

I am not exactly clear on what you mean by bundled under the same schema, I am guessing you mean that the ExecuteCommand process description defines the arguments input as an array whose elements can be different things? When writing the workflow using ExecuteCommand, e.g. to define the PCGridify base process as in that example, each use of "input" (just as if it were an "href" referencing a URL) can specify its own "schema", because it can be a qualifiedInputValue, which specifies how this process (the PCGridify base process) is passing this input along to the ExecuteCommand built-in process being invoked. That then implies that the input has to be passed that way from the invoker of the process being defined by the workflow (PCGridify) and will be reflected in the resulting process description, but can be augmented by additonal ways in which the engine can accept those inputs and convert them before invoking ExecuteCommand.

Considering that each input in OGC API - Processes can be submitted in many different ways (raw values, bbox, object of custom schema, href, and now collection as well being proposed), this aggregated multi-input multi-schema definition becomes rapidly extremely convoluted.

In addition to collection, for the deployable workflow scenario, the { "input" : .... } is one more thing that an inputValue can be for these deployable workflows. However, those different things are always implied, and never described explicitly in the process description. So I don't see the convolution here, it never comes up in either the process description nor the execution requests/workflows ...

I don't think it is wise to mix concepts. RESTful creation with POST /processes naturally suggests it is creating a process, not its execution, especially since POST /processes/{processId}/execution also exists.

All of my examples POSTing to /processes were about creating a process, not executing the process that is being deployed.

But just like with POSTing an application package including a process description + an execution unit containing CWL creates a process, POSTing an application package including a process description + an execution unit containing an execution request of other processes (Part 3 - Workflow) will also create (deploy) a new process. Then a POST to that newly created process at /processes/{processId}/execution will execute the new process!

In that example defining the PCGridify command (let's call it Scenario 7), POSTing the deployable workflow defined by an execution request invoking ExecutionCommand inside an application package execution unit at /processes creates /processes/PCGridify, and then POSTing to /processes/PCGridify/execution (Scenario 5) will execute the PCGridify process.

I hope this helps make everything more clear... Thanks for spending so much efforts trying to understand all this and providing feedback.

fmigneault commented 2 years ago

I am not sure exactly which custom schema/standard you are referring to

That one specifically: https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824

I'm against the WellKnownProcess definition that doesn't add anything beneficial/new in my opinion other then over-engineering things. If Deploy is used, it should be done though Part 2 methodologies, and that's it. If you want to define a builtin process without using Deploy, and which handles these command, arguments, etc. inputs in a special way, you can. I don't think it warrants having any special treatment compared to any other process description as defined by Core. Once that process exists and can be described by the API, the way you parse the schema of your inputs for executing it under the hood is irrelevant.

Other than that, the rest of Part 3 proposals are good by themselves.

An implementation could support as a payload to POST /processes any or all of: [...]

An application package containing a Part3 execution request workflow along with a process description

A Part 3 execution request workflow from which (and from the description of the processes it refers to) one can automatically infer the process description

The issue I find with POSTing execute request contents to a deployment endpoint is that, since that workflow definition would become deployed, you wouldn't need to POST the execution as-is anymore for executing it, since the workflow doesn't need to be redefined dynamically anymore. So why even bother submitting a payload formatted as execution-schema rather then process description to describe the process to be represented? Process description schema can be more verbose, such as providing combinations of applicable schema, formats, etc. supported by inputs to better define the workflow, which can palliate some edge cases when there is some ambiguity by inferring process descriptions from only the execution input values. I believe it is better to define the processes from their actual definitions instead of leaving space for interpretation from raw values.

jerstlouis commented 2 years ago

That one specifically: https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1048109824

But that is simply an instance of the regular Part 1: Core execution request + use of the Part 3 Deploy(able)Workflow capability ("input": ...).

If Deploy is used, it should be done though Part 2 methodologies, and that's it.

But it is done through Part 2 methodology: you POST it to /processes, and you can also provide it inside an application package along with a process description.

I'm against the WellKnownProcess definition that doesn't add anything beneficial/new in my opinion other then over-engineering things. If you want to define a builtin process without using Deploy, and which handles these command, arguments, etc. inputs in a special way, you can.

The ExecuteCommand well-known process to define a base process is really just one example of using the capability to deploy workflows as processes. Scenario 4 was my original example. We can avoid discussing this scenario 7 example and consider scenario 4 instead if you really don't like this ExecuteCommand approach. This is certainly more appealing to me than CWL, since I don't have a CWL implementation, and bringing in a complex third-party library is what I would consider over-engineering for our purposes.

I don't think it warrants having any special treatment compared to any other process description as defined by Core.

But there is no special treatment at all of process descriptions! The concept of well-known process is that a client recognizing a well-known process can expect a particular process description (as defined in Core). An example of that is the Processes profile of OGC API - Routes. Another example would be if we standardize something like our RenderMap process. Another example is the CoverageProcessor I mentioned in Scenario 4, which understands expressions to define bands (Scenario 4). Well-known processes can work with Part 1 alone.

The issue I find with POSTing execute request contents to a deployment endpoint is that, since that workflow definition would become deployed, you wouldn't need to POST the execution as-is anymore for executing it, since the workflow doesn't need to be redefined dynamically anymore.

After a workflow has been deployed as a process, you can forget everything inside that workflow, and the resulting process will have its own description based on the inputs and outputs that this workflow defined. Then when you execute that process, you submit an execution request based on the workflow-defined process. That works exactly the same if you defined that workflow process using CWL instead of those Part 3 execution request. But you can also define ad-hoc workflows using multiple processes with Part 3.

Basically, Part 3 workflows can be used either to:

deploy a workflow as a process (in which case you will normally leave some inputs to be defined, with "input": ...) (POST /processes) (can be inside an application package), or
execute it in an ad-hoc manner, either requesting to get back a dataset or collection, or execute it sync or async to get back results (POST /processes/{processId}/execution), or
create a persistent virtual collection from it (which appears as a regular collection, but is always using the latest data, and potentially provides a link to the source workflow for re-usability / reproducibility) (POST /collections).

It's possible that CWL could also work for all of these things, so it could be a different Content-Type for those POST operations (other than when using an application package which has its own Content-Type and mechanism to specify the executionUnit type).

So why even bother submitting a payload formatted as execution-schema rather then process description to describe the process to be represented?

Because a process description cannot define a workflow! It cannot refer to processes to use as part of the workflows, and it can only define the types of inputs and outputs, not the fixed parameters or actual data sources to use, e.g. specific data sources which are not to be specified as an input. I am quite confused why you seem to think that all you need to define a process is its description. Actually, this is what really bothers me about Part 2 right now: it talks about POSTing a process description to deploy a process, instead of actually posting the process itself (which to me is defined by the execution unit, not the description! -- the description is just metadata for that process, a lot of which can often be inferred from the execution unit).

To me, the most important part of the application package, is the execution unit. That is the process definition. The description is metadata about the process (it describes the process).

Process description schema can be more verbose, such as providing combinations of applicable schema, formats, etc. supported by inputs to better define the workflow, which can palliate some edge cases when there is some ambiguity by inferring process descriptions from only the execution input values. I believe it is better to define the processes from their actual definitions instead of leaving space for interpretation from raw values.

You could still provide a process description alongside the execution request (or the CWL) that defines the workflow. This way you could avoid using schema inside the execution request, and define the types in the process description. But you would still need the execution request as well to specify the processes to use and wire all the inputs and specify any fixed inputs to those internal processes. It is also possible to specify multiple schemas etc. directly in the qualifiedInputValue of the execution requests inputs, e.g. with oneOf, though that might make less sense.

jerstlouis commented 2 years ago

The best way to understand the ideas behind Part 3 - Workflows is to picture that it starts from the output(s) that the workflow generates (because it allows to be client-driven, i.e. the processing is triggered when the end-user requests output data), and wires in inputs, much like a real audio stack or Reason's: stack

Each process (including the top-level one) is just like a synthesizer or effect machine with inputs and outputs.

When you deploy a workflow, you're building a blackbox machine with a bunch of other machines inside.

The { "input": ... } fields connects the outside inputs of that blackbox to inputs of machines inside the blackbox (therefore the input types should be exactly the same, hence why they can be inferred from that inner machine's description).

Very interestingly, one could actually implement an actual audio stack with Part 3 workflow, where a base input process is a MIDI receiver hooked onto an electric piano, and the client streams from the output of the workflow to speakers. This would enable musicians to jam live in the metaverse ;)

fmigneault commented 2 years ago

@jerstlouis

But there is no special treatment at all of process descriptions! The concept of well-known process is that a client recognizing a well-known process can expect a particular process description (as defined in Core). An example of that is the Processes profile of OGC API - Routes. Another example would be if we standardize something like our RenderMap process. Another example is the CoverageProcessor I mentioned in Scenario 4, which understands expressions to define bands (Scenario 4). Well-known processes can work with Part 1 alone.

That was my point. Since everything in ExecuteCommand can be defined using typical Core definition, it doesn't make this process any more special than any other process such as RenderMap, and therefore, shouldn't be added in Part 3. The suggested WellKnownProcess is only a specialization of supported definitions by Core, and OGC API - Processes doesn't need to "standardize" or mention it at all. It doesn't add new functionalities.

Because a process description cannot define a workflow! It cannot refer to processes to use as part of the workflows, and it can only define the types of inputs and outputs, not the fixed parameters or actual data sources to use, e.g. specific data sources which are not to be specified as an input. I am quite confused why you seem to think that all you need to define a process is its description.

I have to disagree on that. I already demonstrated how this is perfectly doable in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1050091683 (snippet below). Using nested processes for inputs of the parent process could totally define a process workflow.

{
  "process" : "https://example.com/ogcapi/processes/RoutingEngine",
  "inputs" : {
     "dataset" : { "formats": [ { "mediaType": "application/x-osm-roads" } ] },
     "elevationModel" :
     {
        "process" : "https://example.com/ogcapi/processes/PCGridify",
        "inputs" : {
           "data" :  { "formats": [ { "mediaType": "application/x-point-cloud" } ] },
           "fillDistance" : { "schema": { "type": "integer" }},
           "classes" : { "schema": { "type": "array", "items": {"type": "string", "enum":  ["roads", "sidewalks"] }}},
        },
        "outputs" : { "dsm" : { "formats": [ { "mediaType": "application/x-dms" } ] } }
     },
     [...]
  }, 
  "outputs": { [...] }
}

Using process definitions to create viable workflows is the typical methodology employed for defining how processing graph workflows should chain inputs and outputs together. It is not required to have the execute values. Workflows can work with the inputs/outputs definitions, types, media-type formats, schema and cardinality. Without having any input value, it is possible to verify this way if a workflow even makes sense, that there is concordance between its step inputs/outputs.

Even in your case, where you define a workflow based on values, what really defines the workflow chain and allows its execution to work is if the produced output data has types and format that matches what is expected for the next process inputs in order to pass it around. The only thing you do differently, is that you infer those types/formats dynamically while processing the steps, rather than pre-resolving them using process descriptions. I would argue that this is more prone to errors than having fully defined processes and workflows that can pre-validate if chaining works, as resolving dynamically could cause errors to only be detected very late down the execution chain and can waste computing resources, rather then early detection of this issue from the definition, but this is another debate. The fact remains that if following processes in the chain cannot accept the results of a previous one, the execution will fail, and what dictates which inputs/outputs a process can work with or not is its description.

jerstlouis commented 2 years ago

@fmigneault

The suggested WellKnownProcess is only a specialization of supported definitions by Core, and OGC API - Processes doesn't need to "standardize" or mention it at all. It doesn't add new functionalities.

Correct, I was not suggesting that ExecuteCommand be defined by Part 3. But it is special in that it may be defined somewhere at some point, whether a later part or a registry of well known processes, like OGC API - Routes may eventually define a Routing well known process, or Maps may eventually define RenderMap.

Using nested processes for inputs of the parent process could totally define a process workflow.

But nested processes were never suggested to be added to a process description's inputs. In the description, you should not care where the data come from, whether it is a process or an embedded file or an OGC API collection. Part 3 as we had proposed it does not get rid of the process description at all, but the target of the extension is the execution request, not the description.

To me, it's really twisting the meaning of the process description to extend it this way. Just like CWL, the workflow goes in the execution unit, and is based on the execution request. And just like CWL, a process description can still be provided alongside the execution unit, but not mixing the two.

The execution unit defines the process, while the process description describes the process.

It is not required to have the execute values. Workflows can work with the inputs/outputs definitions, types, media-type formats, schema and cardinality. Without having any input value, it is possible to verify this way if a workflow even makes sense, that there is concordance between its step inputs/outputs.

In full agreement with this.

The only thing you do differently, is that you infer those types/formats dynamically while processing the steps, rather than pre-resolving them using process descriptions.

The idea is that each hop will resolve it using the process description of each individual process, and also consider the other OGC API capabilities in the case of collection input and/or output. This would all be done before any processing begins. In the case of a deployed workflow, that would be done at the time of deployment, not execution.

Even in your case, where you define a workflow based on values,

Values are either plugged in to the deployable workflow's undefined inputs ("input" : ...), whose type could possiby be defined in the process description of the overall workflow (but was conveniently done directly next to "input" : ... in my example), or are fixed values that are hardcoded for the purpose of that workflow.

I would argue that this is more prone to errors than having fully defined processes and workflows that can pre-validate if chaining works, as resolving dynamically could cause errors to only be detected very late down the execution chain and can waste computing resources, rather then early detection of this issue from the definition, but this is another debate.

The pre-validation step is actually a key thing that we considered. With the collection / dataset output, the idea is that you can pre-validate the whole workflow before doing any processing at all. I also suggested that this be supported for Core as well in opengeospatial/NamingAuthority#101. Even without this capability, each hop would still be able to validate the next hop from its process description before any processing is done. The full validation should go a step further and guarantee that whole workflow will actually work in practice.

The fact remains that if following processes in the chain cannot accept the results of a previous one, the execution will fail, and what dictates which inputs/outputs a process can work with or not is its description.

Each process used in a workflow has a process description, so that can all be fully checked with the approach we proposed.

fmigneault commented 2 years ago

@jerstlouis

But nested processes were never suggested to be added to a process description's inputs.

My example only showed how a nested definition could be used to deploy a process that resembles the proposed Execution body. The resulting workflow process could be represented in a totally different manner, such as using CWL, but that is also left to the implementer regarding how they support Part 2.

In the description, you should not care where the data come from, whether it is a process or an embedded file or an OGC API collection. Part 3 as we had proposed it does not get rid of the process description at all, but the target of the extension is the execution request, not the description.

I don't think I ever insinuated that. On the contrary, I defend that indeed workflows can be entirely defined purely with process descriptions. Therefore, I don't need to know whether process 1 and process 2 I/O are passed as raw data, file or collection, only that they agree with each other. This is why my propositions are all based around deploy/describe process, since the only place values are actually involved is during execution, which are irrelevant to define the workflow as long as data types are specified.

I think Part 3 should be renamed Execution Workflows to be more specific, since "Workflows" itself is too generic, and brings a lot of confusion into systems that can fully define workflows already. There could then be another extension called Deployable Workflows or *Workflow Description, without causing ambiguity with terms employed by the standard. I can see both solutions (deploy+describe Workflow, then execute it) and (direct execute with dynamic descriptions) as valid approaches, but they inevitably clash since they operate in opposite manners.

Inferring the "workflow runtime definition" from the execution payload remains a problem IMO for OGC API - Processes since following the execution, there is no available GET request to obtain a definition of the full process of that complete workflow.

jerstlouis commented 2 years ago

@fmigneault

This is why my propositions are all based around deploy/describe process,

To me, to deploy and to describe a process are completely different things, and in the current text of Part 2 right now this distinction is not made clear enough, in my opinion.

With OGC API - Processes - Part 1: Core:

A GET to /processes/{processId} describes the process (with a process description)
A POST to /processes/{processId}/execution executes the process (with a process execution request)

With the Part 2 & CWL / Application Package best practices:

the processDescription describes the process resulting from the workflow,
the executionUnit defines the process (by defining the workflow behind the process, e.g. using CWL), and
the Part 2 capability to POST the application package to /processes deploys this process.

The distinction between description and definition is fundamental and very much akin to a function prototype/declaration vs. a function definition with its actual code/body/implementation.

I feel like we might not be in full agreement about this yet before even considering deployable workflows, which makes it difficult to be on the same page, as deployable workflows really sit on the fence between Parts 2 & 3.

Executing processes is what workflows do, so to me the closest construct we have to define the process is the process execution request schema (which of course does not execute the process being defined itself, but the internal processes used to define it).

Therefore, I don't need to know whether process 1 and process 2 I/O are passed as raw data, file or collection, only that they agree with each other.

In full agreement with that.

the only place values are actually involved is during execution, which are irrelevant to define the workflow as long as data types are specified.

That is not true if as part of the workflow definition, any value originates from a fixed HREF or OGC API Collection or is hard-coded as a literal, constant for that specific workflow.

I think Part 3 should be renamed Execution Workflows to be more specific, since "Workflows" itself is too generic, and brings a lot of confusion into systems that can fully define workflows already.

I believe the premise of this issue that CWL, OpenEO and Part 3 workflows are three different ways to define a workflow is good, and I would support having a conformance class for each as the payload of the executionUnit, whether it is executed ad-hoc by POSTing the workflow to a /execution end-point, or deployed with Part 2 (whether inside an application package execution unit, or directly). Aren't CWL, OpenEO and Part 3 workflows all executing processes?

There could then be another extension called Deployable Workflows or Workflow Description, without causing ambiguity with terms employed by the standard.

"Deployable Workflows" was suggested as a separate capability employing both the Part 2 ability to deploy a process and the Part 3 mechanism to define a workflow that wires inputs and outputs of the resulting process to inputs and outputs of processes used internally within the workflow.

I am not sure what you mean by Workflow Description -- it could be interpreted either:

as the description of the resulting process, e.g. the process description you get at /processes/testWorkflow after POSTing an application package containing a process description of testWorkflow + a CWL execution unit for it, or
as the actual full definition of the workflow (the CWL execution unit)

I can see both solutions (deploy+describe Workflow, then execute it)

It seems like here you use describe in the sense of what I call define.

and (direct execute with dynamic descriptions) as valid approaches,

I agree that direct ad-hoc execution of workflow (at an execution end-point), and define+deploy and then execute (whether with Part 1 or as part of an ad-hoc execution again) both make sense.

but they inevitably clash since they operate in opposite manners.

I don't find that they clash or operate that differently at all -- they actually complement each other quite nicely. Deploying a workflow, regardless of whether you define it with CWL, OpenEO or Part 3 deployable workflows based on Part 1 execution request, creates a new process, and you can then use that deployed process, just like any other process, either:

directly with Part 1,
in Part 3 ad-hoc workflows (possibly with CWL, OpenEO or Part 3 execution request workflows extension i.e. nested process),
as part of another deployable workflow to be defined (possibly with CWL, OpenEO or Part 3 execution request workflows extension i.e. nested process + inputs/outputs wiring capability "input" : ...)

Being able to use the exact same basic execution request syntax / schemas in all three of those cases is why I am so convinced that execution requests are perfect for defining workflows (whether they are deployed or not).

Inferring the "workflow runtime definition" from the execution payload remains a problem IMO for OGC API - Processes since following the execution, there is no available GET request to obtain a definition of the full process of that complete workflow.

With deployed workflows defined using the extended execution requesting syntax proposed in Part 3, you can:

Request a process description of the resulting process (clients not having to care about the workflow behind it),
Optionally allow to retrieve that workflow definition behind the process (the extended execution request used to define the workflow), e.g. to tweak it and deploy a modified version, or re-use it elsewhere

The same applies if CWL is used to define the workflow instead. In that case the workflow definition in CWL (i.e. the apppkg executionUnit) is that could be returned. The implementation could also support automatically translating between CWL / Part 3 execution request / OpenEO so that you could submit a workflow in one and retrieve it in another workflow definition language.

pvretano commented 2 years ago

@jerstlouis you said ...

"To me, to deploy and to describe a process are completely different things, and in the current text of Part 2 right now this distinction is not made clear enough, in my opinion."

I am not exactly sure how one could be confused about the difference between deploying a process (i.e. adding it to the API) and describing a process (i.e. getting its current definition) but can you please be a little more specific about where the wording is ambiguous so that I can tighten it up!?

fmigneault commented 2 years ago

@jerstlouis

I am also not sure to see where there is confusion between the deployment and description portions. Similar to a feature catalog, or pretty much any REST API that has POST/GET requests revolving around "some object", the data that you want to be retrieved by GET will be strongly similar to the POSTed one. Obviously, the POST information can be extended or modified, so the GET data is not necessarily 100% the same as the POST contents, but they share a lot of similarities by definition. They are very different operations, but I wouldn't say they are completely different things as one strongly depends on the other (the POSTed process will be the one eventually described).

Whenever I mentioned Deployable/Execution Workflows or Workflow Description, I am using the same nomenclature as defined in Core and Part 2. In other words, some schema that allows the user to POST nested processes on /processes to deploy a "workflow", which can then be described on GET /processes/{processId}, and executed with POST /processes/{processId}/execution. In a way, you can see it as a "Process that just so happens to define a workflow chain" instead of a single operation. A deployed/described workflow would essentially be the same as any other atomic process when executed, since there is no need to redefine the "workflow chain". On the other hand, an Execution Workflow (i.e.: Part 3) that is directly POSTed on the execute endpoint (without prior deploy/describe requests), does not have its "workflow chain" defined yet. It is resolved dynamically when processing the execution contents.

Because they work in different manner, I believe it is important to avoid an ambiguity in the terminology.

Executing processes is what workflows do, so to me the closest construct we have to define the process is the process execution request schema

Regardless of Deployable/Description/Execution Workflow, yes, the process execution will apply the "workflow chain", but where this chain of processes comes from depends on its definition. Maybe on the Execution Workflow side, the closest construct would be nested processes with input values/constants, but I argue that it is possible to deploy and describe that "parent construct" that defines the "workflow chain" itself, just like any other atomic process, and as presented with my original sample (WorkflowStageCopyImages) and the following with https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1047047398 (3rd code) sample.

I continue to defend that defining a "workflow chain" (NOT an Execution Workflow) is definitely possible without any data value. The only thing required is data types, and how they connect to each other. Just like it is the case with current Core and Part 2, data values are involved only when POSTing on the execution endpoint, and everything else before execution is involved works in terms of types/schema.

I see the resolution of "workflows" as follows:

↓ (1) deploy that provides one way or another details with process chain: 

[workflow deployment] process1.output ("data-type") -> process2.input ("data-type")

↓ (2) described as 

[workflow definition] process1.output ("data-type") -> process2.input ("data-type")

↓ (3) executed with values of various sources

[workflow execution] process1.output (raw value typed "data-type") -> process2.input (raw value typed "data-type")
[workflow execution] process1.output (url href to typed "data-type") -> process2.input (url href to typed "data-type")
[workflow execution] process1.output (collection of typed "data-type") -> process2.input (collection of typed "data-type")
...

Part 3 proposal simply accomplishes steps (2) and (3) at the same time to negotiate how to pass around the data values, but my opinion is that it is possible to do those steps separately, allowing users to review and offer a process description that consists of a "workflow chain" before executing it.

I do not believe deploying or describing workflows inferred directly and automatically from the execution contents is a good idea. I can see many potential ambiguities that would be troublesome regarding how to correctly resolve the steps. I also find that POSTing the "workflow chain" each time on the execution endpoint doesn't align with deploy/describe concepts. The whole point of deploy is to persist the process definition and reuse it. Part 3 redefines the workflow dynamically for each execution request, requiring undeploy/re-deploy or replace each time, to make it work with Part 2. Alternatively, if undeploy/re-deploy/replace is not done each time, and that the "workflow chain" remains persisted, then why bother re-POSTing it again as in Part 3 instead of simply re-using the persisted definition? They are not complementary on that aspect.

Being able to use the exact same basic execution request syntax / schemas in all three of those cases is why I am so convinced that execution requests are perfect for defining workflows (whether they are deployed or not).

It is not exactly the same though. For Execution Workflow, we need to add more details such as the outputs in the nested process to tell which one to bubble up to the parent process input. It is not a "big change", but still a difference. A pre-deployed/described Workflow would not need this information, since all details regarding the "workflow chain" already exist. Only in that case, the execution request is exactly the same syntax as for any process.

Request a process description of the resulting process (clients not having to care about the workflow behind it)

This is the fundamental problem I have with POSTing an Execution Workflow. Clients and service providers could care a lot about the workflow behind it. Understanding where final output data comes from, how it was processed, why the processing costs X, how to improve the pipeline, etc. are all details that are much easier to explain when an explicit description illustrates the full "workflow chain". I would argue this is in fact one of the most important aspects for understanding data provenance.

Optionally allow to retrieve that workflow definition behind the process

I have not seen anything regarding that. Maybe you can provide a reference? From my understanding, once the Execution Workflow is POSTed, the result obtained as output is the same as when running an atomic process. Considering that, the workflow definition is effectively lost in the engine that applies the "workflow chain". The only workaround to this would be to deploy that workflow before executing it, but again, this poses a lot more problems as previously mentioned.

jerstlouis commented 2 years ago

@pvretano

I am not exactly sure how one could be confused about the difference between deploying a process (i.e. adding it to the API) and describing a process (i.e. getting its current definition)

The ambiguity seems to be between the definition vs. description of a process, as in that statement you made right there.

Using my understanding of those terms, when deploying a process using an OGC application package, a description is provided in the processDescription field, whereas the definition is provided in the executionUnit field.

When retrieving the process description (GET /processes/{processId}, it is the process description that is returned, not its definition (e.g. the CWL executionUnit).

Optionally being able to retrieve the definition of a process as well makes sense if you want to allow users to re-use and adapt a particular workflow, but that would be a different operation (e.g. GET /processes/{processId}/workflow).

I had initially suggested this capability for a workflow deployed as a persistent virtual collection, but it applies to a workflow deployed as a process as well):

A collection document resulting from a workflow may expose its source process (workflow) execution document.

About:

but can you please be a little more specific about where the wording is ambiguous so that I can tighten it up!?

First I want to point out that the README gets it perfectly right:

This extension provides the ability to deploy, replace and undeploy processes, using an OGC Application Package definition containing the execution unit instructions for running deployed processes.

and so does the HTTP PUT description:

The HTTP PUT method is used to replace the definition of a previously, dynamically added processes that is accessible via the API.

Right below is where it gets muddied:

This extension does not mandate that a specific processes description language or vocabulary be used. However, in order to promote interoperability,

this extension defines a conformance class, OGC Application Package, that defines a formal process description language encoded using

The OGC Application Package includes BOTH a description and a definition (called executionUnit). My argument is that the executionUnit is the most important piece and as a whole the package should be considered a definition, as in the README and the PUT description. That is because you could often infer most or all of the description from the executionUnit.

Also in the ASCII sequence diagram below:

Body contains a formal description of the process to add (e.g. OGC Application Package)

and the other one.

Note that as we discussed previously, a per-process OpenAPI description of a process would make a lot of sense for Part 1 (e.g. GET /proceses/{processId}?f=oas30) . Such an OpenAPI document would describe the process to be able to execute it, but does not define the process or the workflow behind it in any way. So it's really important to clearly distinguish between description and definition.

fmigneault commented 2 years ago

@jerstlouis

Optionally being able to retrieve the definition of a process as well makes sense if you want to allow users to re-use and adapt a particular workflow, but that would be a different operation (e.g. GET /processes/{processId}/workflow).

I agree with you on this. For our implementation, we actually use GET /processes/{processId}/package to make it generic since it is not always a workflow. Process description and definition are indeed retrieved by separate requests.

I think the processDescription in the deployment payload is adequately named, as it is intended only for the process description returned later by GET /processes/{processId} (extended with some platform specific metadata).

This extension does not mandate that a specific processes description language or vocabulary be used. However, in order to promote interoperability,

Instead of even saying process definition for that sentence, I suggest to explicitly use execution unit definition to avoid the possible description/definition confusion altogether. It is only the execution unit (CWL, etc.) that can be anything.

jerstlouis commented 2 years ago

@fmigneault

I am also not sure to see where there is confusion between the deployment and description portions.

I've clarified above that the ambiguity is between description and definition.

Similar to a feature catalog, or pretty much any REST API that has POST/GET requests revolving around "some object", the data that you want to be retrieved by GET will be strongly similar to the POSTed one.

Unfortunately here we currently have a clear mismatch between the POST/PUT and the GET. The GET returns a description, whereas the POST/PUT provide a definition. For example with ogcapppkg/CWL, the GET returns only the processDescription field, whereas the PUT includes both the processDescription and the executionUnit (CWL).

I wouldn't say they are completely different things as one strongly depends on the other (the POSTed process will be the one eventually described).

The GET does describe the definition of what was POSTed, but it is a description, which is fundamentally different from the definition.

Whenever I mentioned Deployable/Execution Workflows or Workflow Description, I am using the same nomenclature as defined in Core and Part 2. In other words, some schema that allows the user to POST nested processes on /processes to deploy a "workflow", which can then be described on GET /processes/{processId}, and executed with POST /processes/{processId}/execution. In a way, you can see it as a "Process that just so happens to define a workflow chain" instead of a single operation.

I would like to avoid using the word description to refer to this and call that a definition, to avoid confusion with the process description returned by GET /processes/{processid} (which does not include the executionUnit), as I was suggesting to @pvretano that we change those instances where the word description is used in Part 2 to definition.

On the other hand, an Execution Workflow (i.e.: Part 3) that is directly POSTed on the execute endpoint (without prior deploy/describe requests), does not have its "workflow chain" defined yet. It is resolved dynamically when processing the execution contents.

I don't understand why you say that the Part 3 execution workflow does not have its workflow chain defined yet? How I understand it is that the Part 3 execution request workflow is the workflow chain. Some detailed aspects of it are resolved dynamically as part of the ad-hoc execution or deployment of the workflow (e.g. format & API negotiation for data exchange a particular hop), but the overrall chain is already defined.

I have not seen anything regarding that. Maybe you can provide a reference? From my understanding, once the Execution Workflow is POSTed, the result obtained as output is the same as when running an atomic process. Considering that, the workflow definition is effectively lost in the engine that applies the "workflow chain". The only workaround to this would be to deploy that workflow before executing it, but again, this poses a lot more problems as previously mentioned.

I would point out that this also applies to the CWL included in the application packages execution unit. A GET /processes/{processId} does not return the executionUnit/CWL, only the description. As I mentioned above, I had originally suggested that a workflow may be exposed for a persistent virtual collection (example), but it also makes perfect sense for workflows deployed as a process as well, e.g. GET /processes/{processId}/workflow (which could return the CWL or the Part 3 execution request workflow).

I will try to find time to address the other points you touched on in the message, but it's a busy week ;)

pvretano commented 2 years ago

@jerstlouis yup, you are right. I'll clean up the wording a bit. I think the correct statement is that you POST a "description" of a process (i.e an application package that includes ithe processes' definition) to the /processes endpoint and you GET the definition of a process from the /processes/{processId} endpoint. There is currently no way through the API to get the "description" (i.e. the application package) of a process but perhaps the endpoint @fmigneault proposed (/processes/{processId}/package) would suffice ... or maybe (/packages/{processId}) would be better.