opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

Part 3 Deployable Workflows - Analysis and Proposals #431

Open fmigneault opened 2 months ago

fmigneault commented 2 months ago

Part 3: Deployable Workflows proposes an alternate deployment definition based on an execution body, trying to bridge Parts 1, 2 and 3. I would like to validate my understanding of it, and propose adjustments to improve alignment (as applicable).

Since there are two variants for deployment, each is analyzed separately, using an equivalent workflow example.

Variant 1: Direct Deployment with Execution Body

Analysis

POST /processes
Content-Type: application/ogc-workflow+json

{
  "id": "DeployWorkflow",
  "version": "1.0",
  "process": "https://example.com/proceses/MainProcess",
  "inputs": {
    "main-in": {
      "process": "https://example.com/proceses/NestedProcess",
      "inputs": {
        "arg": { "$input": "wf-input" }
      }
    }
  },
  "outputs": {
    "out": { "$output": "wf-output" }
  }
} 

A new process named DeployWorkflow, with input wf-input and output wf-output, would be created. The schema of wf-input would be the same as that of arg from NestedProcess, whereas the schema of wf-output would be equivalent to that of out from MainProcess.
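
To illustrate the expected result, here is a minimal sketch of the process description that GET /processes/DeployWorkflow could return after such a deployment (the string schemas are purely illustrative placeholders for whatever arg of NestedProcess and out of MainProcess actually define):

{
  "id": "DeployWorkflow",
  "version": "1.0",
  "inputs": {
    "wf-input": {
      "schema": { "type": "string" }
    }
  },
  "outputs": {
    "wf-output": {
      "schema": { "type": "string" }
    }
  }
}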

Proposals

  1. Add an id field, which is not present in processes-workflows/execute-workflows.yaml.

    • It is understandable that id was missing, considering it is not required for execution alone. However, some process ID is needed to perform the deployment.
  2. Alternatively to id, reuse the ?w=<id> query parameter (https://github.com/opengeospatial/ogcapi-processes/blob/master/openapi/parameters/processes-dru/w-param.yaml); see the sketch after this list.

  3. The other required parameter from Part 1, version, also needs to be added. Since there is no equivalent query parameter, it might be better to have the Part 3: Deployable Workflows schema be a oneOf[ process-core/processSummary, processes-workflows/execute-workflows ]

    This should be added to the OpenAPI path /processes.

  4. Introduce application/ogc-workflow+json (or some equivalent) to distinguish from other deployment structures already supported (CWL, OGC App Pkg, etc.).

  5. This variant doesn't indicate how additional metadata for the resolved wf-input and wf-output can be defined. Recommendations to add to the document, either:

    • Consider this acceptable, meaning that they copy entirely what arg/out defined, nothing more, nothing less.
    • Allow additional properties to be indicated next to $input/$output to extend/override what arg/out provide (see the sketch after this list).
    • Recommend to use Variant 2 instead for this use case.
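
As a combined sketch of proposals 2 and 5 (second bullet), the following hypothetical request deploys the workflow through the ?w=<id> query parameter instead of an id field, and extends $input with overriding metadata; the title and schema values shown next to $input are illustrative assumptions, not part of the current draft:

POST /processes?w=DeployWorkflow
Content-Type: application/ogc-workflow+json

{
  "version": "1.0",
  "process": "https://example.com/processes/MainProcess",
  "inputs": {
    "main-in": {
      "process": "https://example.com/processes/NestedProcess",
      "inputs": {
        "arg": {
          "$input": "wf-input",
          "title": "Workflow Input",
          "schema": { "type": "string" }
        }
      }
    }
  },
  "outputs": {
    "out": { "$output": "wf-output" }
  }
}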

Variant 2: Embedded Deployment of Execution Body in Execution Unit

Analysis

POST /processes
Content-Type: application/ogcapppkg+json

{
  "processDescription": {
    "id": "DeployWorkflow",
    "version": "1.0"
  },
  "executionUnit": {
    "format": { "mediaType": "application/ogc-workflow+json" },
    "value": {
    "process": "https://example.com/proceses/MainProcess",
    "inputs": {
      "main-in": {
        "process": "https://example.com/proceses/NestedProcess",
        "inputs": {
          "arg": { "$input": "wf-input" }
        }
      }
    },
    "outputs": {
      "out": { "$output": "wf-output" }
    }
    }
  }
}

Proposals

  1. Because wf-input/arg and wf-output/out schemas should be aligned to be mapped correctly, redefining inputs and outputs with explicit schemas in processDescription is redundant. However, this would not be disallowed according to processes-core/process.yaml.

    • Recommendations should be given in the standard document about this case.

      More specifically, processDescription.inputs and processDescription.outputs could be relevant to provide additional details, such as the process-core/descriptionType.yaml metadata properties. However, adding any inputs/outputs there would fail validation if schema is omitted, since it is required in their definitions. Because of this, we end up going back to the redundant schema definitions mentioned above.

    Possible recommendations:

    1. Use

      {
        "inputs": {
          "wf-input": {
            "title": "Workflow Input",
            "schema": {}
          }
        }
      }

      And indicate that the schema should be inferred from the $input reference in this deployment use case.

    2. Recommend explicitly referencing the schema:

      {
        "inputs": {
          "wf-input": {
            "title": "Workflow Input",
            "schema": { "$ref": "https://example.com/processes/NestedProcess#/inputs/arg/schema" }
          }
        }
      }
  2. If Part 3: Field Modifiers are thrown into the mix of Deployable Workflows, notably for wf-input and wf-output, then the schema mapping between wf-input/arg and wf-output/out could actually differ entirely.

    In this case, contrary to the previous point (1), schema under processDescription.inputs and processDescription.outputs could become mandatory. This is because, without any reference schema from DeployWorkflow (yet to be deployed), the workflow could not be validated if they were omitted, since there would be no indication of the intended source and desired result for the field-modified wf-input/wf-output (see the sketch after this list).

  3. Improve the description of Part 3: Deployable Workflows regarding the media type. The requirement mentions using application/ogcapppkg+json, but this can easily be confused with the case where processes-dru/executionUnit.yaml is employed directly. When an embedded execution unit definition is used, it is preferable to employ the qualified value with application/ogc-workflow+json to avoid ambiguity about the package contents (or use Variant 1 directly instead).
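
To make the field-modifier case of proposal 2 concrete, a deployment could look as follows. This is a hypothetical sketch: the filter expression mimics the Part 3 field-modifier style, and the point-with-custom.json schema URL is an assumption standing in for the user-defined result schema:

POST /processes
Content-Type: application/ogcapppkg+json

{
  "processDescription": {
    "id": "DeployWorkflow",
    "version": "1.0",
    "outputs": {
      "wf-output": {
        "title": "Filtered Point Features",
        "schema": { "$ref": "https://example.com/schemas/point-with-custom.json" }
      }
    }
  },
  "executionUnit": {
    "format": { "mediaType": "application/ogc-workflow+json" },
    "value": {
      "process": "https://example.com/processes/MainProcess",
      "inputs": {
        "main-in": {
          "process": "https://example.com/processes/NestedProcess",
          "filter": "type = 'Point'",
          "inputs": {
            "arg": { "$input": "wf-input" }
          }
        }
      },
      "outputs": {
        "out": { "$output": "wf-output" }
      }
    }
  }
}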

jerstlouis commented 2 months ago

Thanks for looking into this @fmigneault .

In general, I think your understanding of how this is proposed to work is correct.

I was a bit confused at first when you talked about 2 different variants, but yes the intent was that:

I thought we already had an id in there for deployment. An alternative would be to support a PUT to /processes/{processId} to create the resource (in addition to replacing), or possibly request headers for additional metadata.

The other required parameter from Part 1, version, also needs to be added.

This is required in the process description. Potentially, if it is not there, the server could automatically version it... but yes, it would make sense to be able to specify a version there directly.

This variant doesn't indicate how additional metadata

Which additional metadata are we missing at the input / output level?

And indicate that the schema should be inferred from the $input reference in this deployment use case.

Is that because schema is required? In this case, if that's allowed, I would suggest using null. I would prefer that over the $ref.

In this case, contrary to the previous point (1), schema under processDescription.inputs and processDescription.outputs could become mandatory.

Only the derived fields / field selector modifiers (properties=) would modify the schema, and it is very clear what the resulting fields are in this case: a subset of the fields are returned, or new fields are computed from the existing ones. Since you have the schema of all the existing fields, you can also easily infer the result of applying operation(s) on them (assuming of course that the processing engine understands and parses the CQL2 expressions, even if it passes them on to the remote server).

The requirement mentions using application/ogcapppkg+json, but this can easily be confused with the case where processes-dru/executionUnit.yaml is employed directly. When an embedded execution unit definition is used, it is preferable to employ the qualified value with application/ogc-workflow+json to avoid ambiguity about the package contents (or use Variant 1 directly instead).

I am confused. The mention of using application/ogcapppkg+json refers to using DRU with the "OGC Application Package", where the content of the execution unit is the application/ogc-workflow+json workflow. My understanding of application/ogcapppkg+json is that it is agnostic of the execution unit content -- not limited to CWL or anything in particular. What am I missing?

No objection to using application/ogc-workflow+json, but these are the media types currently suggested in the spec:

The exec req suggests we could align this with the media type for POST to /execution.

pvretano commented 2 months ago

For what it's worth ... I prefer proposal 2 since it makes a workflow just another execution unit. Not special or different from any other execution unit.

I am not sure I understand all the contortions about wf-input and wf-output, but it seems that there is a desire to reuse some parts of the process description ... specifically the metadata portions ... and the fact that schema is mandatory gums that up, because then you have to "duplicate" the schema already expressed in the execution unit.

To that I would say that we make "schema" optional in the process description. If the schema of an input/output can be inferred from the execution unit, then you don't need to include a schema for that input/output in the process description. If the schema of the input/output cannot be inferred from the execution unit, then schema is mandatory. That way, an application package can be created that includes an execution unit (like CWL) but also includes additional metadata annotations via the OGC Process Description.
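
For example, here is a minimal sketch of such a package, assuming schema were made optional as suggested; the link form of executionUnit, the workflow.cwl URL, and the application/cwl media type are all assumptions for illustration:

POST /processes
Content-Type: application/ogcapppkg+json

{
  "processDescription": {
    "id": "DeployWorkflow",
    "version": "1.0",
    "inputs": {
      "wf-input": {
        "title": "Workflow Input",
        "description": "Metadata-only annotation; the schema would be inferred from the execution unit.",
        "keywords": ["workflow"]
      }
    }
  },
  "executionUnit": {
    "href": "https://example.com/packages/workflow.cwl",
    "type": "application/cwl"
  }
}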

fmigneault commented 2 months ago

@jerstlouis

An alternative would be to support a PUT to /processes/{processId} to create the resource

That wouldn't work because the processId would not be defined for the subsequent PUT. The process ID must be available from the get-go during the POST deploy request. The important thing to highlight from the examples is that DeployWorkflow cannot be the same as "process": "https://example.com/processes/MainProcess". The MainProcess must already exist in order to extract its I/O schema definitions, which are used to resolve the types referenced by wf-input and wf-output.

Which additional metadata are we missing at the input / output level?

In case MainProcess did not provide any title, description, keywords or metadata, the corresponding wf-input and wf-output of the deployment could want to provide them. It could also want to override their content to make them more relevant/detailed in the context of the new workflow, which might not expose all the I/O offered by MainProcess, or by any of the other nested processes.

Is that because schema is required?

Yes, that's the reason, i.e.: https://github.com/opengeospatial/ogcapi-processes/blob/b972e74d8a09b36c1fc54869b9bfe7f44d1fd20f/openapi/schemas/processes-core/inputDescription.yaml#L4-L5 https://github.com/opengeospatial/ogcapi-processes/blob/b972e74d8a09b36c1fc54869b9bfe7f44d1fd20f/openapi/schemas/processes-core/outputDescription.yaml#L4-L5

I would suggest to use null.

If this is the preference over an explicit $ref, then I suggest the other proposal that uses {}. Using {} does not require any modification to the schema of schema, since it already allows an object without any properties: https://github.com/opengeospatial/ogcapi-processes/blob/b972e74d8a09b36c1fc54869b9bfe7f44d1fd20f/openapi/schemas/processes-core/schema.yaml#L3-L4

Only the derived fields / field selector modifiers (properties=) would modify the schema, and it is very clear what the resulting fields are in this case

Yes (field selectors), and no (not clear/easy).

For example, if the NestedProcess only defined its schema as generic GeoJSON (any type), and the field modifiers did some CQL2 filtering keeping only type: Point features while adding a new custom property, the resulting wf-output could be an entirely different (and more specific/narrowed) schema reference and format defined by the user, with [type, custom] requirements for a point. The custom field could even come from different parts of the workflow, making parsing of the field modifiers very complicated.

Since field modifiers can completely redefine the output however they want, it is not trivial to infer the resulting schema. I also see this as a great opportunity for users to define "converter workflows" where a specific output schema could be provided, and which could be "injected" only by the workflow creator who has knowledge about the resulting schema they dynamically created.
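
For instance, the user-injected wf-output schema of such a "converter workflow" could be narrowed along these lines (a hypothetical JSON Schema matching the [type, custom] point example above, with the custom property placed under the GeoJSON Feature properties member):

{
  "type": "object",
  "required": ["type", "geometry", "properties"],
  "properties": {
    "type": { "const": "Feature" },
    "geometry": { "$ref": "https://geojson.org/schema/Point.json" },
    "properties": {
      "type": "object",
      "required": ["custom"],
      "properties": {
        "custom": { "type": "string" }
      }
    }
  }
}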

My understanding of application/ogcapppkg+json [...]

Your understanding is correct. The only issue with ONLY using Content-Type: application/ogcapppkg+json is that it makes whatever is contained in executionUnit very ambiguous, since the possible contents are all JSON with similar/complementary field names (inputs, outputs, etc.).

Omitting the qualified value representation with application/ogc-workflow+json should default to using the processes-dru/executionUnit.yaml in case of ambiguity. That doesn't mean you can't try POST'ing the workflow directly and have it automatically "detect" application/ogc-workflow+json, but I would rather have the standard define a "Best Practice" to include it explicitly to make resolution consistent across implementations.

application/ogcexec+json

I missed this type when reading the draft. It is acceptable as well.

However, I believe application/ogc-workflow+json would be more "explicit" about the fact that an OGC Part 3 Workflow is POST'd rather than any other execution body. The $input and $output of Deployable Workflows are necessary for this definition to make any sense. For the same reason, it is technically "not exactly" the same as an ordinary execution request, since it cannot be executed by itself (values to fill in the I/O referenced by $input/$output would be missing).

fmigneault commented 2 months ago

@pvretano

I am not sure I understand all the contortions about wf-input and wf-output [...]

The advantage (and main purpose) of $input and $output is that they can be placed at any level in the workflow. Therefore, the DeployWorkflow process that would be created would expose only wf-input and wf-output as "top-level" I/O in its process description, but those references could be passed down to, or retrieved from, very deeply nested processes, or even be reused at multiple places in the workflow.

Reusing the schema from the referenced $input/$output is a bonus to avoid duplicating them for wf-input/wf-output, but it is not mandatory. The workflow could define I/O schemas with more explicit conditions, but they must be compatible with the places where they are passed down to or retrieved from.
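
To illustrate the reuse aspect, here is a minimal sketch where the same $input reference appears at multiple nesting depths (ProcessA and ProcessB are hypothetical):

{
  "process": "https://example.com/processes/ProcessA",
  "inputs": {
    "in-direct": { "$input": "wf-input" },
    "in-nested": {
      "process": "https://example.com/processes/ProcessB",
      "inputs": {
        "deep-arg": { "$input": "wf-input" }
      }
    }
  },
  "outputs": {
    "out": { "$output": "wf-output" }
  }
}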

To that I would say that we make "schema" optional in the process description.

That would be a valid alternative, as long as it applies only in the context of application/ogc-workflow+json deployment, to avoid the explicit schema: {} workaround for JSON Schema validation. I believe the I/O schema MUST remain required for process descriptions to make any sense. It's the only field left that indicates what the I/O are.

That way, an application package can be created that includes an execution unit (like CWL) but also includes additional metadata annotations via the OGC Process Description.

This is valid as well. This is actually exactly what CRIM's implementation does ;) (see https://pavics-weaver.readthedocs.io/en/latest/package.html#correspondence-between-cwl-and-wps-fields and https://pavics-weaver.readthedocs.io/en/latest/package.html#metadata)