opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

P3-Workflows: Re-usable components ($ref with JSON Pointers) #304

Open jerstlouis opened 1 year ago

jerstlouis commented 1 year ago

Possibly as a separate conformance class, it would be very useful to support re-usable components with $ref and a JSON Pointer, to avoid repeating the same redundant (potentially complex) nested process definition within a workflow. In addition, an implementation would easily be able to identify that two nested process objects are identical and that their outputs should therefore be re-used (as opposed to executing the process twice).

This would also facilitate defining a nested process that generates multiple outputs, then referring to it while adding an "outputs" section to select a specific output for use as a particular input, e.g.:

{
   "process": "https://example.com/ogcapi/processes/SomeProcess",
   "inputs" : {
      "orthoPhoto" : { "$ref" : "#/components/photogrammetry", "outputs" : { "orthoPhoto": { } },
      "pointCloud" : { "$ref" : "#/components/photogrammetry", "outputs" : { "pointCloud": { } }
   },
   "components" :
   {
      "photogrammetry":
      {
         "process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs" : { ... }
      }
   }
}
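As an aside, resolving such a local fragment is plain RFC 6901 JSON Pointer evaluation. A minimal sketch in Python (standard library only; the workflow document and pointer are taken from the example above):

```python
import json

def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer (the fragment after '#') in a parsed JSON document."""
    target = document
    if pointer in ("", "#"):
        return target
    for token in pointer.lstrip("#").lstrip("/").split("/"):
        # Unescape per RFC 6901: '~1' -> '/', then '~0' -> '~' (in that order)
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target

workflow = json.loads("""
{
   "process": "https://example.com/ogcapi/processes/SomeProcess",
   "components": {
      "photogrammetry": {
         "process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs": {}
      }
   }
}
""")

component = resolve_pointer(workflow, "#/components/photogrammetry")
print(component["process"])  # https://example.com/ogcapi/processes/Photogrammetry
```

Note that this only locates the referenced object; what the referencing node does with it (replacement vs. merging with sibling keys) is exactly the question debated below.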

See also scenario 4 in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1046682392, where it is used in a different context (which may be better addressed with a mechanism as part of the Input/Output Fields conformance class to aggregate over a temporal dimension for multiple intervals, as opposed to completely reducing the temporal dimension).

fmigneault commented 1 year ago

I like the idea. To align with the usual OpenAPI components, I think it would be good to require the process definitions to live under #/components/processes/{processId}. That would also allow other types of definitions, like repeated #/components/inputs/{inputId} and so on, without confusion about whether a given $ref points to a process or to something else.

jerstlouis commented 1 year ago

@fmigneault I agree, as long as you don't mean to enforce the use of $ref to define processes / inputs etc. so that things remain compatible with Part 1 (this would still be a separate conformance class).

Thanks for the suggestion!

However, a nested Process definition is one type of input to another process, so I am not sure about /inputs/{inputId} (unless that was a reference to external inputs in the context of the Deployable Workflows conformance class).

Maybe we should clarify what types of "components" we have.

fmigneault commented 1 year ago

Only the $ref locations (if used) would be enforced. Explicitly defined items would remain valid; the two should be equivalent after parsing the JSON with some RefResolver implementation.

nested Process definition is one type of input to another process

That made me realize, shouldn't the inputs look something like this?

{
  "inputs" : {
    "orthoPhoto" : { "process": { "$ref" : "#/components/photogrammetry" } }
  }
}

The "process" is the field that indicates the chaining of Workflow I/O. This is not equivalent to a simple "expand JSON from $ref" expression which could apply for any part of the JSON.

jerstlouis commented 1 year ago

That made me realize, shouldn't the inputs look something like this?

@fmigneault The "process" key is the URI of the nested process, not an input object.

This follows Part 1, which supports simple JSON types as well as "href", "value" : { <object> } and "bbox". Part 3 introduces the new "process", "collection", and "$input" (parameterizable input for a deployed workflow) input types. It also allows adding a "process" key at the top level so that a workflow can optionally fully identify the process for which it is intended.

The "process" is the field that indicates the chaining of Workflow I/O.

Specifying a process input object is what indicates the chaining (although you may also consider a nested collection, a remote one in particular, to be workflow I/O).

This is not equivalent to a simple "expand JSON from $ref" expression which could apply for any part of the JSON.

I agree that we could restrict exactly where a $ref can be used.

fmigneault commented 1 year ago

I think the current example has a fundamental problem. Normally, "$ref" instances must be expanded in place where they were referenced.

Currently, the process definition

{
  "photogrammetry":
  {
    "process": "https://example.com/ogcapi/processes/Photogrammetry",
    "inputs" : { }
  }
}

Cannot be expanded in place as expected in the below content because of the extra "outputs". That "outputs" must be ignored according to usual OpenAPI behavior, and that changes the interpretation of the submitted job.

{
   "inputs" : {
      "orthoPhoto" : { "$ref" : "#/components/photogrammetry", "outputs" : { "orthoPhoto": { } } }
   }
}
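A sketch of why the sibling key gets lost, under the strict draft-04-style rule where an object containing $ref is replaced wholesale (the resolver below is illustrative, not any particular library's behavior):

```python
def expand_refs_strict(node, root):
    """Draft-04-style expansion: a node containing '$ref' is REPLACED entirely,
    so sibling keys such as 'outputs' are silently discarded."""
    if isinstance(node, dict):
        if "$ref" in node:
            # Everything else in this object ('outputs' included) is ignored.
            target = root
            for token in node["$ref"].lstrip("#/").split("/"):
                target = target[token]
            return expand_refs_strict(target, root)
        return {k: expand_refs_strict(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [expand_refs_strict(v, root) for v in node]
    return node

doc = {
    "inputs": {
        "orthoPhoto": {
            "$ref": "#/components/photogrammetry",
            "outputs": {"orthoPhoto": {}}   # lost after strict expansion
        }
    },
    "components": {
        "photogrammetry": {
            "process": "https://example.com/ogcapi/processes/Photogrammetry",
            "inputs": {}
        }
    }
}

expanded = expand_refs_strict(doc, doc)
print("outputs" in expanded["inputs"]["orthoPhoto"])  # False: the sibling key was dropped
```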

The correct schema would require a oneOf [ { $ref }, { outputs } ] object. That makes the process inputs less readable in my opinion and much more complicated to define, which is why I was trying to come up with another way to reference the process to allow this kind of definition:

{
  "inputs" : {
    "orthoPhoto" : { "process": { "$ref" : "#/components/photogrammetry" }, "output": { } }
  }
}

If instead we moved "outputs" within the #/components/photogrammetry definition, the problem described above would not occur. However, since the main reason to have "#/components/photogrammetry" is to repeat the reference to the same process and avoid duplication, how could we reuse its reference for different input/output combinations? A process component that includes the inputs/outputs would not allow reuse with different I/O variations.

jerstlouis commented 1 year ago

@fmigneault Thank you for pointing this out.

Normally, "$ref" instances must be expanded in place where they were referenced.

That is the intent.

Cannot be expanded in place as expected in the below content because of the extra "outputs".

It is the intent that the "outputs" are combined with the expanded content at the JSON pointer.

That "outputs" must be ignored according to usual OpenAPI behavior,

That $ref behavior, whereby anything else alongside it is ignored (which I find extremely limiting and annoying), is defined by this informative Internet Draft, as well as by JSON Schema.

JSON Pointer (RFC 6901) itself has no such restriction. So perhaps we could use a key other than $ref, e.g., $include, to avoid that problem and define a convenient behavior?

allOf (not oneOf) is the usual workaround in JSON Schema to make that work, but it is cumbersome, besides those JSON Schema constructs being mostly overkill here...

fmigneault commented 1 year ago

It is normal that RFC 6901 doesn't mention the limitation of $ref, as it concerns the pointer itself (the URL string format) and not the $ref nomenclature. The ignore behavior is explicitly mentioned in JSON schema, so we should not introduce diverging resolution methods.

Since all other definitions in the API use OpenAPI standard keywords, I would personally rather not have new keywords that could be up to interpretation by different tools. If implementers need to code additional parsing over a typical JSON schema resolver, I think the proposal is heading the wrong way.

You are right about allOf, I mixed them up.

jerstlouis commented 1 year ago

@fmigneault

The ignore behavior is explicitly mentioned in JSON schema, so we should not introduce diverging resolution methods.

Since all other definitions in the API use OpenAPI standard keywords, I would personally rather not have new keywords that could be up to interpretation by different tools. If implementers need to code additional parsing over a typical JSON schema resolver,

Right, but this execution request has nothing to do with JSON Schema or OpenAPI, and as you point out, $ref is of limited use for the use case. A special parser (or a special understanding of the execution request read with a regular JSON parser) is already needed for the execution request anyway.

We can try to come up with a better solution, but if we really want to stick to the JSON Schema-like $ref behavior, the following, a bit more verbose, could potentially work:

{
   "process": "https://example.com/ogcapi/processes/SomeProcess",
   "inputs" : {
     "orthoPhoto" : {
     {
        "process" : { "$ref" : "#/components/photogrammetry/process" },
        "inputs" : { "$ref" : "#/components/photogrammetry/inputs" },
        "outputs" : { "orthoPhoto": { } }
     }
     "pointCloud" : {
         "process" : { "$ref" : "#/components/photogrammetry/process" },
         "inputs" : { "$ref" :  "#/components/photogrammetry/inputs" },
         "outputs" : { "pointCloud": { } }
     }
   },
   "components" :
   {
      "photogrammetry":
      {
         "process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs" : { ... }
      }
   }
}

If the exact same input object is to be re-used with no modification at multiple places, you could also have another level, e.g.:

{
   "process": "https://example.com/ogcapi/processes/SomeProcess",
   "inputs" : {
     "orthoPhoto" : { "$ref": "#/components/photogrammetry-ortho" },
     "pointCloud" : { "$ref": "#/components/photogrammetry-pointCloud" }
   },
   "components" :
   {
      "photogrammetry":
      {
         "process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs" : { ... }
      },
      "photogrammetry-ortho":
     {
        "process" : { "$ref" : "#/components/photogrammetry/process" },
        "inputs" : { "$ref" : "#/components/photogrammetry/inputs" },
        "outputs" : { "orthoPhoto": { } }
     },
      "photogrammetry-pointCloud":
      {
         "process" : { "$ref" : "#/components/photogrammetry/process" },
         "inputs" : { "$ref" :  "#/components/photogrammetry/inputs" },
         "outputs" : { "pointCloud": { } }
     }
   }
}

but this is all more verbose than, e.g.:

{
   "process": "https://example.com/ogcapi/processes/SomeProcess",
   "inputs" : {
     "orthoPhoto" : { "$include-properties": "#/components/photogrammetry", "outputs" : { "orthoPhoto": { } } },
     "pointCloud" : { "$include-properties": "#/components/photogrammetry", "outputs" : { "pointCloud": { } } }
   },
   "components" :
   {
      "photogrammetry":
      {
         "process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs" : { ... }
      }
   }
}
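The merge semantics the hypothetical "$include-properties" keyword would need can be sketched as follows (the keyword is only the proposal above, not an adopted mechanism; local keys are assumed here to take precedence over included ones):

```python
def resolve_includes(node, root):
    """Expand the proposed '$include-properties' keyword: the referenced object's
    properties are merged in, and sibling keys (e.g. 'outputs') are KEPT,
    with local keys taking precedence over included ones."""
    if isinstance(node, dict):
        if "$include-properties" in node:
            target = root
            for token in node["$include-properties"].lstrip("#/").split("/"):
                target = target[token]
            merged = {**target,
                      **{k: v for k, v in node.items() if k != "$include-properties"}}
            return resolve_includes(merged, root)
        return {k: resolve_includes(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_includes(v, root) for v in node]
    return node

doc = {
    "inputs": {
        "orthoPhoto": {"$include-properties": "#/components/photogrammetry",
                       "outputs": {"orthoPhoto": {}}}
    },
    "components": {
        "photogrammetry": {
            "process": "https://example.com/ogcapi/processes/Photogrammetry",
            "inputs": {}
        }
    }
}

resolved = resolve_includes(doc, doc)
# Both the included 'process'/'inputs' and the local 'outputs' survive:
print(sorted(resolved["inputs"]["orthoPhoto"]))  # ['inputs', 'outputs', 'process']
```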

fmigneault commented 1 year ago

this execution request has nothing to do with JSON Schema or OpenAPI

This is true, but since the schema field for defining I/O is heavily based on OpenAPI, I think it will only confuse developers and users more if we employ different but very similar paradigms. The fact that special parsing would be required only for that case is already a red flag of its design.

How about something like this? Since the process is already an external reference, I don't see the gain of wrapping it in a $ref. Using $ref this way specifically for repeated I/O also indicates which inputs and outputs are shared/reused across the processes.

{
  "process": "https://example.com/ogcapi/processes/SomeProcess",
  "inputs" : {
    "orthoPhoto" : { 
      "process": "https://example.com/ogcapi/processes/Photogrammetry",
      "inputs": { 
        "input-1": { "$ref": "#/components/inputs/photogrammetry-input" },
        "input-2": { "$ref": "#/components/inputs/shared-input" }
      }, 
      "outputs": { "$ref": "#/components/outputs/photogrammetry-output" }
    },
    "pointCloud" : { 
      "process": "https://example.com/ogcapi/processes/PointCloud",
      "inputs": { 
        "input-1": { "$ref": "#/components/inputs/pointCloud-input" }, 
        "input-2": { "$ref": "#/components/inputs/shared-input" }
      },
      "outputs": { "$ref": "#/components/outputs/pointCloud-output" }
    }
  },
  "components" : {
    "inputs": {
      "photogrammetry-input": { "..." },
      "pointCloud-input": { "..." },
      "shared-input": { "..." }
    },
    "outputs": {
      "photogrammetry-output": { "orthoPhoto": { } },
      "pointCloud-output": { "pointCloud": { } }
    }
  }
}

jerstlouis commented 1 year ago

@fmigneault In your last example, different processes generate the pointCloud output and the photogrammetry output. One of the use cases I was trying to address was the same process & inputs generating multiple outputs, and using the different outputs as different inputs in the workflow.

Another important use case is the exact same output being used in different places. But piecing them together (process, inputs, outputs) as in your example does not guarantee / clearly indicate that the same thing is being re-used (e.g., which inputs to use as the orthoPhoto for SomeProcess would need to be changed in multiple places).

This is true, but since the schema field for defining I/O is heavily based on OpenAPI, I think it will only confuse developers and users more if we employ different but very similar paradigms. The fact that special parsing would be required only for that case is already a red flag of its design.

The schema is for the process description though, not the execution request. So it is somewhat removed... I disagree with calling these design red flags; $ref has this odd behavior because you define it as a JSON object, but that object might end up replaced by a string or number instead... That is why anything added to it needs to be ignored. If we were to come up with our own thing, then we could make our own rules (this is an extended OGC API - Processes execution request after all, not JSON Schema, and it doesn't use JSON Schema). In my last example I called it "$include-properties" to make it crystal clear that it is the properties of the referenced object that get included, unlike with $ref. But I do understand your reservations about sticking to existing patterns / behaviors.

Since the process is already an external reference, I don't see the gain of wrapping it in a $ref.

The gain is that the process URI appears in only one location, so if you wish to change it to another process or server, you only need to change it in one place.

If we really want to use $ref and keep its usual JSON-Schema-like behavior, where the whole thing gets replaced and nothing can be added to it, honestly I feel that the first and second examples here are our best option.

fmigneault commented 1 year ago

One of the use cases I was trying to address was the same process & inputs generating multiple outputs, and using the different outputs as different inputs in the workflow.

You can simply replace "process": "https://example.com/ogcapi/processes/PointCloud" by "process": "https://example.com/ogcapi/processes/Photogrammetry" and that case would also be valid in https://github.com/opengeospatial/ogcapi-processes/issues/304#issuecomment-1307425166.

Whether the same process is referenced by $ref or directly with its URL, it is the combination of process + inputs + outputs that indicates whether the same process execution can be reused for multiple inputs/outputs. That resolution should be done by the workflow engine (not the JSON ref-resolver) after parsing whichever representation, because there is special logic to apply for the real process to be executed with the combined outputs from both orthoPhoto and pointCloud to allow chaining.

In other words, the real call to "https://example.com/ogcapi/processes/Photogrammetry" would need to be accomplished using { "outputs": { "orthoPhoto": { }, "pointCloud": { } } } to perform a single run, and then chaining those outputs respectively into the inputs of the parent process. The logic to combine the outputs when calling Photogrammetry is not directly encoded in the submitted JSON, whether the $ref is preserved or resolved. It is the workflow engine that detects this pattern and makes this optimization. Reference resolution and "workflow pattern" resolution should not be mixed.

The implementation could decide to resolve all $ref items beforehand, in order to read the full JSON with its expanded definitions. The way $ref is used in your examples, would not using them indicate that the workflow engine should forcefully duplicate the Photogrammetry process execution when it doesn't have to? Workflow parsing should result in the same behavior whether $ref or copy-pasted definitions were used.
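One way an engine might implement that detection after reference resolution, regardless of whether $ref or copy-pasted definitions were used, is to fingerprint each nested execution by its process + inputs only (a sketch; the input names and URLs below are illustrative, not from the standard):

```python
import hashlib
import json

def execution_key(node):
    """Canonical fingerprint of a nested execution: same process + same inputs
    => same key, whether the JSON came from a $ref or a copy-pasted definition.
    'outputs' is deliberately excluded so output selections don't split the run."""
    canonical = json.dumps({"process": node["process"], "inputs": node.get("inputs", {})},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two inputs requesting different outputs of the same resolved nested process:
ortho = {"process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs": {"images": {"href": "https://example.com/images"}},  # illustrative input
         "outputs": {"orthoPhoto": {}}}
cloud = {"process": "https://example.com/ogcapi/processes/Photogrammetry",
         "inputs": {"images": {"href": "https://example.com/images"}},
         "outputs": {"pointCloud": {}}}

if execution_key(ortho) == execution_key(cloud):
    # Execute once, requesting the union of the outputs needed downstream.
    combined_outputs = {**ortho["outputs"], **cloud["outputs"]}
    print(sorted(combined_outputs))  # ['orthoPhoto', 'pointCloud']
```

This is exactly the "workflow pattern" resolution described above, kept separate from JSON reference resolution.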

The schema is for the process description though, not the execution request. So it is somewhat removed...

Yes, but it is much easier to keep using the same one. To see what must be submitted for execution, one has to look at the process description. Going back and forth between different paradigms just makes the API harder to employ.

The gain is that the process URI appears in only one location, so if you wish to change it to another process or server, you only need to change it in one place.

Ok. That's a valid use, but not a requirement for detecting reuse of the same process execution over multiple I/O.

first and second examples here are our best option.

I agree, as long as the use of $ref is not a hard requirement for detecting the "reuse process/inputs/outputs" pattern.

Maybe we should consider using $id to indicate which processes/inputs/outputs are reused or not, and still allow the use of $ref with the usual JSON resolution? I've given it a try with the example below, and I think it can resolve all combinations of referencing. It is also more explicit about which items are reused.

process: "https://example.com/ogcapi/processes/SomeProcess"
inputs:
  orthoPhoto: 
    $id: "#/components/processes/Photogrammetry/outputs/orthoPhoto"
  pointCloud: 
    $id: "#/components/processes/Photogrammetry/outputs/pointCloud"
  otherCloud:
    $id: "#/components/processes/OtherPhotogram/outputs/pointCloud"

components:
  $defs:
    Photogrammetry:
      process: "https://example.com/ogcapi/processes/Photogrammetry"

  processes:
    Photogrammetry:
      process:
        $ref: "#/components/$defs/Photogrammetry/process"
      inputs: 
        input-1: 
          $ref: "#/components/inputs/photogrammetry-input"
        input-2:
          $id: "#/components/inputs/shared-input"
      outputs:
        orthoPhoto: {}
        pointCloud: {}
    OtherPhotogram:
      process:
        $ref: "#/components/$defs/Photogrammetry/process"
      inputs: 
        input-1: 
          $ref: "#/components/inputs/pointCloud-input"
        input-2:
          $id: "#/components/inputs/shared-input"
      outputs:
        pointCloud: {}

  inputs:
    photogrammetry-input: "..." 
    pointCloud-input: "..."
    shared-input: "..." 

jerstlouis commented 1 year ago

@fmigneault

I agree, as long as the use of $ref is not a hard requirement for detecting the "reuse process/inputs/outputs" pattern.

It is the workflow engine that detects this pattern and makes this optimization. Reference resolution and "workflow pattern" resolution should not be mixed.

The implementation could decide to resolve all $ref items beforehand, in order to read the full JSON with its expanded definitions. The way $ref is used in your examples, would not using them indicate that the workflow engine should forcefully duplicate the Photogrammetry process execution when it doesn't have to? Workflow parsing should result in the same behavior whether $ref or copy-pasted definitions were used.

In all examples, the Processes Part 1-3 / workflow engine implementation is really free to do whatever it wants as long as it conforms to the client's expectations. It could always detect duplicate outputs, or only when the workflow is defined in a certain way, as long as the end result is the same from the client's perspective. Ideally, of course, it would always do the optimal thing, regardless of how the definition is set up.

Maybe we should consider using $id to indicate which processes/inputs/outputs are reused or not

Is $id something used elsewhere with the same semantic? I think in JSON Schema it is used to "declare / identify" an element, as opposed to referencing one, isn't it?

Here, it introduces yet another type of input, being specifically one output of a process. I like the pattern, but for it to work as a JSON Pointer it forces you to declare the outputs in the process even if you want all of them.

fmigneault commented 1 year ago

Indeed, $id is used only to identify a resource uniquely within the JSON. A value that looks like a $ref location does not perform any "unwrapping" of the reference the way $ref does. That is the whole purpose of it. Using the #/components/... path as an ID, we are certain that the specific reference is used (and preserved after JSON $ref resolution), and if it is repeated in multiple places, it refers to the same item. Any other value for $id could be used, but I think the path to the resource is more intuitive and explicit, since there is no mistaking which item the $id refers to. The workflow engine could then easily detect those repeated $id cases and know when a process execution/inputs/outputs is reused across multiple locations.
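Detecting such repeated $id values is a simple traversal; a sketch of what a workflow engine could do (the $id convention here is only the proposal in this thread, not standard JSON Schema semantics):

```python
from collections import defaultdict

def collect_ids(node, path="", found=None):
    """Walk the workflow JSON and group the paths at which each '$id' value occurs.
    Any '$id' appearing more than once marks a reused process/input/output."""
    if found is None:
        found = defaultdict(list)
    if isinstance(node, dict):
        if "$id" in node:
            found[node["$id"]].append(path)
        for key, value in node.items():
            collect_ids(value, f"{path}/{key}", found)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            collect_ids(value, f"{path}/{i}", found)
    return found

workflow = {
    "inputs": {
        "input-2a": {"$id": "#/components/inputs/shared-input"},
        "input-2b": {"$id": "#/components/inputs/shared-input"},
    }
}

reused = {i: paths for i, paths in collect_ids(workflow).items() if len(paths) > 1}
print(list(reused))  # ['#/components/inputs/shared-input']
```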

jerstlouis commented 1 year ago

@fmigneault but my point was that this is not what $id does at all in JSON Schema, e.g. https://json-schema.org/understanding-json-schema/basics.html#declaring-a-unique-identifier .

{ "$id": "http://yourdomain.com/schemas/myschema.json" }

This identifies the document as a whole with a URI (where the schema can also be retrieved), i.e., it says "this is this id", as opposed to referencing something ("use that thing with this id over there"). There is also now "$anchor", which likewise identifies rather than references, but works with local resources as opposed to full URIs.

Also, I had missed this so far:

https://json-schema.org/draft/2019-09/release-notes.html#keyword-changes

$ref Changed Other keywords are now allowed alongside of it

It is equivalent to using allOf. So I think we should just go with my original proposal, since JSON Schema now actually has the same rules, and not base the decisions on the historical draft and the limitations of older OpenAPI versions.

OpenAPI also supports it as of version 3.1 that is fully aligned with JSON Schema.

fmigneault commented 1 year ago

@jerstlouis Ok, $ref seems more reasonable with the new changes. The body should include $schema with that minimal version for correct interpretation by parsers.

However, I still think there should be something more than just $ref, such as a requirement that $ref MUST be pointing to something tagged by $id or an item under $def (or any other suggestion). Otherwise, it would be perfectly valid to use $ref in other situations not specific to the workflow chain within the body. Some $ref would be seen as "reusable process execution/inputs/outputs" by the workflow engine, while others would be plain JSON references without any special meaning for the workflow.

I think this diverging interpretation could lead to non-interoperable servers because the way the workflow engine would perform $ref resolution could have non-deterministic results.

jerstlouis commented 1 year ago

@fmigneault

The body should include $schema with that minimal version for correct interpretation by parsers.

Well, the schema for the execution request is NOT JSON Schema, so that is not applicable. Rather, we would clarify the behavior in the conformance class defining support for $ref. There we could potentially cite JSON Schema and that version as a reference. But it is not directly applicable, since this is not JSON Schema.

$ref MUST be pointing to something tagged by $id

In JSON Schema, $id is only used for the schema as a whole.

an item under $def (or any other suggestion)

I was not really familiar with $defs (plural) (as opposed to just having a "definitions" key).

With $defs in JSON Schema you have a dictionary of keys to sub-schemas, where each schema can then have its own $id.

So perhaps we could indeed come up with something following that model.

Otherwise, it would be perfectly valid to use $ref in other situations not specific to the workflow chain within the body.

That might actually be relevant and useful in some situations. So we should discuss.

Some $ref would be seen as "reusable process execution/inputs/outputs" by the workflow engine, while others would be plain JSON references without any special meaning for the workflow.

Ideally, the implementation should be able to identify re-usability irrespective of whether things are using $ref or not. Even if it only supports doing so where $ref is used, a $ref used in a context where re-usability is not applicable should not have any impact.

I think this diverging interpretation could lead to non-interoperable servers because the way the workflow engine would perform $ref resolution could have non-deterministic results.

I think the workflow definition should have at least some level of determinism regardless of the path taken by the workflow engine. It would be better to address such a concern by guaranteeing that executing the same process with the same inputs one or more times produces identical results, rather than by constraining the workflow engine to execute things a certain way (considering that you would want these workflows to scale to multiple nodes, Kubernetes, etc. anyhow).

fmigneault commented 1 year ago

Ideally, the implementation should be able to identify re-usability irrespective of whether things are using $ref or not. Even if it only supports doing so where $ref is used, a $ref used in a context where re-usability is not applicable should not have any impact.

I agree that the Workflow definition should not depend on the use of $ref, and that it should be possible to resolve reusable process execution/inputs/outputs from a specific JSON structure regardless of whether $ref is used. For that very reason, the original idea that the Workflow engine would easily be able to identify those situations with $ref "hints" becomes invalid. Using $ref in a Workflow definition should not have any special meaning for the Workflow resolution itself.