opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

Workflow examples and use cases ... #279

Open pvretano opened 2 years ago

pvretano commented 2 years ago

Following on from issue #278, the purpose of this issue is to capture examples of workflows from the various approaches (OpenEO, OAPIP Part 3, etc.), compare them and see where there is commonality and where there are differences. The goal is to converge on some conformance classes for Part 3.

Be specific with the examples, provide code if you can, and try to make them not too long! ;)

pvretano commented 2 years ago

@jerstlouis GET gets the definition of a process. There is no way to GET the description of the process where description is the definition PLUS other information an OAProc endpoint needs to be able to actually deploy a process. The description of the process is what we call the application package.

Is everyone in agreement with this terminology?

jerstlouis commented 2 years ago

@pvretano It's the other way around :)

You POST a definition at /processes, and you GET a description at /processes/{processId}.

We don't yet have a GET operation for retrieving the definition, but I had suggested GET /processes/{processId}/workflow and Francis implemented GET /processes/{processId}/package so that it is not specific to workflows.

How about GET /processes/{processId}/definition or GET /processes/{processId}/executionUnit ?
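
To summarize the operations being discussed (a sketch; the /package, /definition and /executionUnit endpoints are proposals under discussion, not adopted):

POST /processes                        deploy a process (Part 2)
GET  /processes/{processId}            retrieve the description (Part 1)
GET  /processes/{processId}/package    retrieve the definition / execution unit (proposed)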

fmigneault commented 2 years ago

@jerstlouis

I would like to avoid using the word description to refer to this, and call it a definition instead, to avoid confusion with the process description returned by GET /processes/{processid} (which does not include the executionUnit). I was suggesting to @pvretano that we change those instances where the word description is used in Part 2 to definition.

Fine with me.

I don't understand why you say that the Part 3 execution workflow does not have its workflow chain defined yet. As I understand it, the Part 3 execution request is the workflow chain.

At the moment the request is submitted with details on how to chain I/O, the workflow is not yet defined from the point of view of the API. After the contents are parsed, a workflow definition can be dumped to a file or a database, or held in memory by the runner that will execute it; only then does the workflow exist. I'm just pointing out that with a deployed workflow, the API doesn't even need to parse the payload: it is already aware of the full workflow definition. Because these different workflow interpretations happen at different times, it is important to identify them properly, to avoid the same kind of confusion as with the process description/definition.

@pvretano I personally prefer to have package under the /processes/{processId}/package because it is tightly coupled with the process.

pvretano commented 2 years ago

Yikes. Stop! @jerstlouis @fmigneault Please chime in with ONE WORD answers. I don't want an essay! ;)

What do we call what you get from GET /processes/{processId}? A definition or a description? I call it a definition.

What do we call what you POST to /processes to deploy a process? A definition or a description? I call it a description.

What do we call what you would get from /processes/{processId}/package? A definition or a description? I call it a description.

jerstlouis commented 2 years ago

@pvretano

GET /processes/{processId} -- A description (that's what it is called in Part 1).

POST /processes -- A definition.

GET /processes/{processId}/(package / definition / executionUnit) -- A definition.

fmigneault commented 2 years ago

What do we call what you get from GET /processes/{processId}? description

What do we call what you POST to /processes to deploy a process? description + executionUnit (or package :P) aka definition

What do we call what you would get from /processes/{processId}/package? executionUnit/package only

pvretano commented 2 years ago

So, we GET a description and we POST a definition. I will update the terminology in Part 2 accordingly! OK?

pvretano commented 2 years ago

Excellent! Progress ... :)

pvretano commented 2 years ago

Created issue #282 to resolve the definition versus description terminology issue in part 2. Please review and add comments about the question I pose in #282. ... and make them SHORT comments please! ;)

jerstlouis commented 2 years ago

@fmigneault About:

I also find that POSTing the "workflow chain" each time on the execution endpoint doesn't align with deploy/describe concepts. The whole point of deploy is to persist the process definition and reuse it. Part 3 redefines the workflow dynamically for each execution request, requiring undeploy/re-deploy or replace each time, to make it work with Part 2.

The MOAW workflows (Part 3 execution request-based workflow definitions) can either be used to define deployable workflows and deployed with Part 2, or executed in an ad-hoc manner by POSTing them to an execution end-point -- both options are possible (separate capabilities: a server could support either or both). Both ad-hoc execution and deployed workflows could also make sense with CWL and OpenEO process graphs.

Alternatively, if undeploy/re-deploy/replace is not done each time, and that the "workflow chain" remains persisted, then why bother re-POSTing it again as in Part 3 instead of simply re-using the persisted definition? They are not complementary on that aspect.

Part 3 defines the "ad-hoc workflow execution" capability as a way to allow using pre-deployed (local and/or remote) processes (i.e. NestedProcess/RemoteProcess) and (local and/or remote) collections (i.e. CollectionInput/RemoteCollection), which does not require the client to have access to deploy new processes. With the CollectionOutput capability, even an "ad-hoc workflow execution" can be POSTed only once, and data can be retrieved from it for many different regions/resolutions without having to POST the workflow for each process-triggering data request.

It is not exactly the same though. For Execution Workflow, we need to add more details such as the outputs in the nested process to tell which one to bubble up to the parent process input. It is not a "big change", but still a difference.

The selection of "outputs" is a capability already in the Core execution request. Nested processes is really the only extension for ad-hoc execution.

A pre-deployed/described Workflow would not need this information, since all details regarding the "workflow chain" already exist. Only in that case, the execution request is exactly the same syntax as for any process execution.

The DeployableWorkflows are what needs the wiring of inputs/outputs of the overall process being deployed to the inputs/outputs of the processes internally, so that is another extension specific to that capability.

Still in both cases, it's the exact same execution request schema with very specific extensions.

fmigneault commented 2 years ago

@jerstlouis You really need to explain how this deployment with "ad-hoc workflow execution" works with a concrete example. I don't see how it can happen. If you POST on the execution endpoint (async or sync), you either receive the job status/location or the outputs directly. Where is the deployed workflow information? How do you provide details about which processID to deploy it as? How can the user making that request know where is the deployed process to describe it or execute it again without re-POSTing the workflow?

jerstlouis commented 2 years ago

@fmigneault

If you POST on the execution endpoint (async or sync), you either receive the job status/location or the outputs directly

Correct, plus Part 3 introduces the CollectionOutput and LandingPageOutput execution modes returning a collection description and landing page respectively (with client then triggering processing via data access requests, e.g. Coverages or Tiles).
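
For illustration, a CollectionOutput interaction might look roughly like this (a sketch; the response=collection negotiation, the process name and the inputs are assumptions based on the Part 3 draft):

POST /processes/someProcess/execution?response=collection
Content-Type: application/json

{ "inputs": { "some-input": "some-value" } }

The response is a collection description rather than a job or results document; the client then triggers the actual processing through regular data requests on its links, e.g. Coverages or Tiles requests on the resulting collection, each request processing only the area/resolution asked for.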

Where is the deployed workflow information?

I think we are lost in terminology again, because what I mean by "ad-hoc workflow execution" (POSTing directly to an execution end-point) is the polar opposite of "deployed workflow". However, in the case of CollectionOutput and LandingPageOutput, you could include a link to the "workflow definition" in the response. I imagine this link could also be included in the case of a job status / results document response.

How do you provide details about which processID to deploy it as?

The "ad-hoc workflow execution" is to avoid having to deploy it as a process. (e.g. there are fewer safety issue with executing already deployed processes vs. deploying new ones; or an EMS may only execute processes but not have ADES capabilities).

How can the user making that request know where is the deployed process to describe it or execute it again without re-POSTing the workflow?

In the case of CollectionOutput and LandingPageOutput, the client just makes different OGC API data requests from the links in the response. In sync/async mode, the user cannot -- they need to submit another ad-hoc execution (that's why it's an ad-hoc execution: no need to deploy first).

Now in contrast to the "ad-hoc workflow execution", the "deployable workflow" is what you can deploy as a process, using Part 2. That can be done with CWL, or OpenEO, or a MOAW workflow (extended from the Processes - Part 1: Core execution request + nested processes + input/output wiring of overall process to internal processes) in the execution unit. That execution unit can be included in an "OGC JSON Application Package", or be directly the Content-Type POSTed to /processes.
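
As a rough sketch of that deployment route (assuming the processDescription/executionUnit structure and the application/ogcapppkg+json media type of the Part 2 draft; process names and URLs are hypothetical):

POST /processes
Content-Type: application/ogcapppkg+json

{
    "processDescription": { "id": "myWorkflow", "version": "1.0.0" },
    "executionUnit": {
        "process": "https://example.com/ogcapi/processes/SomeProcess",
        "inputs": {
            "data": {
                "process": "https://example.com/ogcapi/processes/OtherProcess",
                "inputs": { "some-input": "some-value" }
            }
        }
    }
}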

Does that make things more clear?

fmigneault commented 2 years ago

@jerstlouis
It brings some clarification, but there are still some items I'm not sure I understand.

So if I follow correctly, the deployment of this ad-hoc workflow could be defined and referenced by a description link provided in LandingPageOutput, but Part 3 provides no methodology or schema indicating how this deployment would be done, nor even what a MOAW workflow definition would look like? (Note: I don't consider the payload in the execution body to be a definition itself, because it embeds values, and it cannot be deployed as-is to create a process description with I/O types. It's more like making use of the definition; it would be wrong to have specific execution values in the process description.)

If that is the case, I don't think it is fair to say "The MOAW workflows [...] can either be used to define deployable workflows" if no example of a workflow definition inferred from the execution chain is provided. It also seems to contradict "they need to submit another ad-hoc execution". What would a MOAW workflow even look like when calling GET /processes/{processId}/definition with application/moaw+json? Does it even make sense to have something returned by that request, since it is effectively ignored and re-submitted with a potentially different ad-hoc execution workflow?

jerstlouis commented 2 years ago

@fmigneault

What would a MOAW workflow even look like then when calling GET /processes/{processId}/definition with application/moaw+json

It would look like the Part 1 execute request, with the following two extensions: nested "process" objects as input values, and the "input"/"output" members wiring the inputs/outputs of the overall process to those of the internal processes.

I don't consider the payload in the execution body a definition itself because it employs values, which cannot be deployed as is to create a process description with I/O types.

I am not sure I understand your view on this... If you consider a workflow with a single hop, it is identical to a Processes - Part 1: Core execution request. If you have one nested process, the top-level process receiving the workflow acts as a Processes - Part 1: Core client towards that nested process. So since it works for one hop, why wouldn't it work for any number of hops?
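
A minimal illustration (with hypothetical process URLs): a single-hop execution request is plain Part 1, and nesting simply replaces an input value with another execution request:

{
    "inputs": { "x": 1 }
}

becomes, with one nested hop:

{
    "inputs": {
        "x": {
            "process": "https://example.com/ogcapi/processes/ProcB",
            "inputs": { "y": 2 }
        }
    }
}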

since it is effectively ignored and re-submitted with a potentially different ad-hoc execution workflow?

I don't understand what you mean by this... It seems like you might possibly be mixing up the execution request invoking the blackbox process vs. the execution request defining the workflow that invokes processes internally (not the blackbox process). Could that be the case?

fmigneault commented 2 years ago

It is not really about the number of hops. There is no issue with the quantity of nested processes or how they connect to each other. The issue is about the content of the execution payload.

When the ad-hoc workflow is submitted for execution, the values are embedded in the body (this is fine in itself, no problem). Very simplified:

{  
    "process": "url-top-most",
    "inputs": { 
        "input-1": {
            "process": "url-nested",
            "inputs": {
               "some-input":  "<some-real-data-here raw|href|collection|...>"
            },
            "outputs": { "that-one": {} }
        }
    }
}

The problem happens when trying to explain the behaviour between Part 2 and Part 3. The above payload is not a direct definition.

Let's say there was a way for the user to indicate they want that exact chain to be process mychain (ie: POSTing it on /processes), and to deploy it with Part 2 using the MOAW format; the bodies returned by GET /processes/mychain and GET /processes/mychain/definition + application/moaw+json could do one of two things:

  1. Both substitute "<some-real-data-here raw|href|collection|...>" with some { "schema": ... } object, making it a generic input type of the process that can be called with alternative values on the execution endpoint. The process description of the full workflow should then only list "some-input" under "inputs", since this is the only value that can be provided; the others are enforced by the workflow.
  2. The process definition enforces those specific values (the workflow is not tweakable), but then "some-input" CANNOT be an input listed in the process description, since it cannot be provided at the execution endpoint.

If this mychain process cannot be executed directly with just { "some-input": "my-alternative-data" }, but must instead be provided the above payload entirely again, then Part 2 deploy using MOAW has no reason to exist. Deploying a Part 3 workflow would bring nothing new, because it is resolved ad hoc on the execution endpoint.
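
For reference, executing the deployed mychain directly with a plain Part 1 request (case 1) would then look like this sketch:

POST /processes/mychain/execution
Content-Type: application/json

{
    "inputs": { "some-input": "my-alternative-data" }
}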

jerstlouis commented 2 years ago

@fmigneault

If that example workflow is intended to be a DeployableWorkflow, and "some-input" is an input parameter left open to be specified when executing mychain, then it should use the "input": ... extension intended for that as I described above:

{  
    "process": "url-top-most",
    "inputs": { 
        "input-1": {
            "process": "url-nested",
            "inputs": {
               "some-input": { "input" : "myChainInput1" }
            },
            "outputs": { "that-one": { "output": "myChainOutput1" } }
        }
    }
}

That wires the "myChainInput1" input of the myChain blackbox to the "some-input" of the "url-nested" internal process (and same for output).

A process description for myChain can be fully inferred from this, at least in terms of inputs/outputs (though things like title and description cannot be inferred without providing those details). The process description for myChain will list "myChainInput1" as an input and "myChainOutput1" as an output. The type of "myChainInput1" can be inferred from the type of "some-input" in url-nested's process description, since that is where it is used. The type of "myChainOutput1" can likewise be inferred from the type of "that-one" in url-nested's process description.

This is a DeployableWorkflow, so nothing to do with the "ad-hoc workflow execution" (which does not leave any input/output open, but would provide values for all inputs).

And to clarify again: ad-hoc workflow execution stands in opposition to deployed workflows.

Does that help?

fmigneault commented 2 years ago

Yes, that helped a lot.

My follow-up question is not about whether a process description can be inferred (it definitely can), but rather which of the approaches (1) and (2) in https://github.com/opengeospatial/ogcapi-processes/issues/279#issuecomment-1057209894 must be taken?

Is it safe to say that if "input" : "myChainInput1" is specified, then the process description would become (case 1):

{
   "id": "myChain",
   "inputs": { 
       "myChainInput1": { "schema" :  { "type": "string (a guess from 'some-input')" } }
   }, 
   "outputs": {
       "myChainOutput1": { "schema": { "type": "string (a guess from 'that-one')" } }
    }
}

But if "input" : "myChainInput1" was omitted (case 2), then the above process description would instead have {"inputs": {}} (ie: the execution request does not take any input, all is constant)?

Also, to make sure: would "outputs" of myChain also contain the outputs of "process": "url-top-most" (not explicitly listed)? Otherwise, what was the point of executing this parent process in the workflow chain?

I think DeployableWorkflow and "ad-hoc workflow execution" could be considered as a whole, because I could take advantage of the similar structure to do both deploy+execute using this:

        "inputs": {
            "some-input": { "input" : "myChainInput1 (for deploy)", "value": "<some-data> (for execute)" }
        }

Mapping from/to MOAW/CWL would then be very much possible.

jerstlouis commented 2 years ago

Is it safe to say that if "input" : "myChainInput1" is specified, then the process description would become (case 1): [...]

But if "input" : "myChainInput1" was omitted (case 2), then the above process description would instead have {"inputs": {}} (ie: the execution request does not take any input, all is constant)?

Correct, but then the workflow is not really intended to be deployed as a process as it does not accept any input. It would make more sense as an ad-hoc workflow execution, or POSTed as a persistent virtual collection to /collections instead.

Also to make sure, would "outputs" of myChain also contain the outputs of "process": "url-top-most" (not explicitly listed)? Otherwise what was the point to execute this parent process in the workflow chain?

My thinking (which is relatively recent, since I realized we were missing this "output" while working out this thread's scenarios) is that if any "output" is specified, then there are no implied outputs. If no "output" is specified, then the top-level process's outputs are implied.

You are right that the top-level process would be pointless in this case, so for the example to make sense we should also specify another "output" from url-top-most, which would become a second output of myChain.
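
As a sketch, the workflow would then read ("top-result" being a hypothetical output identifier of url-top-most):

{
    "process": "url-top-most",
    "inputs": {
        "input-1": {
            "process": "url-nested",
            "inputs": { "some-input": { "input": "myChainInput1" } },
            "outputs": { "that-one": { "output": "myChainOutput1" } }
        }
    },
    "outputs": { "top-result": { "output": "myChainOutput2" } }
}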

I think DeployableWorkflow and "ad-hoc workflow execution" could be considered as a whole, because I could take advantage of the similar structure to do both deploy+execute using this:

Well, yes, the MOAW syntax is the same in both cases, and much the same as Part 1 as well -- reusability was definitely the goal.

Mapping from/to MOAW/CWL would then be very much possible.

Awesome :)

         "inputs": {
            "some-input": { "input" : "myChainInput1 (for deploy)", "value": "<some-data> (for execute)" }
        }

Would that ever happen in the same workflow, though? I would think that you either deploy or execute... At the point where you execute the deployed workflow, you replace the "input" with the "value".

With CollectionInput, "collection": (collectionURL) is also a placeholder for different pieces of data sourced from that collection at different resolutions and areas of interest, using any API+formats combination supported by both ends of the hop.
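
Expressed in the workflow syntax used above, a CollectionInput would look roughly like this (a sketch; the process and collection URLs are hypothetical):

{
    "process": "https://example.com/ogcapi/processes/SomeAnalysis",
    "inputs": {
        "data": { "collection": "https://example.com/ogcapi/collections/sentinel2-l2a" }
    }
}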

m-mohr commented 2 years ago

@pvretano I've just seen the MulAdd example in the tiger team recordings. I think it would be a good first step to translate that into openEO to see how it compares. Can you point me to the example? I can't really read the URL in the video. Then I could do a quick crosswalk...

jerstlouis commented 2 years ago

@m-mohr In the meantime, for Mul and Add processes taking two operand inputs value1 and value2 it would look something like:

{
  "process": "https://example.com/ogcapi/processes/Mul",
  "inputs": {
    "value1": 10.2,
    "value2": {
      "process": "https://example.com/ogcapi/processes/Add",
      "inputs": {
        "value1": 3.14,
        "value2": 5.7
      }
    }
  }
}
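
For a rough first crosswalk, the same Mul/Add chain expressed as an openEO process graph might look like this (a sketch assuming openEO's predefined add and multiply processes; untested):

{
  "process_graph": {
    "add1": {
      "process_id": "add",
      "arguments": { "x": 3.14, "y": 5.7 }
    },
    "mul1": {
      "process_id": "multiply",
      "arguments": { "x": 10.2, "y": { "from_node": "add1" } },
      "result": true
    }
  }
}
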
m-mohr commented 2 years ago

Thanks, @jerstlouis, but I was looking at another example from @pvretano which had a lot more metadata included. The full example from him would be better to crosswalk as it shows more details.