opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

We should review and consider the OpenEO process Graph for Workflows #278

Open pvretano opened 2 years ago

pvretano commented 2 years ago

The purpose of this issue is just to register the fact that, in the process of creating Part 3 (Workflows), we should review and consider the work already done with the openEO process graph and perhaps the SNAP graph (although, based on what I have seen so far, it seems that the openEO process graph and the SNAP graph are translatable to one another). There is interest in seeing whether openEO process graphs can inform the work on Part 3 or even become a conformance class of Part 3.

See also #47 where there was some discussion between @jerstlouis and @m-mohr.

fmigneault commented 2 years ago

The Workflows implementation should also consider the use of CWL, as it has been proposed in many previous OGC Testbeds, Engineering Reports and EO Apps efforts. This would also improve the synergy with other extensions, namely deploy_replace_undeploy, since the same standard would be employed for deployed atomic applications, predefined processes, and the workflows joining them.

Relates to

jerstlouis commented 2 years ago

@fmigneault @pvretano

Based on our previous discussions and analysis, we see CWL and Processes - Part 3 Workflows & Chaining as orthogonal and complementary capabilities with a different focus. A deployed local or remote CWL-defined workflow can be a process used within a Part 3 - Workflows & Chaining execution request.

To execute a CWL-based workflow with Processes, one must first deploy it as a registered process on the server, where it then remains available until it is deleted. This may require special administrative permissions on the server, and the deployed process is usually visible to all clients. There is no particular mechanism to request data to be used as an input to the process; I imagine regular HTTP URLs or local data sources must be used.
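For illustration, a minimal sketch of that two-step flow (the server URL, process id and deployment payload shape below are assumptions; the endpoint paths follow Part 1 and the draft Part 2):

```python
import requests

server = "https://example.org/ogcapi"  # hypothetical Processes endpoint

# Step 1 (draft Part 2): deploy the CWL-based application package as a new process.
# The payload shape here is an assumption based on the OGC Application Package drafts.
deploy_body = {
    "processDescription": {"id": "ndvi-workflow"},
    "executionUnit": [{"href": "https://example.org/packages/ndvi.cwl"}],
}
requests.post(f"{server}/processes", json=deploy_body).raise_for_status()

# Step 2 (Part 1): the deployed workflow is now a regular process and can be executed.
exec_body = {"inputs": {"image": {"href": "https://example.org/data/scene.tif"}}}
r = requests.post(f"{server}/processes/ndvi-workflow/execution", json=exec_body)
print(r.status_code, r.headers.get("Location"))  # async executions return a job location
```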

Therefore I think a CWL conformance class makes sense as part of the OGC Application Package Best Practice and/or as part of Processes - Part 2: Transactions for deployment, as per the issue #258 you mentioned.

Processes - Part 3: Workflows & Chaining, on the other hand, is intended for quickly and easily executing ad-hoc workflows entirely from the end-user/client side, with the ability to inter-connect in a workflow graph any local and/or remote OGC API data collections and/or processes, and to instantly request execution results, without needing to first deploy the workflow as a process (although the capability to deploy a process based on a Workflows execution request, using Processes - Part 2: Transactions for deployment, is also planned). The workflow can easily be tweaked (e.g. to adjust parameters) and updated, and would normally exist only on, and be private to, the client, until the point when the client wants to either publish it as a process or publish the result as an OGC API collection (which could still be dynamically generated, and always current based on the latest input data). Workflows & Chaining also provides the capability to do on-the-fly processing, e.g. on a tile-by-tile or bbox+resolution basis, since actual processing can be delayed until the time of the (processing result) data request (e.g. an OGC API - Coverages, Tiles or Maps request), which then specifies the desired output format, resolution and area of interest.

Caching (and invalidating the cache), pre-emption of future requests, and setting up lower-priority background "batch processing" that can be put on hold for more urgent immediate client requests can all be combined with this, which I feel could give great results in terms of efficient use of processing resources and real-time performance with the very latest data available.

With Workflows & Chaining, all federated OGC API implementations (federated simply by agreeing to respond to and request from the other deployments in the federation) can suddenly act as a very flexible and powerful single system, just by adding to the Processes - Part 1: Core execution request support for nested processes, as well as the concept of OGC API data collections as inputs and outputs of a process.
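For illustration, a rough sketch of what such an execution request could look like with the draft Part 3 additions (the process URIs, collection URL and input names are made up):

```python
# A nested process and an OGC API collection used directly as inputs
# (draft Part 3 syntax; all identifiers below are hypothetical).
workflow = {
    "process": "https://example.org/ogcapi/processes/RenderMap",   # local or remote process
    "inputs": {
        "layers": [
            {   # nested process: same structure as the top-level execution request
                "process": "https://other-server.example/ogcapi/processes/NDVI",
                "inputs": {
                    "data": {   # OGC API data collection used directly as an input
                        "collection": "https://other-server.example/ogcapi/collections/sentinel2-l2a"
                    }
                },
            }
        ]
    },
}
# Such a document could be POSTed like any Part 1 execution request.
```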

@pvretano Regarding openEO, from what I remember from the discussion with @m-mohr, in openEO the process graph uses finer-grained processes which do very simple things like aggregation or a single arithmetic operation. Although an implementation could actually decide to implement things this way (and that could possibly be an openEO conformance class, offering those required small building-block well-known processes), my own view is that it would be more convenient to define well-known processes for e.g. coverage processing that understand a particular coverage/raster processing language, be it CQL with extensions, Python + a particular library, WCPS, the GRASS raster map calculator, etc., and then use that generic coverage-processing process as a single node of the workflow, with the expression as a constant parameter input to the process. That is because I believe it would be considerably more expressive to use e.g. CQL + extensions than a nested execution request, and a great deal easier to manually edit such expressions. From the #47 discussion consider https://api.openeo.org/assets/pg-evi-example.json vs. something like:

min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE))
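For illustration, such an expression could be passed as a constant input to a hypothetical well-known coverage-processing process (the process name, input keys and language identifier below are made up):

```python
evi_node = {
    "process": "https://example.org/ogcapi/processes/CoverageProcessor",  # hypothetical
    "inputs": {
        "data": {"collection": "https://example.org/ogcapi/collections/sentinel2-l2a"},
        "expression": "min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE))",
        "language": "cql2-extended",  # assumed identifier for the expression flavour
    },
}
```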

There is also discussion about enabling such expression / arithmetic / aggregation capabilities directly in the data access APIs (e.g. EDR and Coverages). I think it would then be interesting to consider the capability to specify, as additional properties of a "collection input" in a workflow, such additional operation(s) to be applied when requesting the data (which would then turn into additional query parameters based on the hop-negotiated OGC API specification). A similar capability / use case is the ability to add e.g. a Features - Part 3 filter on an OGC API collection input to a process.

fmigneault commented 2 years ago

@jerstlouis

Using CWL does not necessarily mean that processes must be deployed at runtime. It is simply one format for defining the graph that tells how to chain I/O between given processes, which could all be entirely defined on the server with whichever library/language handles their execution under the hood.

I consider the private/publishing aspect of processes or workflows completely out of scope from not only Part 3, but OGC API - Processes entirely. This should usually be the job of a policy enforcement point that is on top of the server where this API is served, since it implies even more definitions irrelevant to OGC API - Processes, such as users, groups, resources and permissions. The same applies for caching results and process execution priorities. Those are server optimizations and design choices based on business rules, which is not the role of OGC API - Processes to define (at most maybe propose recommendations), since each server will implement it differently anyway.

Either way, CWL would not limit editing, publishing or on-the-fly operations on Workflows any differently than any other selected specification. I am not against openEO either. Looking at https://api.openeo.org/assets/pg-evi-example.json, only a few adjustments would probably be needed to translate between the openEO and CWL formats, as they both use similar parameters and I/O chaining with IDs between steps. I simply believe OGC API - Processes should work toward using a common specification for both workflows and deployments, rather than separate specifications for each. Having two different specifications complicates both implementations and user adoption/experience.

jerstlouis commented 2 years ago

@fmigneault

Using CWL does not necessarily mean that processes must be deployed at runtime.

I'm not sure I understand what you mean by this, or how it relates to what I was trying to explain about how CWL and Processes - Part 3 are orthogonal capabilities. To use CWL with Processes, before you can send an execution request, the process must already exist on the server. With Processes - Part 3, the execution request is the workflow definition.

I consider the private/publishing aspect of processes or workflows completely out of scope from not only Part 3, but OGC API - Processes entirely. This should usually be the job of a policy enforcement point that is on top of the server where this API is served, since it implies even more definitions irrelevant to OGC API - Processes, such as users, groups, resources and permissions. The same applies for caching results and process execution priorities. Those are server optimizations and design choices based on business rules, which is not the role of OGC API - Processes to define (at most maybe propose recommendations), since each server will implement it differently anyway.

I fully agree with all of this; I was just trying to paint a picture of how Part 3 makes this easy to do, by presenting the input and output of a process as an OGC API collection accessible using data-access OGC API specifications, and by triggering processing as a result of data access requests. Also, since a Part 3 workflow is expressed as part of an execution request, it does not require "creating" a process resource on the server in order to execute it.

I simply believe OGC API - Processes should work toward using a common specification for both workflows and deployments, rather than separate specifications for each. Having two different specifications complicates both implementations and user adoption/experience.

Part 2 is about creating a new process on the server; Part 3 is about a client being able to use a graph of both local and remote OGC API processes and collections directly as part of an execution request, and about triggering processing as a result of a data access request.

The two are fundamentally different, but complementary.

m-mohr commented 2 years ago

@pvretano Regarding openEO, from what I remember from the discussion with @m-mohr, in openEO the process graph uses finer-grained processes which do very simple things like aggregation or a single arithmetic operation.

While our processes are very fine-grained, the process graph doesn't care about granularity. It's completely independent of the processes.

Although an implementation could actually decide to implement things this way (and that could possibly be an openEO conformance class, offering those required small building-block well-known processes)

Sounds like a good idea if there are multiple ways to work with processes. On the other hand, I'm questioning the decision to invent a new way, as the openEO process graphs have been around before.

my own view is that it would be more convenient to define well-known processes

Fully agreed! But (at least in openEO) processes and process graphs are something different and I think here as well? The process graph is the way to chain and combine individual processes into a full workflow.

From the #47 discussion consider https://api.openeo.org/assets/pg-evi-example.json vs. something like:

min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE))

Please note that we make heavy use of client libraries. What you have above is supported by all client libraries in a way that best fits each programming language, but it is then converted into a process graph that is client- and back-end-independent so that it can be easily exchanged. The process graph you've linked to above is nothing a user usually sees. From an implementation perspective, the process graphs are easier to parse and generate by implementations than a math expression. I'm questioning how many people will use the APIs without client libraries and make direct use of such expressions in an HTTP request. We never see people ask for this. They ask for client libraries as they know Python, R or Julia...
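For context, a minimal sketch of the kind of client-library usage described here, using the openEO Python client (the back-end URL, collection id and band names are placeholders):

```python
import openeo

connection = openeo.connect("https://openeo.example.org")   # placeholder back-end
cube = connection.load_collection(
    "SENTINEL2_L2A",                    # placeholder collection id
    bands=["B02", "B04", "B08"],        # blue, red, NIR
)
blue, red, nir = cube.band("B02"), cube.band("B04"), cube.band("B08")
evi = 2.5 * (nir - red) / (1 + nir + 6 * red - 7.5 * blue)
result = evi.min_time()                 # temporal minimum, as in the expression above
# The client would serialize `result` into a back-end-independent process graph
# (e.g. when printing it as JSON or calling result.download(...)).
```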

pvretano commented 2 years ago

All: with regard to defining a set of well-known processes, that is not the function of either Part 2 or Part 3 of the specification. These parts define building blocks and so need to be more fundamental than that. Other parts or best practice documents would define "well-known" processes for specific domains such as EO. Good discussion so far!

fmigneault commented 2 years ago

@jerstlouis

What I mean is that although Part 2 uses CWL as the way to deploy processes, it does not necessarily imply that CWL definitions can only exist when deployed. It is perfectly possible to have already-existing processes defined and referenced by CWL on the server. There are in fact some "builtin processes" in CRIM's implementation that do exactly that, without any deployment involved. The CWL definition in itself only provides the details of how to chain I/O between any given processes/scripts/apps and how to call them. Creating an execution graph of said processes can be done during a workflow execution request. In that regard, the execution graph generated by CWL is equivalent to the one defined by openEO. Only the semantics differ slightly between them, hence why it would probably be possible to translate between the two representations if need be.

I agree with you that Part 2 and Part 3 are different in nature, and complementary, but I was simply highlighting the fact that using two different standards/representations for seemingly the same resulting process execution graph would place a higher burden on the user than a common one would.

jerstlouis commented 2 years ago

@m-mohr

Fully agreed! But (at least in openEO) processes and process graphs are something different and I think here as well?

Yes, Part 3 just adds support for nested processes and for using OGC API collections in the execute request, therefore allowing a graph to be created directly there.

But generic well-known processes could be defined (separately from Part 2 or Part 3, as @pvretano mentions) where one of the inputs takes an expression describing part of the processing in a specific processing language used by a particular community.

The process graph you've linked to above is nothing a user usually sees.

In OGC API - Processes, the processes themselves, however, are usually visible to the user, who would discover and connect them in a graph (even if done through a UI -- but the execution requests are now simple enough that it is easy to do this manually or to write code that does it).

From an implementation perspective, the process graphs are easier to parse and generate by implementations than a math expression.

Very similar to CQL2-JSON vs. CQL2-Text.

I'm questioning how many people will use the APIs without client libraries and make direct use of such expressions in an HTTP request. We never see people ask for this. They ask for client libraries as they know Python, R or Julia...

I agree most usage would be through libraries. But I think expressions defining coverage processing are something that many users do like to write directly in a particular flavor that could be understood by a given well-known process.

jerstlouis commented 2 years ago

@fmigneault Thank you for clarifying; it makes sense of course that CWL can be used in the back-end separately from deployment.

using two different standards/representations for seemingly the same resulting process execution graph would place a higher burden on the user than a common one would.

The nice thing about Part 3 is that it simply adds to the Part 1 execution request the possibility for inputs to be a nested process (with the same structure as the top level of the JSON document) or an OGC API collection (a link to .../collections/{collectionId}, identified as such with "collection"). So for a Processes client or service implementing Part 1, Part 3 is super easy to support, because they already deal with execution requests.
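As a minimal sketch of why that is (assuming the draft's "process" and "collection" keys), an existing Part 1 implementation walking the inputs of an execution request only has to recognize two new cases:

```python
def classify_input(value):
    """Classify one execution-request input value (sketch; assumes draft Part 3 keys)."""
    if isinstance(value, dict):
        if "process" in value:
            return "nested process"       # same structure as a top-level execution request
        if "collection" in value:
            return "OGC API collection"   # link to .../collections/{collectionId}
        if "href" in value:
            return "remote reference"     # already part of Part 1
    return "literal value"                # numbers, strings, bounding boxes, ...
```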

fmigneault commented 2 years ago

@jerstlouis I see. Then the specification/tool running this chain/graph of processes would be left to the implementation, since it is transparent from the OGC API - Processes point of view.

jerstlouis commented 2 years ago

@fmigneault Well, the execution request is a chain/graph of processes (at least to the granularity of the processes exposed by the API -- each process might imply its own chain/graph as well).

In our implementation, the processing engine uses that directly. But another implementation could of course easily translate that to CWL if that is what their processing engine takes in.

If a process used in the graph was deployed or configured with either CWL or a Part 3 execution request, the source for that might also potentially be made available, linked from the process description. And like any regular process, that deployed workflow process defined with CWL or Part 3 might also take inputs that it can pass on appropriately to its sub-processes.

m-mohr commented 2 years ago

Yes, Part 3 just adds support for nested processes and for using OGC API collections in the execute request, therefore allowing a graph to be created directly there.

There is a difference: openEO process graphs are agnostic of data sources; instead, openEO defines a well-known process (e.g. load_collection) to load from OGC APIs. This way you can also easily define other data sources.

But generic well-known processes could be defined (separately from Part 2 or Part 3, as @pvretano mentions) where one of the inputs takes an expression describing part of the processing in a specific processing language used by a particular community.

I'm not so sure how useful that would be. Seems rather complex to mix things up. A conversion tool might be the better approach.

In OGC API - Processes, the processes themselves, however, are usually visible to the user, who would discover and connect them in a graph

Yes, in openEO, too. As mentioned before there's a difference between a process and a graph. I just meant you don't see the internal JSON structure but instead something that you can actually understand (e.g. web UI, Python expression, ...)

I agree most usage would be through libraries. But I think expressions defining coverage processing are something that many users do like to write directly in a particular flavor that could be understood by a given well-known process.

I think I'd need to see an example to understand what you mean.

Anyway, I have the feeling that discussing in a GitHub issue is creating some misunderstandings right now as we all use slightly different terminology. It could make sense to discuss this in a separate call where we can show examples, go through them step by step, clarify misunderstandings directly etc. Seems much easier than writing lengthy posts here. Would there be an interest in that?

pvretano commented 2 years ago

@m-mohr perhaps we can carve out some time during the regularly scheduled SWG meetings on Mondays to discuss. The next meeting is this Monday. I can request that we put this on the agenda.

m-mohr commented 2 years ago

@pvretano What time are the meetings? If it doesn't conflict with the STAC meetings, then I should be able to attend.

pvretano commented 2 years ago

@m-mohr meetings are Mondays between 9:00am and 10:00am EST, every other week. So even if you have a conflict this week, we can discuss at the next meeting in 2 weeks. I don't recall that the STAC meetings are as regular. Correct?

m-mohr commented 2 years ago

@pvretano Okay, I'm available both days. I don't see a Processes meeting for Mondays in the OGC Portal, could you maybe send an invite or so?

bpross-52n commented 2 years ago

@m-mohr We meet on Monday, January 24th. Check here for details: https://portal.ogc.org/index.php?m=calendar&a=view&event_id=75196

m-mohr commented 2 years ago

Thanks, I'll be there if you can get it on the agenda.

mr-c commented 2 years ago

@m-mohr We meet on Monday, January 24th. Check here for details: https://portal.ogc.org/index.php?m=calendar&a=view&event_id=75196

Can I be invited as an external person?

bpross-52n commented 2 years ago

@m-mohr We meet on Monday, January 24th. Check here for details: https://portal.ogc.org/index.php?m=calendar&a=view&event_id=75196

Can I be invited as an external person?

@mr-c I think so. Please drop me an email, so I can forward the meeting info to you.

jerstlouis commented 2 years ago

@m-mohr Good idea to discuss this on Monday. But to quickly address your comments:

There is a difference: openEO process graphs are agnostic of data sources and instead defines a well-known process (e.g. load_collection) to load from OGC APIs. This way you can also define other data sources easily.

You could still do this (using well-known processes) with Processes - Part 3: Workflows & Chaining, but it has a built-in syntax specifically for OGC API collections that does not require a load_collection well-known process. This makes it easier to interoperate with other OGC API specifications, also because requesting data output from a remote process used as an input can itself be done using OGC API specifications (e.g. Coverages, Tiles or DGGS requests), which will trigger processing if that particular Processes implementation also supports Part 3 (allowing a focus on the area / resolution of interest and e.g. parallel / distributed requests per tile, negotiation of output formats, etc.).

I'm not so sure how useful that [ processing expressions ] would be. Seems rather complex to mix things up. A conversion tool might be the better approach.

None of Parts 1, 2 or 3 define this, so it is up to implementers to decide what makes sense for them and their community. A conversion tool may certainly be useful between different ways of expressing these. But I think there may sometimes be value in distinguishing between how to express processing for a single data collection or dataset (e.g. expressing the computation of a vegetation index from a coverage as an NDVI expression), vs. a higher-level workflow involving different sources (how imagery retrieved from other servers feeds as an input to the NDVI, and how that NDVI fits into a separate machine-learning predictor process).
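A hypothetical sketch of those two levels (all process ids, URLs and input names are made up): an NDVI expression handled by a single well-known process, nested inside a higher-level workflow that feeds a separate predictor process.

```python
ndvi_node = {
    "process": "https://imagery-server.example/processes/CoverageProcessor",
    "inputs": {
        "data": {"collection": "https://imagery-server.example/collections/sentinel2-l2a"},
        "expression": "(NIR - RED) / (NIR + RED)",   # NDVI, in the process's own expression language
    },
}
workflow = {
    "process": "https://ml-server.example/processes/CropYieldPredictor",
    "inputs": {"vegetation_index": ndvi_node},       # higher-level chaining across servers
}
```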

Yes, in openEO, too. As mentioned before there's a difference between a process and a graph. I just meant you don't see the internal JSON structure but instead something that you can actually understand (e.g. web UI, Python expression, ...)

The Part 3 JSON structure is actually quite concise and easy to understand, and with this concept of well-known processes for processing a particular collection of data, a Python expression string could be how a particular well-known process (e.g. the EOX python-coverage-processor in MOAW) expresses what is to be computed, as one of the inputs to that process in the workflow. And web UIs can still be built on top of that, of course. I think there is a lot of subjectivity to all this, so it would make sense to leave it open to implementers, to the extent that we can show that things can still be interoperable (and there could still be e.g. an "openEO" profile of Part 3 as part of that solution to interoperability).

I think I'd need to see an example to understand what you mean.

See the example Part 3 workflows in Annex B of the draft MOAW discussion paper.

Anyway, I have the feeling that discussing in a GitHub issue is creating some misunderstandings right now as we all use slightly different terminology. It could make sense to discuss this in a separate call where we can show examples, go through them step by step, clarify misunderstandings directly etc. Seems much easier than writing lengthy posts here. Would there be an interest in that?

Looking forward to discussing this on Monday!

fmigneault commented 2 years ago

Following the presentation of the process (metadata/graph) in openEO by @m-mohr, it is clear to me that both openEO and CWL would be equivalent and translatable representations of Workflows. In the case of Part 2, the deployed "process/CWL" is basically what openEO defines as the "process/graph", and the "process/metadata" definition comes from the core OGC API - Processes.

I believe the "collection" input presented by @jerstlouis could be defined as a new type of input (in parallel to the current Literal, BoundingBox and Complex types) in order to handle the additional processing of data-retrieval requests. This wouldn't be a "workflow" per se, but rather a data handler, similar to how Complex handles a URL to pull data in a given format.

I am still having issues regarding the chaining of nested Workflows with the approach proposed by @jerstlouis, though. It seems to assume that all processes produce a single output which will be inline-inserted into the input of the top-level process. As soon as this is not the case, or multiple output formats are possible, many additional execute parameters would need to be provided to "pick" the desired elements. In that regard, I believe the Workflow graphs defined by openEO or CWL are more adequate for the job.

Finally, I would like to extend one point @pvretano mentioned. It is true that in Part 2, a Process/CWL is Deployed and then Executed using a second request. This is simply the way we designed things, to avoid "redeploying" the full (and often complex) process definition each time. It would be entirely possible to directly submit the full CWL embedded in the Execute body, since that would only mean reading the CWL from the request body rather than from a pre-stored one (from Deploy) in the database.

m-mohr commented 2 years ago

@fmigneault Thanks for these insights, appreciated. I'd like to point out that openEO processes also just return one output per process. That said, there are ways to mitigate that, e.g. by wrapping in arrays or objects from which you can then "pick". This is very much like in most programming languages. Nevertheless, you can surely have multiple things generated from a process, e.g. at the end of a graph you can generate both a netCDF and a GeoTIFF file for some data.

jerstlouis commented 2 years ago

@fmigneault

It seems to assume that all processes produce a single output which will be inline-inserted into the input of the top-level process. As soon as this is not the case, or multiple output formats are possible, many additional execute parameters would need to be provided to "pick" the desired elements.

Although most examples worked out so far focused on processes generating a single output, we also considered multiple outputs.

One suggested approach is that instead of a "collection" output (as at /collections/{collectionId}), a client could request a "dataset API" output which would be just like an OGC API landing page linking to multiple collections (each an individual output).

For the chaining, either a process would expect an input to be a "dataset" with multiple outputs, or it could select one or more individual outputs of the process using the usual "outputs" key of the Processes - Part 1 execution request.
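A rough sketch of that second option, assuming the "outputs" key works the same way inside a nested process as in a Part 1 execution request (process ids and output names are made up):

```python
workflow = {
    "process": "https://example.org/processes/Mosaic",
    "inputs": {
        "imagery": {
            "process": "https://example.org/processes/AtmosphericCorrection",
            "outputs": {"surface-reflectance": {}},   # select one of several outputs
            "inputs": {
                "scene": {"collection": "https://example.org/collections/sentinel2-l1c"}
            },
        }
    },
}
```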

As for the formatting, the idea is that a client only needs to specify the final output format, while leaving the hops alongside the workflow chain to negotiate between themselves the ideal formats and API based on common capabilities (negotiate using Tiles with a particular TileMatrixSet or Coverages or DGGS API, negotiate GeoTIFF or netCDF to exchange data -- even though all the client specified is that in the end it wants a JPG for the final output).