opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

Other process encodings? #325

Open m-mohr opened 1 year ago

m-mohr commented 1 year ago

In Part 1 I found the following sentence:

The Core does not mandate the use of any specific process description to specify the interface of a process. Instead this standard defines and recommends the use of the following conformance class: OGC Process Description

This means I'd expect that I could, for example, use the openEO process encoding in OAP.

Requirement 11 in "Core" says:

The content of that response SHALL be based upon the OpenAPI 3.0 schema processList.yaml.

The processList.yaml refers to the processSummary.yaml though which has a very specific encoding in mind: http://schemas.opengis.net/ogcapi/processes/part1/1.0/openapi/schemas/processSummary.yaml

Thus, I think there's a conflict in the specification that should be resolved. Are we allowed to return e.g. openEO processes in /processes?

pvretano commented 1 year ago

@m-mohr no. As you point out, the /processes response must conform to the processList schema which in turn references processSummary.yaml. /processes is just the list of available processes with some summary description of each process. Looking at it now, we should probably move jobControlOptions and transmissionMode to process.yaml since they are specific to the OGC process description vocabulary.

You can, however, respond with an openEO process description at the /processes/{processId} endpoint. That is what the requirement from Core that you cite is trying to say with the terminology "interface of a process". The interface of a process is at /processes/{processId}.

Not sure if the specification mentions this explicitly, but in the links section of the process summary you can include links to the process description in any number of process description languages or vocabularies, so you could include a link to an openEO description from there.
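For illustration, a process summary entry along those lines might look something like this (the application/openeo+json media type, the rel values, and the URLs here are just assumptions for the example, nothing the Standard currently defines):

```yaml
# Hypothetical sketch only: a processSummary.yaml-conformant entry whose links
# advertise both the OGC process description and an openEO description.
id: ndvi
title: Normalized Difference Vegetation Index
version: 1.0.0
jobControlOptions:
  - sync-execute
  - async-execute
links:
  - rel: self
    type: application/json                 # OGC process description
    href: https://example.com/processes/ndvi
  - rel: alternate
    type: application/openeo+json          # assumed media type for an openEO description
    href: https://example.com/processes/ndvi?f=openeo
```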

jerstlouis commented 1 year ago

@pvretano Regarding what is returned in the process list at /processes (which is the summary of each process), would this not depend on the negotiated media type? If the implementation does not specify conformance to "OGC process description" but specifies conformance to a new "OpenAPI process description" for example, couldn't it return an OpenAPI document?

Clients can already negotiate an HTML representation of /processes as another example.

The process list uses the "self" relation type inside the summary to link to the full object for each individual process, so it would make sense that the representation of the summary and that of the full process description are consistent based on a particular negotiated representation.

m-mohr commented 1 year ago

Thanks. I guess the content negotiation doesn't help if the media type for both is application/json (openEO processes and OGC process descriptions are both JSON).

jerstlouis commented 1 year ago

@m-mohr In Part 3, there is currently a suggestion in 14-Media Types to define specific media types. I suggest e.g., application/ogcexec+json for the execution requests. There should also be specific media types for openEO, and possibly we could do the same for the process description. However, sticking to application/json for the default OGC process description would make sense; I think it is kind of expected that most Processes implementations would support it.

This is an issue that pops up everywhere application/json is used as a content type (just saying JSON, XML or PBF says next to nothing about the content), and is related to the negotiation-by-profile issue, which would be an alternative for differentiating them.

Possibly we should always define new specific media types for new JSON schemas.

m-mohr commented 1 year ago

The issue in /processes is also that it has a JSON "wrapper" (links, processes - which is actually the same in OAP and openEO) and only the individual processes included in the processes property differ. I assume application/ogcexec+json describes the process itself, not necessarily the full response of /processes?!

jerstlouis commented 1 year ago

@m-mohr If the component processes of openEO could be defined as regular OGC Process Descriptions, that would really be ideal. The openEO process graph is the equivalent of the process execution request that is extended in Part 3 to be able to nest processes.

I assume application/ogcexec+json describes the process itself, not necessarily the full response of /processes?

Correct, sorry for the confusion -- I edited my message for clarity. It's for making a request to execute the process, not describe it. Not a response, but the payload from the client to execute the process.

In a sense it's a distinction between a workflow description (process execution request or OpenEO process graph) vs. a process description (single black box process).

It is possible that a single process also happens to be defined by a workflow (process execution request or OpenEO process graph), in which case that could be made available as a visible "execution unit" (the definition of that process, not its description). That is related to Part 2: Deploy, Replace, Update and the Deployed Workflows requirements class of Part 3. The description is only the inputs and outputs of the process; the definition's execution unit is what the process actually does expressed in some way (Docker container, execution request, OpenEO process graph, CWL, Python script...).

fmigneault commented 1 year ago

However, sticking to application/json for the default OGC process description would make sense. I think it is kind of expected that most processes implementation would support it.

I agree about the default application/json response format.

Also, I don't understand why it is not mandatory, for the sake of interoperability. As mentioned, the definition is what changes a lot between implementations and extensions, but the core description should be somewhat consistent regardless. At least a minimal set of critical components such as the inputs and outputs definitions should be standardized, since a standardized representation for POST'ing the execution content is needed anyway.

Note that even if application/json was returned for multiple encodings, we could provide (or mandate providing) a $schema or a JSON-LD @context field referring to the one being represented. There is no need to introduce a new media type each time, as the contents can always be parsed as JSON anyway.
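As a rough sketch of that idea (both identifier URIs below are made up for illustration, not registered values):

```yaml
# The same application/json payload self-identifying its encoding in-band.
"$schema": https://schemas.opengis.net/ogcapi/processes/part1/1.0/openapi/schemas/process.yaml
id: ndvi
version: 1.0.0
# ... rest of the OGC process description ...

# or, for an openEO process description, something like:
# "@context": https://example.com/openeo/process-context.jsonld
```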

pvretano commented 10 months ago

21-AUG-2023: There is a difference between what you get at /processes and /processes/{processId}. What you get at /processes is a list of available processes. What you get at /processes/{processId} is an actual process description.

The schema for the "list of processes" is fixed by the specification and is defined in processSummary.yaml. All implementations of OAPIP, regardless of how they describe their processes, must use the same summary schema for the list of processes. You can negotiate a different output format (e.g. XML) at /processes, but the specification only defines HTML and JSON output right now.

The story is different at /processes/{processId}. At this endpoint you are requesting a detailed process description that includes input definitions, output definitions, etc. Here the specification DOES NOT mandate a particular schema. Rather, it includes a conformance class for an OGC process description, but other process descriptions such as openEO are possible. We do need to define a media type for an OGC process description so that content negotiation can be used to distinguish an OGC process description from an openEO process description, for example.

Assigning to @pvretano to create a PR to add a media type for an OGC processes description ...

m-mohr commented 10 months ago

Thanks. Would it be an option to allow mixing different process types in /processes, so that the processes array can contain e.g. openEO and OGC API - Processes descriptions? We are currently experimenting with this in the GDC API and I have a client that supports it already. It would be a way to support openEO and OAP through the same API without requiring different requests with different Accept media types for the two specifications.
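To make that concrete, a mixed /processes response could look roughly like this (a rough sketch only; the entries and URLs are invented and this is not what the GDC API actually specifies):

```yaml
# Mixed list sketch: the first entry follows the OGC process summary schema,
# the second is an openEO predefined process (openEO fields abridged).
# Clients can tell them apart by the differing fields
# (jobControlOptions vs. parameters/returns).
processes:
  - id: buffer                      # OGC API - Processes summary
    version: 1.0.0
    jobControlOptions: [sync-execute]
    links:
      - rel: self
        href: https://example.com/processes/buffer
  - id: apply_kernel                # openEO process
    summary: Apply a spatial convolution with a kernel
    parameters:
      - name: data
        schema: { type: object, subtype: raster-cube }
      - name: kernel
        schema: { type: array }
    returns:
      schema: { type: object, subtype: raster-cube }
links:
  - rel: self
    href: https://example.com/processes
```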

jerstlouis commented 10 months ago

Even if we do have separate media types for openEO vs. OGC process descriptions, with the openEO description friendly to existing openEO clients, ideally I think it should also be possible for openEO backends to offer an OGC process description for those openEO processes for clients implementing strictly OGC API - Processes. Still hoping we can validate the feasibility of this in T19-GDC (even if we don't have time to implement it).

@pvretano Perhaps new media types for process descriptions and/or execution requests are one thing that we could require in 2.0. It would be good for openEO to also have a different media type for both.

I think it would also make sense to explore having an OpenAPI process description, which we could consider including in 2.0 as a separate requirements class, perhaps in the sprint next week if we have time.

That is, /processes/{processId} for Accept: Application/vnd.oai.openapi+json;version=3.0 would return an OpenAPI definition tailored specifically to that single process, focusing on the content of the individual "inputs" in the schema for the POST request to /processes/{processId}/execution and the result (for the single output synchronous 200 response and/or for the async job results at .../results/{resultId}).
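A rough sketch of what such a per-process OpenAPI document might contain (the echo process and its single message input are invented for illustration; this is not something the Standard defines today):

```yaml
# Hypothetical OpenAPI 3.0 document returned from /processes/echo when the client
# negotiates application/vnd.oai.openapi+json;version=3.0 (or a YAML variant).
openapi: 3.0.3
info:
  title: echo (process-specific API)
  version: 1.0.0
paths:
  /processes/echo/execution:
    post:
      summary: Execute the echo process
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [inputs]
              properties:
                inputs:
                  type: object
                  required: [message]
                  properties:
                    message:
                      type: string
      responses:
        '200':
          description: Synchronous result (single output)
          content:
            text/plain:
              schema:
                type: string
```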

fmigneault commented 10 months ago

I think it should also be possible for openEO backends to offer an OGC process description for those openEO processes for clients implementing strictly OGC API - Processes.

IMO, it is not "should", but "must". Otherwise, they are not really an interoperable OGC API - Processes implementation...

Interoperability is already barely achieved with current implementations that are supposed to be using the same process description format. Adding alternatives will make it even more complicated than it already is. Nothing wrong with allowing additional media types/encodings though, as long as the common one is supported.

explore having an OpenAPI process description

I'm curious what you have in mind. Isn't this already offered for inputs/outputs using this portion: https://github.com/opengeospatial/ogcapi-processes/blob/master/openapi/schemas/processes-core/inputDescription.yaml#L17-L18

jerstlouis commented 10 months ago

@fmigneault

IMO, it is not "should", but "must". Otherwise, they are not really an interoperable OGC API - Processes implementation...

Technically the OGC process description is not a mandatory requirements class, but I agree that this is a very strong should, and I hope it can be achieved.

There is ongoing discussion about whether an implementation of a GeoDataCube API supporting processing with an openEO backend should be fully aligned with OGC API - Processes. I believe it should be (including support for the OGC Process Description), so that a generic OGC API - Processes / GeoDataCube API processing client can execute it, but this differs from the current published openEO specification. I will present on this topic and there will be a follow-on discussion on Monday the 25th at the Singapore meeting in the GeoDataCube SWG.

I'm curious by what you have in mind? Isn't this already offered for inputs/outputs using this portion:

This is a JSON Schema within an OGC process description, not an OpenAPI definition. I am thinking of the process description itself being an OpenAPI definition, as can be found at /api, but in the case of /processes/{processId} (Accept: Application/vnd.oai.openapi+json;version=3.0), the included paths only concern the execution and response for that particular process, and the same schema is embedded directly within the POST payload JSON Schema for the /processes/{processId}/execution path (not using the OGC process description to describe it, but the OpenAPI way).

Essentially this would allow generic OpenAPI clients and developers to execute OGC API - Processes without knowing anything about OGC Process Descriptions or OGC API - Processes in general.

pvretano commented 10 months ago

@fmigneault @jerstlouis when we first wrote WFS, for the sake of interoperability, we mandated GML. That turned out to be both a good thing (for interoperability) and a bad thing (because GML was a beast to implement). So, when the time came to design OGC API - Features we decided, instead, not to mandate any particular format and to let the client and the server negotiate a mutually agreeable format. To HELP interoperability we added conformance classes to OGC API - Features for GeoJSON and GML. The situation is similar here. I don't think we should MANDATE a specific process description language but rather let the client and the server negotiate a mutually agreeable format. However, as we did with Features, we included the OGC Process Description conformance class and RECOMMENDED that servers implement it for interoperability. I don't think we need to qualify the recommendation any further. The specification recommends it and that should be sufficient. In Features we usually say "if [GeoJSON | GML] is suitable for your purposes ... blah, blah, blah".

jerstlouis commented 10 months ago

@pvretano agreed, but where I hope we require it is as a dependency of the GeoDataCube API "Processing" conformance class (I added "OGC Process Description" to the processes-1 row in https://gitlab.ogc.org/ogc/T19-GDC/-/issues/25), which is sort of "profiling" OGC API standards to maximize interoperability (i.e., the chance of a successful client/server negotiation).

pvretano commented 10 months ago

@jerstlouis agreed! It is perfectly legal for a referencing specification like GDC to say that an optional conformance class (OGC Process Description in this case) has to be mandatory in the GDC profile of processes.

m-mohr commented 10 months ago

Before we can require the OGC Process Description, we should make sure it's good enough to cater for most needs. I'm not sure whether we could encode openEO processes in OGC Process Descriptions, for example.

The issue with content negotiation is that you may have two JSON-based descriptions that don't have specific media types for them. And then you must also be able to convert from one encoding to the other, which may only be possible with some losses (see above).

jerstlouis commented 10 months ago

@m-mohr

Before we can require the OGC Process description we should make sure it's good enough to cater for most needs

That is very reasonable.

I'm not sure whether we could encode openEO processes in OGC Process descriptions for example.

Can we perform the experiment and validate that? Would you have an example openEO process description that exercises most of the capabilities, and we can try to do the mapping?

If something is missing, there would be no better time than right now to try to address this with the Processes Code Sprint early next week validating Processes 1.1 or 2.0.

I really believe it is critical for interoperability to have this OGC Process Description support all use cases, including the openEO process descriptions.

m-mohr commented 10 months ago

I'd love to, but I'm on vacation until October so I can't do it before the sprint.

jerstlouis commented 10 months ago

@m-mohr Enjoy your vacation :)

But if you have time to just point us to a good sample openEO process description between a Mai Tai and a Piña Colada I could give it a try next week :)

m-mohr commented 10 months ago

I can only point you to the official docs right now, so the process description schema at https://api.openeo.org/#tag/Process-Discovery/operation/list-processes and the processes at https://processes.openeo.org

pvretano commented 10 months ago

I really believe it is critical for interoperability to have this OGC Process Description support all use cases, including the openEO process descriptions.

@jerstlouis if I understand what you are saying, I am not sure I agree.

Interoperability is not a function of the "OGC Process Description". The "OGC Process Description" is one way to describe a process. OpenEO is another, as is CWL. For that matter, so is OpenAPI (I have been experimenting with posting OpenAPI descriptions of a process to my server).

What is required is that the API can accommodate each of these process description languages, which it can via different conformance classes. The "OGC Process Description" conformance class already exists, which means that a client can request the description (i.e. GET /processes/{processId}) of a process as an "OGC Process Description". Assuming the necessary conformance classes existed, a client could then negotiate with the server to request the same description using OpenEO or CWL or OpenAPI in response to a GET /processes/{processId} request. Assuming, for example, that a server claimed support for all these process description languages, a client (using Part 2) could deploy a process that is described using CWL and then another client could request a description of that deployed process using OpenEO. The only requirement is that media types exist so that clients can negotiate a response in their desired process description language or format.

The same line of reasoning would apply to Part 2 where a process is described for the purpose of deployment. A server that claims to support multiple process description languages could then deploy a process described using "OGC Process Description" or OpenEO or CWL or ...

So I guess what I am saying in response to your comment and @m-mohr's original comment is that it should not be a matter of mandating support for one process description language/format and then making sure that format can accommodate other process description languages/formats. Each process description language/format should be supported in its own right (via separate conformance classes) and the server should be responsible (internally) for crosswalking one to the other as per a client's request.

Now that I am writing this, it occurs to me that it may conflict with my previous agreement vis-a-vis GDC. Sorry about that, but I have been thinking more about the situation and this comment reflects my current thinking. I could be completely wrong, but I welcome response comments because this is an important interoperability point.

jerstlouis commented 10 months ago

@pvretano Are we perhaps mixing up Process Description and Process Definition here?

By Process Description I am referring strictly to /processes/{processId} in terms of a client executing a process using Part 1 (potentially with some Part 3 extensions) and being able to find the relevant information (the inputs/outputs and their data types) in a consistent manner.

Although there is the notion that a Part 2 deployment can include a process description if the server can't figure out how to make up an OGC Process Description by itself, we are not talking about deployment at all here.

By interoperability, I mean any client would be able to only implement Part 1 with OGC Process Description, and would be able to execute any process, regardless of how it was defined (whether with CWL, openEO, Part 3, or anything else).

jerstlouis commented 10 months ago

From a quick look at the first process from the EURAC GDC API end-point ( https://dev.openeo.eurac.edu/processes ), it seems to me that the basics of the openEO -> OGC process description mapping are quite straightforward.

The openEO "parameters" correspond to the OGC "inputs", and the openEO "returns" corresponds to the OGC "outputs".

The openEO "schema" corresponds to the OGC "schema" with the caveats related to the multiplicity (minOccurs / maxOccurs).

The "subtype": "raster-cube" could potentially correspond to a special "format" that we could define for gridded coverages, agnostic of any specific media type, like we do for feature collections. In the meantime the fact that an input or output is a gridded coverage / geo data cube is usually indicated by listing supported coverage media types such as "type": "string", "contentEncoding": "binary", "contentMediaType": "image/tiff; application=geotiff", or netCDF for additional dimensions.

m-mohr commented 9 months ago

I don't think it's that easy. For example, the returns and outputs are slightly different to me, and in my conversion I had to make the OGC outputs openEO parameters (as there's a choice to be made). A couple of things can't be translated at all, I think; for example, the process_graph and examples properties (but they are strictly descriptive and not required). Anyway, it looks like it's a lossy migration process.

fmigneault commented 9 months ago

@jerstlouis @m-mohr I would like to point out that an OGC Process Description can include an executionUnit part that can contain the exact openEO representation if anything is missing on the OGC side (see the Part 2 DRU example: https://github.com/opengeospatial/ogcapi-processes/blob/8c41db3fbc804450c89d94b704e1d241105272ae/openapi/schemas/processes-dru/ogcapppkg.yaml). Also, even Part 1: Core technically allows additional parameters in https://github.com/opengeospatial/ogcapi-processes/blob/8c41db3fbc804450c89d94b704e1d241105272ae/openapi/schemas/processes-core/process.yaml. Therefore, an openEO description would be compliant with the OGC Process Description if it adds the missing parts, even if it provides more parameters.

The core requirement of an OGC Process Description would be to port what is compatible, namely the critical inputs and outputs sections. Since openEO and OGC both employ JSON Schema to define I/O parameters, I do not see where the problem is. Anything additional required by openEO for performing the actual "execution" can refer directly to its executionUnit representation, or to its other additional parameters, after mapping inputs -> parameters and outputs -> returns.

For the process_graph part, that would need to be ported into Part 3: Workflows, or into Part 2: DRU with some executionUnit strategy that knows how to handle a workflow such as CWL, or directly the openEO graph for that matter.
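As a rough sketch of that idea (the application/openeo+json media type and the exact shape of the executionUnit value are assumptions; the Part 2 draft may settle on a different structure):

```yaml
# Hypothetical Part 2 (DRU) application package whose execution unit is the
# openEO process graph itself; the EVI-style graph below is abridged.
processDescription:
  id: evi
  version: 1.0.0
  inputs:
    red: { schema: { type: number } }
    nir: { schema: { type: number } }
  outputs:
    result: { schema: { type: number } }
executionUnit:
  type: application/openeo+json       # assumed media type (see discussion above)
  value:
    process_graph:
      subtract1:
        process_id: subtract
        arguments:
          x: { from_parameter: nir }
          y: { from_parameter: red }
        result: true
```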

jerstlouis commented 9 months ago

@m-mohr Right, as @fmigneault points out, the process_graph of a particular process should be retrievable (if the service does wish to expose the inner workings of the process) as a link to a definition of the process (the executionUnit of an application package). We do support that already in our implementation for ad-hoc execution, e.g., https://maps.gnosis.earth/ogcapi/collections/temp-exec-48D2606E/workflow , but we could also have that for deployed processes (e.g., /processes/{processId}/definition or /processes/{processId}/executionUnit).

The examples property is something that could already be added to the process description without breaking anything, and it could be specified in a later version to standardize the approach.

I had to make the OGC outputs openEO parameters (as there's a choice to be made)

Could you please provide more details about this? If this is about providing as an input a "destination" where things get saved / end up, I think as you say this is a choice that can be made for an individual process, but I think either way can work with Processes - Part 1.

To support the Part 3 approach of collection input / output, where a "collection / GeoDataCube" is a first class object, results should really always and only be outputs with no "storage" specified for them... things just flow out of the process and are requested / processing triggered using OGC API data access calls.

m-mohr commented 9 months ago

While we might be able to translate it, why should we do it? We lose all the openEO clients and get OGC API - Processes clients, of which, honestly, I haven't seen a good example. Why not just allow different process flavours in /processes via conformance classes, as we pretty much do in GDC right now?

Could you please provide more details about this?

For the output you need to specify what format you want. This needs to be a parameter in openEO, as the return value just describes what you get; there's no choice. Everything you need to choose from must be a parameter. As such, I also think your return values are a bit flawed, as the format choice is effectively an input that the user may need to provide.

jerstlouis commented 9 months ago

@m-mohr

We lose all the openEO clients and get OGC API - Processes clients

Why not just allow different process flavours in /processes via conformance classes as we pretty much do in GDC right now?

If an implementation could support both the openEO-style and the OGC API - Processes style description, with a different media type, then hopefully you would not lose the existing openEO clients. However, there may still be the other issues of clashing resource paths regarding execution.

In GDC right now, if I understand the current status, it is not possible for a server to implement both openEO and OGC API - Processes at the same end-point. This is "why": to provide an OGC API - Processes flavor of openEO that is fully aligned, even if it differs from the openEO Community Standard based on the current spec, to support OGC API - Processes clients (I think our own client is "OK", and hopefully good examples will appear). It would also allow these openEO workflows to be easily integrated as components of Part 3 workflows. An implementation could support both openEO and OGC API - Processes clients. If need be, such an implementation could provide a different resource path for each, if clashes with the current spec / openEO community standard remain.

For the output you need to specify what format you want. This needs to be a parameter in openEO as for return values it just describes what you get, there's no choice. Everything you need to choose from must be a parameter. As such I also think your return values are a bit flawed as it effectively is an input that the user may need to provide.

With the upcoming (1.1 / 2.0?) revision, the choice of output format for the output is basically being replaced by HTTP content negotiation. This is already the case with Processes - Part 3 "Collection Output" (you can negotiate more than just the format, you can also negotiate the access API e.g. Tiles or Coverages or DGGS, area/time/resolution of interest). As pointed out in https://github.com/opengeospatial/ogcapi-common/issues/160#issuecomment-1708918113, content negotiation is the general approach on the web for this. However, I don't think this is a major issue in aligning openEO and Processes, a format parameter/input can either be there or not, if it is there it can be optional or not, and ideally content negotiation is also supported.

I would also very much like to see Part 3 "Collection Input / Output" as part of a processing profile of the GDC API that makes GeoDataCubes first class objects, since it allows the simple data access only clients to also trigger processing / access processing results the same way they access a regular pre-processed GeoDataCube.

m-mohr commented 9 months ago

with a different media type

Both are application/json AFAIK, so content negotiation fails.

there may still be the other issues of clashing resource paths regarding execution.

For example?

In GDC right now, if I understand the current status, it is not possible for a server to implement both openEO and OGC API - Processes at the same end-point.

I think it's possible in the GDC API.

With the upcoming (1.1 / 2.0?) revision, the choice of output format for the output is basically being replaced by HTTP content negotiation.

Which effectively doesn't change the issue too much; it's still something the user needs to choose, which effectively is some kind of input/parameter.

jerstlouis commented 9 months ago

Both are application/json AFAIK, so content negotiation fails.

Yes, that is part of the issue. One needs to change and/or we need another Accept-profile: header or other differentiating mechanism.

For example?

Well, with OGC API - Processes, processes are executed by POSTing an execution request to /processes/{processId}/execution.

If we only consider openEO as an execution unit that gets deployed using Part 2: Deploy, Replace, Update, then clients could use the usual Part 1 execution against the deployed process ID. Ideally, this would also need some mechanism to define external inputs to that deployed process so that they can be re-used (a parameterized workflow).

If we also consider ad-hoc execution of an openEO process, it should be possible to execute a single process at /processes/{processId}/execution (Part 1: Core).
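As a rough sketch, ad-hoc Part 1 execution of a single openEO process could then look something like this (process id and inputs loosely based on the openEO apply_kernel process; the exact mapping is precisely what is still under discussion):

```yaml
# POST /processes/apply_kernel/execution   (illustrative only)
inputs:
  data:
    href: https://example.com/collections/sentinel2/coverage
  kernel:
    - [0, 1, 0]
    - [1, 1, 1]
    - [0, 1, 0]
outputs:
  result:
    format:
      mediaType: "image/tiff; application=geotiff"
```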

If we also consider ad-hoc execution of an openEO workflow, following the Part 3 nested processes, then it should also be possible to POST to the top-level process of that workflow. I am not yet fully convinced that there could not be such a top-level process, considering how closely the openEO parameters / Processes inputs and openEO returns / Processes outputs correlate. Alternatively, we could figure out another resource path where those openEO workflows are POSTed for ad-hoc execution.

If I recall correctly, openEO defines POST to /jobs, which does not exist in Processes. I guess this is not so much a clash, as nothing yet uses that method for that resource.

I think that is all of the clashes, but I did not verify against the latest crosswalks.

I think it's possible in the GDC API.

In the current version? How, if the openEO and Processes conformance classes specify returning different things at /processes/{processId} for Accept: application/json?

Which effectively doesn't change the issue too much; it's still something the user needs to choose, which effectively is some kind of input/parameter.

I mentioned something similar before, but:

Ultimately, I am of the opinion that formats are an implementation detail internal to the client/server components. Most Web users are oblivious to whether they are looking at a PNG or a JPEG image. The client software is what deals with content/format negotiation. It should be the same for researchers assembling geo-processing workflows and visualizing / interpreting their outputs.

m-mohr commented 9 months ago

Yes, that is part of the issue. One needs to change and/or we need another Accept-profile: header or other differentiating mechanism.

I mean ultimately both need to change because application/json is too generic (or the approach is wrong, which I think is the case).

Well OGC API - Processes are executed by POSTing an execution request to /processes/{processId}/execution.

No clash. Just different endpoints.

If I recall correctly, openEO defines POST to /jobs which do not exist with Processes. I guess this is not so much a clash as nothing yet uses that method for that resource.

Same issue as in /processes. They could be mixed (as described in GDC) or you do content negotiation (with new media types again).

In the current version?

Yes, I don't see any clashes in GDC API. I think I resolved all of them when merging the APIs.

How if the openEO and Processes conformance classes specify to return different things at /processes/{processId} for Accept: application/json?

openEO doesn't have such an endpoint right now.

For a final output that the user is trying to export, the user's client tool may itself be able to convert / export to a variety of formats, regardless of what it retrieved from the processing API. So that choice could be made completely independently of the workflow definition, and could or could not be used as the Accept: header or an input passed to the server.

I've never seen this in an end-user scenario. M2M sure, but not if you expose the client to end-users. People usually know what they want.

For an intermediate output as part of a workflow, that negotiation ultimately should not concern the user.

If I - as a user - want a GeoTiff, why is this not my concern?

Ultimately, I am of the opinion that formats are an implementation detail internal to the client/server components. Most Web users are oblivious to whether they are looking at a PNG or a JPEG image. The client software is what deals with content/format negotiation. It should be the same for researchers assembling geo-processing workflows and visualizing / interpreting their outputs.

That's not what we are seeing in openEO. People want specific formats (usually the ones that they know/usually work with). At least whenever it's not just a quick visualization in the browser, but further processing locally.

But we are drifting away here. Could we stick to discussing the process encodings in /processes?

jerstlouis commented 9 months ago

I mean ultimately both need to change because application/json is too generic

I agree in general that application/json is too generic.

Yes, I don't see any clashes in GDC API. I think I resolved all of them when merging the APIs.

Sorry, I had not fully realized that was the case. I see now how all jobs listed at /jobs and processes at /processes can be openEO and/or OGC API - Processes, and, as per the note, they can be differentiated by their jobID vs. id, and version / parameters / returns properties.

Still, it would be much nicer if we could achieve the ability to execute openEO processes like regular OGC API - Processes. If that is possible, could the extra openEO properties (jobID, parameters, returns...) just be added to support openEO clients as well, or are there other clashing properties?

If I - as a user - want a GeoTiff, why is this not my concern?

That's not what we are seeing in openEO. People want specific formats (usually the ones that they know/usually work with). At least whenever it's not just a quick visualization in the browser, but further processing locally.

If the client is integrated within their "further processing" tool chain, that tool chain is what wants a specific format. As an end-user (at least the one at the very end), you are concerned about answering a particular question or producing a map. (NOTE: This opinion is from the perspective of where the puck is going, not necessarily where it is right now).

But we are drifting away here. Could we stick the discussion to the process encodings in /processes?

Yes, sorry :) The ability and rationale for selecting a particular format was somewhat of a relevant parenthesis.

I think first we should clarify whether it makes sense to execute openEO processes the OGC API - Processes way. Then, whether we can describe the same process (same process ID) for both openEO and OGC API - Processes within the same /processes JSON output.

If openEO execution is completely different from OGC API - Processes execution (even if they can live together at the same end-point in a GDC API), then the discussion of this issue is probably not relevant, since this issue is in the context of OGC API - Processes (Part 1: Core), which would not apply to these openEO processes.

There may also be a clash in terms of having processes listed that do not support Processes - Part 1 (from the perspective of conforming to Processes - Part 1 where some of the processes that should be available are not conformant). But personally I would very much like to see a full alignment where it is possible to execute those openEO processes using Processes - Part 1 & 3.

fmigneault commented 9 months ago

@m-mohr

Why not just allow different process flavours in /processes via conformance classes as we pretty much do in GDC right now?

What would be the point of saying a server is OGC API - Processes compliant if the core requirement to have the expected process description is not fulfilled? OGC API - Processes should not adapt to every possible implementation. It should be the implementation that provides support for OGC API - Processes as one of its representations of a "process".

Both are application/json AFAIK, so content negotiation fails.

We can add more, such as application/openeo+json or application/ogc+json. application/json is the default for the OGC API - Processes description here because this is the standard currently being described. If you want your openEO server to use another representation by default for application/json, that is fine. Many alternate media types could be added; the standard doesn't disallow alternate representations. See some examples of media types recently added for alternate representations:

https://github.com/opengeospatial/ogcapi-processes/blob/8c41db3fbc804450c89d94b704e1d241105272ae/openapi/paths/processes-dru/pProcessDescriptionReplaceUndeploy.yaml#L36-L47

I think it could be a good idea, however, to provide an "official recommendation" of the appropriate media type, such as application/ogc+json, to help these kinds of efforts align. That would be useful for my implementation as well, to distinguish it from application/cwl+json.

For the output you need to specify what format you want. This needs to be a parameter in openEO as for return values it just describes what you get, there's no choice.

Not sure I understand the problem here. If there is only one possible output format in openEO, that is simply what you get. By default, OGC API - Processes allows you to omit the desired output format, and this should return the only one available. If, on the other hand, there is a choice to be provided as an input (parameter), this can also be done. The outputs (return) would only indicate the supported formats.

For example, I have this process https://finch.crim.ca/providers/finch/processes/ensemble_grid_point_wetdays that uses inputs.output_format with enum: [netcdf, csv] in the schema, and outputs.output has formats for the relevant choices from the output_format passed as a parameter. During execution, only one of the formats for output would ever be possible, so the user/client does not need to do any content negotiation about which one they want to obtain. They only specify inputs.output_format when submitting the process execution.
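A condensed sketch of that pattern (field names abbreviated, not the actual finch process description):

```yaml
# The input drives which of the two output formats is produced.
inputs:
  output_format:
    schema:
      type: string
      enum: [netcdf, csv]
      default: netcdf
outputs:
  output:
    schema:
      oneOf:
        - type: string
          contentEncoding: binary
          contentMediaType: application/x-netcdf
        - type: string
          contentMediaType: text/csv
```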

If I didn't understand, please provide more details. I would like to support the openEO representation as well (https://github.com/crim-ca/weaver/issues/564). Intricate details like this would be relevant.

Ultimately, I am of the opinion that formats are an implementation detail internal to the client/server components.

I agree.

m-mohr commented 9 months ago

What would be the point to say a server is OGC API - Processes compliant, if the core requirement to have the expected process description is not fulfilled? OGC API - Processes should not adapt to every possible implementation. It should be implementation that provides support for OGC API - Processes as one of its representation of a "process".

Well, similarly I could say: why bother with OGC API - Processes? We also have all we need in openEO. I thought we were trying to somehow align them and discuss potential solutions, but if you don't want to change anything, just say so, please. It feels like: we leave OGC API as it is anyway; openEO has to change.

Not sure I understand the problem of this. If there is only 1 possible output format in openEO

A return value in openEO is not about file formats, it's about data types. Like an addition operation returns a number; there's no choice for the user. Processes in openEO and OAP are pretty different in scope.

For example, I have this process https://finch.crim.ca/providers/finch/processes/ensemble_grid_point_wetdays that uses inputs.output_format with enum: [netcdf, csv] in the schema

Great, that's what I'm looking for. Unfortunately, I've never seen that in any of the GDC implementations. There you get a choice in the return value, but no input parameter for it.

fmigneault commented 9 months ago

I agree with aligning and finding solutions. I don't think, however, that allowing any process representation is a "solution". This would be like saying */* is good enough for Content-Type because it always works. It is the same as application/json being too generic at the moment and causing the content negotiation issues mentioned above.

A return value in openEO is not about file formats

Outputs in OGC API - Processes are not exclusively files. Literal data such as bool, float, int, string is also valid. Each of those can also provide an encoding and a schema with even further validation rules. Basically, anything that JSON Schema can validate can be used for OGC API - Processes outputs as well.

This process probably provides the best (though still not exhaustive) example of various combinations of output data types: https://github.com/opengeospatial/ogcapi-processes/blob/8c41db3fbc804450c89d94b704e1d241105272ae/core/examples/json/ProcessDescription.json#L2
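For instance, a purely literal output needs nothing more than a JSON Schema (a minimal invented example):

```yaml
# Literal (non-file) output described with JSON Schema only.
outputs:
  wet_day_count:
    title: Number of wet days
    schema:
      type: integer
      minimum: 0
```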

jerstlouis commented 9 months ago

Great, that's what I'm looking for. Unfortunately, I've never seen that in any of the GDC implementations. There you get a choice in the return value, but no input parameter for it.

The normal Processes - Part 1 (1.0) way is really through the "output" setting. @fmigneault if you list multiple output formats in "outputs", all of those formats would normally be available regardless of the input values used (though I am not sure to what extent the Standard is clear about that or strictly requires it). A dedicated parameter is not needed, since you can specify the output format in the "outputs" section of the request. But as I was telling @m-mohr, introducing a dedicated input like you do is probably fine, although it is not the recommended approach.

The 1.1/2.0 way avoids the need for both the input and the "output" setting with content negotiation.

but if you don't want to change anything, just say it, please.

I do think we want to minimize the amount of changes and maximize backward compatibility with the published OGC API - Processes - Part 1: Core.

It feels like: We leave OGC API as it is anyway, openEO has to change.

My suggestion is to leave the openEO Community Standard as-is, and have the Processes - Part 3 openEO Workflow requirements class define a new capability relying on the openEO process graphs (as-is) and well-known openEO processes (as-is) while integrating these into the OGC API - Processes framework.

So nobody needs to change, since I think both communities are very reluctant to make any change to these published / well-established standards / specifications. But we would be defining this new Processes-integrated flavor of openEO, which should allow servers to easily support both, and users to easily re-use their workflows between the two flavors. Which flavor(s) become part of the OGC GDC API is another question.

fmigneault commented 9 months ago

@jerstlouis Indeed, the preferable approach would be to use the outputs' format if that is possible. However, this does not work with many applications deployed with DRU, because they usually use an input option/argument/flag passed to a command/script to select the desired output. There is no automatic method to dynamically convert a format media type/encoding/schema into an unknown command-line argument of the underlying application. I think the same situation would apply to user-defined processes in openEO. Specified inputs usually dictate the format of the obtained results.

The 1.1/2.0 way avoids the need for both the input and the "output" setting with content negotiation.

This is only true for sync and single-output execution. The inputs/outputs are still required for async and multi-output executions, because content negotiation is irrelevant there: application/json is always returned. Typically, async is used for longer-running processes that must be told in advance what the desired output is, and multiple outputs cannot all be set by a single Accept header.

Also, content negotiation is insufficient for literal values; anything would fall under Accept: text/plain. A format is much more expressive using the schema, which can select specific data types/values/structures.

Processes - Part 3 openEO Workflow requirement class define a new capability relying on the openEO process graphs (as-is) and well-known openEO processes (as-is) while integrating these into the OGC API - Processes framework.

For Part 3 openEO Workflow submission, an application/openeo+json, application/ogc+json and application/cwl+json should be considered for Content-Type. Right now, only OGC-based workflows can be submitted as application/json https://github.com/opengeospatial/ogcapi-processes/blob/8c41db3fbc804450c89d94b704e1d241105272ae/openapi/schemas/processes-workflows/execute-workflows.yaml, and we wouldn't be able to distinguish those graphs if POST'ed to the same endpoint, unless we rely on specific JSON Schema references/validation.

While this approach would allow referencing some remote openEO process, it does not resolve the issue that inputs/outputs must somehow be mapped to parameters/returns. A workflow that attempts to combine OGC-based and openEO-based processes would need a way to translate their respective process descriptions to chain step outputs. Would that simply be left as an implementation detail?

jerstlouis commented 8 months ago

@fmigneault

There is no automatic method to convert dynamically a format media-type/encoding/schema to an unknown command line argument of the underlying application

There is machinery between the process itself (which is not an HTTP API) and the OGC API - Processes implementation. This machinery could handle things like format transformation (e.g., using gdal_translate / ogr2ogr) based on process metadata about what the process execution unit itself accepts as an input or produces as an output. The process (including the machinery between the API and the execution unit) may be more capable in terms of input / output formats than the execution unit on its own. This is related to the point below. One approach would be to have the execution unit always expect a specific input and output format, and this is the only format described in the ProcessDescription and what the execution command line (e.g., for a Docker container) corresponds to, which is included in the deployed application package. However, I think implementations have some freedom as to how to make this all work.

This is only true for sync and single output. The inputs/outputs are still required for async and multi-output executions because content negotiation is irrelevant, application/json is always returned. Typically, async is used for longer running processes that must be told in advance what is the desired output, and multi-output cannot all be set by a single Accept header.

That is not the case. We introduced /results/{resultId} in #217, where an output media type can be negotiated, and this can still be done after the processing has completed, even if that format was not part of the execution request. The OGC API - Processes implementation could then use e.g., gdal_translate / ogr2ogr to convert the preserved output to the negotiated format.

I don't think the outputs / format setting is being removed from the execution request, however, so for huge outputs where you do want to take advantage of async processing, it would still be valuable for clients to indicate their preferred format there before execution, so that the later request upon completion does not still involve lengthy processing for the format conversion.

A workflow that would attempt combining OGC-based an openEO-based processes would need a way to translate their respective process descriptions to chain step outputs. Would that simply be left out as an implementation detail?

In the Processes-integrated flavor of openEO that I suggest, an implementation would automatically provide an OGC API - Processes style process description of the standard openEO processes at /processes/{processId}. How this is done would be an implementation detail. The implementation might support the openEO community standard representation at a different end-point and/or by negotiating a different media type for an openEO process description, or not at all.

fmigneault commented 8 months ago

@jerstlouis

There is machinery between the process itself (which is not an HTTP API) and the OGC API - Processes implementation. This machinery could handle things like format transformation (e.g., using gdal_translate / or2ogr) based on process metadata about what the process executable unit itself accepts as an input or produces as an output.

Sorry if I was not clear, but this is what I meant by "no automatic method". An implementation has to provide explicit support and calls to specific tools such as gdal_translate / ogr2ogr to convert I/O. Pushing the same process definition onto another server implementation will not be portable unless it provides exactly the same logic.

I believe most of that machinery should be defined by the respective processes themselves, gradually adding more conversion methods to build up a catalog of small "converter" processes for dedicated tasks. Instead of trying to support all possible conversion methods before/after the execution unit, one would instead create a workflow of processes using those reusable building blocks. Those processes would more easily be supported by all implementations, because there is no special conversion magic happening between the API and the execution unit.

IMO, the machinery in between should limit itself to reading values or pulling hrefs, validating their format/schema against the process description, and forwarding them directly to the execution unit. There should be as few conversions involved as possible. The simpler these steps are, the easier it is to obtain alternative and compatible process representations such as openEO, CWL, etc.

We introduced the /results/{resultId} in https://github.com/opengeospatial/ogcapi-processes/issues/217 where an output media type can be negotiated, and this can still be done after the processing has completed

These kinds of assumptions are reasonable for cases where the negotiated media types are easily convertible and the conversion is relatively fast to do after the execution has finished, such as converting JSON to YAML or the other examples you mentioned.

As you might have understood from all our previous exchanges, I am more often working with "heavy processing" use cases, where async is not only preferable, but a must. For example, a process that trains an ML model, which can produce either a PyTorch checkpoint or an ONNX definition, or a process that produces a LiDAR point cloud vs. a TIFF rendering of that LiDAR depending on the requested output formats. The "conversion" itself is an entire process execution call. In some cases, it may not even be possible to convert between outputs (e.g., there is no tool to convert directly, just a tool available to generate one format OR the other).

It is not realistic for a process to run twice in those cases, converting to a negotiated media type "after the fact" by re-executing it, because of the incurred resource usage and costs.

In cases where it would be possible to simply perform a media-type conversion on an existing output after execution, if the conversion logic were embedded in a separate process instead of in between the API and the execution unit, the same content negotiation on /results/{resultId} that you mention could be achieved by submitting that output as input to the specialized conversion process. Chaining I/Os from processes this way also evolves more naturally into workflows, where the expected I/O format/schema must be known in advance for each step, reducing the chances of a step failing because it could not negotiate an unexpected media type or conversion.

jerstlouis commented 8 months ago

Pushing the same process definition onto another server implementation will not be portable unless they provided exactly the same logic.

There might be a different set of supported input / output formats for the deployed process (which would be reflected in the deployed process description). Executing the process might also result in slight differences in the outputs as a result of being converted by different tools. However, that same execution unit could still be deployed to those different implementations and work as intended, so I would not call that not portable.

I believe most of that machinery should be defined by respective processes themselves.

I am of the opinion that this should be up to the implementation / profiles / deployment to decide. Certainly with the Part 3 workflows the idea is that workflows can be defined in a manner agnostic of formats. One advantage of this is that a whole chain of processes could be happening in an optimal internal format of the implementation that never gets exposed, or kept in GPU memory for heavy CUDA parallel processing, without the need to explicitly encode / load things in a particular format until the very final step where the data is returned to a client. Not explicitly using processes for loading / converting / writing specific formats can avoid superfluous steps in between the actual calculations within a workflow.

As you might have understood from all our previous exchanges, I am more often working with "heavy processing" use cases, where async is not only preferable, but a must.

Those are the cases I was mentioning for which specifying an output format in the execution request beforehand makes sense, but they could be presented as different outputs altogether, or even as different processes.

fmigneault commented 8 months ago

There might be a different set of supported input / output formats from the deployed process (that would be reflected in the deployed process description). Executing the process might also result in slight differences in the outputs as a result of being converted by different tools.

If all the I/O conversion logic is embedded into the execution unit of the process (or separate processes in a workflow chain), there is essentially no reason for any corresponding process description to be different from one server to another. The execution unit would basically dictate what it is able to accept and produce. The process description only normalizes and abstracts away how those I/Os are represented and mapped to the execution unit, such that whether the execution unit is openEO, CWL, WPS, or a plain Docker container is irrelevant.

Since conversion would be accomplished by exactly the same tools as pre/post-processing steps in a workflow chain, there should not be any difference in the produced results. If there are variations (e.g., rounding errors due to floating point), I would argue that is a misbehavior of the server due to poorly described I/Os in the process description.

One advantage of this is that a whole chain of processes could be happening in an optimal internal format of the implementation that never gets exposed, or kept in GPU memory for heavy CUDA parallel processing, without the need to explicitly encode / load things in a particular format until the very final step where the data is returned to a client.

While there might be some cases where such optimization would be beneficial, the logic required in those cases is so specific to each individual process, their I/O combinations, and their execution units that it again becomes impossible to automate. If the workflow becomes "locked" by this very specific conversion chain, because the specific I/O combinations and the server running it must be respected exactly to take advantage of such optimization, I believe this simply becomes a "very large process" where the logic is spread out across different sources instead of being properly isolated. Data lineage and reproducibility of the processes are greatly reduced, if not impossible.

I would argue that if a process requires this kind of optimization, it would be easier and more portable for the execution unit to simply implement a "step1 -> step2" script directly (with all necessary conversions between those steps) to avoid the intermediate encode/save/load/decode. So again, from the point of view of the process description and API, there would not be any additional conversion logic, and that execution unit combining the steps could be executed on any server.

jerstlouis commented 8 months ago

If all the conversion logic of I/Os are embedded into the execution unit of the process (or separate processes in a workflow chain), there is essentially no reason for any corresponding process description to be different from one server to another.

I am of the opinion that nothing in Part 2 should restrict the possibility of automatically supporting additional input/output formats, and thus automatically enhancing the process description with additional format support compared to the execution unit's. This allows, for example, keeping individual executionUnits much smaller and simpler, working off a single input format (e.g., only packaging libgeotiff in the execution unit), while automatically enabling additional ones through shared machinery (e.g., GDAL with full support for all its drivers and dependencies) outside the execution units.

Part 3 collection input / output in particular automatically requires such outside machinery: collection input implies an OGC API client that needs to negotiate whatever formats and APIs are supported by the remote collection, which may not match the execution unit's, and collection output similarly needs to support clients using different APIs, which will negotiate formats matching the OGC API server implementation's support for different formats.

Particular use cases or profiles may have a preference for the approach you mention, where there is a very thin layer between the Processes server and the executionUnit, but this approach should not be mandated by Part 2 (or that makes Part 2 incompatible with a lot of Part 3 such as collection input / output). Part 1: Core says nothing about this of course because it is completely agnostic of the concept of execution units.

While there might be some cases where such optimization would be beneficial, the logic required in those cases is so specific to each individual process...

I'm not sure I understand what you are saying in that paragraph. In Part 3 workflows, the idea is definitely not to lock in any combination, and it does aim to facilitate preserving data lineage and reproducibility. However, it allows automatic negotiation of how things will play out (without involving the end-user client) at every hop between two nodes in the workflow (whether happening internally on the same server, or spread across two servers where one server acts as a client to the other).

So again, from the point of view of the process description and API, there would not be any additional conversion logic, and that execution unit combining the steps could be executed on any server.

I was also considering cases where such processing spans datasets and processes spread across different deployments (potentially of the same software, which may have an affinity for a particular format).

While it is possible to create a single process that implements the full workflow (whether the components are all on the same server, or involve external Processes implementations), this single process can be implemented as a chain of processes, and that chain can also be exposed as the source workflow.

fmigneault commented 8 months ago

I don't see what Part 2 or Part 3 have to do with how conversion logic should be encapsulated in the respective processes. Part 2 only allows them to be deployed dynamically, while Part 1 would require them to be predefined on the server, but the principle applies either way. Part 3 allows them to be chained automatically into a workflow, but you could still do a processing chain "manually" with Part 1 by executing a "core" process, followed by the "conversion" process using the output of the "core" process as input. The conversion machinery does not need to reside in the API.

This allows, for example, keeping individual executionUnits much smaller and simpler, working off a single input format (e.g., only packaging libgeotiff in the execution unit), while automatically enabling additional formats through shared machinery (e.g., GDAL with full support for all its drivers and dependencies) outside the execution units.

You seem to be describing exactly what I mentioned about using small building blocks. My recommendation is that the shared machinery would simply be a gdal-converter process that accepts a bunch of input formats and can produce all their derivatives. The server doesn't need to have GDAL preinstalled. It does not even need to reside on the same server. If you want to allow a user to negotiate a different media type on /results/{outputID}, the logic under that endpoint could simply call gdal-converter under the hood with the original output and return the converted one. The code under /results/{outputID} does not need to care about "how" to convert the data; it dispatches that operation to gdal-converter. At the very least, by having a dedicated gdal-converter process, you can now run that conversion logic with any desired input; you don't have to depend on previous job results existing.
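
As a sketch of that dispatch (the gdal-converter process name, its data/converted I/O identifiers, and the job URL are all hypothetical), the logic behind /results/{outputID} could POST something like the following to /processes/gdal-converter/execution when a client negotiates PNG:

{
  "inputs" : {
    "data" : { "href" : "https://example.org/ogcapi/jobs/1234/results/output" }
  },
  "outputs" : {
    "converted" : { "format" : { "mediaType" : "image/png" } }
  }
}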

Part 3 collection input / output in particular automatically requires such outside machinery [...] that makes Part 2 incompatible with a lot of Part 3 such as collection input / output

Again, that machinery could be a process. The "execution unit" of the workflow that needs to chain two processes with a collection could simply call an intermediate ogc-api-client-handler process that does the resolution of negotiated types. The logic doesn't have to reside in the code of the API doing the workflow. The collection is a data structure just like any other, and a dedicated process that knows how to parse it and obtain the requested results from it for the next process makes more sense. In that regard, this does not introduce any incompatibility with Part 2. If anything, it makes things more flexible, because you now get more converter processes that can be combined in workflows however you want.

Part 1: Core says nothing about this of course because it is completely agnostic of the concept of execution units.

Even if "execution unit" is not explicitly in Part 1, the implementation will at some point run some code to do the processing. Call that however you want, but that code could be done in either of those methods:

  1. API does convert_input -> Process does core_compute -> API does convert_output
  2. Big Process does convert_input -> core_compute -> convert_output all in one
  3. Process convert_input, Process core_compute and Process convert_output are called one by one (Part 1) or in a workflow (Part 3).

My recommendation is again to go for the 3rd approach, because it can be ported to basically any other server (especially if the process was obtained through a Part 2 deployment), without side effects or hidden logic coming from the API as in approach 1. Approach 2 could be valid for optimization purposes, to avoid intermediate save/load, keep items in memory, and so on, but I would employ that strategy only for the rare cases that benefit from the optimization, because big monolithic logic like this is quickly limited in the number of cases it can handle. However, even the big process of approach 2 could be chained within a workflow as the core_compute of approach 3, extending its supported I/O conversions even further. Therefore, there is really no reason to favor approach 1 in my opinion.
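
As a minimal sketch of the 3rd approach expressed as a Part 3 nested workflow (all process names and the input href are hypothetical), with each conversion step isolated in its own process rather than hidden in API logic:

{
  "process" : "https://{some-server}/processes/convert-output",
  "inputs" : {
    "data" : {
      "process" : "https://{some-server}/processes/core-compute",
      "inputs" : {
        "data" : {
          "process" : "https://{some-server}/processes/convert-input",
          "inputs" : { "data" : { "href" : "https://example.org/data/scene.jp2" } }
        }
      }
    }
  }
}

With Part 1 alone, the same chain could be run step by step, passing each job's result href as the input of the next execution request.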

I'm not sure I understand what you are saying in that paragraph. [...]

The idea was that if you are, for example, using a GPU to do some processing, and you want to leave the data in memory so it can be converted to something else, the conversion being called would need very specific code to handle the GPU logic and the specific conversion strategy for the input/output data format. If another process used CPUs instead, the same code would probably not work directly. The same goes for other data formats that need adapted logic. In other words, you would need a very specific implementation for every possible use case. Therefore, my point was that if you do have a use case that benefits from such a specific implementation, you might as well package it as a dedicated process. For all other cases, where the gain from preserving the data in memory this way would be negligible, having dedicated processes that handle the conversion from one type to another, even with redundant save/load and encode/decode between processes, would be much more scalable and portable across servers.

jerstlouis commented 8 months ago

I don't see what Part 2 or Part 3 have to do with how conversion logic should be encapsulated in the respective processes.

Specifically the Section 8: Collection Input and Section 11: Collection Output requirements classes. See also Section 6.2.5: Considerations for collection input / output.

What you're describing is similar to the openEO approach that requires an explicit process to "load" something from the collection and "publish" a collection.

With Collection input and output we can write a workflow like:

{
   "process" : "https://maps.gnosis.earth/ogcapi/processes/RFClassify",
   "inputs" : {
      "data" : { "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a" }
   }
}

(from https://maps.gnosis.earth/ogcapi/processes/RFClassify/execution?response=collection)

and access results triggering processing like:

https://maps.gnosis.earth/ogcapi/collections/temp-exec-2744D845/map/tiles/GNOSISGlobalGrid/12/1989/2656.png (map tile)

https://maps.gnosis.earth/ogcapi/collections/temp-exec-2744D845/coverage/tiles/GNOSISGlobalGrid/12/1989/2656.tif (coverage tile)

https://maps.gnosis.earth/ogcapi/collections/temp-exec-2744D845/coverage/tiles/GNOSISGlobalGrid (coverage tileset)

The "execution unit" of the workflow that needs to chain two processes with a collection could simply call an intermediate ogc-api-client-handler process that does the resolution of negociated types.

The intent with Part 3 Collection Input / Output is specifically not to require that.

Collection Output allows presenting an OGC API Maps / Tiles / Coverages / EDR / Features / DGGS... front end, supporting content format/AoI/ToI/RoI/CRS/API negotiation on the output collection completely separately from the workflow definition.

If you extend this Collection Input / Output mechanism to how servers talk to each other, the communication can also be done entirely in an OGC API data access way. The servers do not need to act as Processes clients to access results; they can instead use OGC API - Coverages requests to trigger processing on demand. They only need to POST the subset of the execution request intended for the remote server to /processes/{processId}/execution?response=collection to get back a collection that supports OGC API - Coverages and/or Tiles.
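
For example (server-a.example.org and its MosaicProcess are hypothetical; RFClassify and the collection are from the example above), given the outer workflow below, server A would only POST the nested RFClassify portion to https://maps.gnosis.earth/ogcapi/processes/RFClassify/execution?response=collection, then access the resulting virtual collection with regular OGC API - Coverages or Tiles requests:

{
   "process" : "https://server-a.example.org/ogcapi/processes/MosaicProcess",
   "inputs" : {
      "data" : {
         "process" : "https://maps.gnosis.earth/ogcapi/processes/RFClassify",
         "inputs" : {
            "data" : { "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a" }
         }
      }
   }
}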

fmigneault commented 8 months ago

I am probably missing something...

What you're describing is similar to the openEO approach that requires an explicit process to "load" something from the collection and "publish" a collection.

Exactly, but I would do it using CWL and Docker apps in my case since this is what my server supports. There is however no need to "load" anything. The map/coverage tiles URL would simply be retrieved and passed down to the following process.

To implement collection I/O on my server, I would simply create a collection-parser process that knows how to perform whatever OGC API negotiation is supported in order to obtain resolved URLs to the relevant resources. Behind the scenes, the server would call that collection-parser process with the specified collection, and whichever tile output it obtains would be chained to RFClassify, which expects an input image.

The distinction I am highlighting is that, if I wanted to understand how collection-parser resolved the input collection, I would be able to do so by calling it directly, by itself, without the subsequent RFClassify logic. I could also manually take the resolved tile result from the collection-parser execution and manually execute RFClassify with the corresponding tile image retrieved from the collection as input. That would be the same as if I had directly found that image using the coverage API and passed the URL reference to RFClassify myself. The logic of "how to parse a collection" would not be hard-coded by the server between parsing the HTTP request and forwarding to RFClassify; it would be a dispatched execution request to a collection-parser process that holds this logic.

Converting the output into a collection would likewise be handled by some kind of collection-maker process. That process would take care of any necessary collection creation, registration of underlying resources, etc., from an input. In this example, its inputs would be the output image produced by RFClassify and some lineage metadata from the RFClassify execution that generated it.

jerstlouis commented 8 months ago

I am probably missing something...

Yes :)

With collection input / output in Part 3 workflows, the collection-parsing and collection-making are a pre-registration step that is done only once, when first registering the workflow. That only happens when you click the Setup collection output button. This validates the entire workflow and sets up negotiation of compatible APIs and formats between the different hop nodes (otherwise the client will get a 400 Failure to validate workflow). It makes all components aware that they will be working together in that pipeline and are ready to roll.

All future requests for a specific AoI/ToI/RoI (or Tile or DGGS Zone Data) use that already-registered workflow (which can span collections and processes across multiple servers), and only trigger the processing workflow chain (which does not involve any "parse collection" or "make collection" step) for the specific region/time/resolution being requested. It will not create any new resources (no POST methods; all resources already exist virtually, and their content gets actualized/cached the first time it is requested with a GET, or beforehand if a server is preempting further requests).
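
As an illustration only (the exact content depends on the implementation and the negotiated APIs; the links shown are indicative), the registration step could return an OGC API collection description along these lines, and all subsequent on-demand GET requests are made against the resources it links to:

{
  "id" : "temp-exec-2744D845",
  "extent" : { "spatial" : { "bbox" : [ [ -180, -90, 180, 90 ] ] } },
  "links" : [
    { "rel" : "self",
      "href" : "https://maps.gnosis.earth/ogcapi/collections/temp-exec-2744D845" },
    { "rel" : "http://www.opengis.net/def/rel/ogc/1.0/coverage",
      "href" : "https://maps.gnosis.earth/ogcapi/collections/temp-exec-2744D845/coverage" }
  ]
}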

fmigneault commented 8 months ago

This feels like a shorthand convention (which is OK), but it could still be defined explicitly with a workflow like:

{
  "process": "https://{some-server}/processes/collection-maker",
  "inputs": {
     "process" : "https://maps.gnosis.earth/ogcapi/processes/RFClassify",
     "inputs" : {
        "data" : { 
            "process" : "https://{some-server}/processes/collection-parser",
            "inputs": { "source-collection": { "href": "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a" } }
            "outputs": {"image-found": {"format": {"mediaType": "image/tiff; application=geotiff" } } }
        }
     }
  }
}

The obtained workflow would be validated in the same manner, and could be executed using a CWL or openEO approach. Using Part 2, that workflow would be deployable instead of being executed immediately.

jerstlouis commented 8 months ago

Yes, you could do something like this, and that is using the Part 3 "Nested Processes" requirements class.

But whether you execute it or deploy it, when that workflow is executed it will process the entire input collection, unless you add a parameter to restrict the execution request to an AoI/ToI/RoI.

The whole point of Collection Input / Output is to have collections as first-class objects for which you do not need to specify API/AoI/ToI/RoI/format, allowing you to express the workflow in a manner agnostic of all this.

The output collection exists for the entire spatiotemporal extent and resolution and in all supported formats, and is accessible by regular OGC API clients like GDAL without having to actually process the whole thing; processing is triggered on demand.

What you're saying essentially is that you can do things without Collection Input / Output. Yes, of course. It's an optional requirements class, and a server or client can support other things in Part 3 like Nested Processes and OpenEO process graphs without implementing Collection Input / Output.

And you can integrate implementations that support it with ones that don't, either by using something like collection-maker / collection-parser, or by using an "href" pointing directly to the coverage output, for example.

fmigneault commented 8 months ago

Yes. Of course the example I provided is not complete. You would need additional parameters in collection-parser to indicate how to filter the collection and obtain the specific image-found output of interest. I focused more on the nested structure here.

What is important in that case is that I can easily map that nested OAP workflow structure to a CWL workflow representation. The same would be possible with an openEO processing graph after converting the inputs/outputs into arguments/returns. This is possible only because each processing component in the chain is encapsulated in its own Process. There is no hidden conversion logic between the steps.

If collection-parser returned a COG (or any other data format) instead of a GeoTIFF, I could very easily add a cog2geotiff-converter process between collection-parser and RFClassify to fulfill the input format needed by RFClassify. If the cog2geotiff-converter logic was not defined in a dedicated process, there would be no way to guarantee that the workflow chain would resolve correctly, since the types wouldn't match: collection-parser [COG] -> X -> [GeoTIFF] RFClassify. Your server could somehow negotiate the required conversion, but that workflow would 100% fail with other servers that do not, hence it is not portable.
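
For instance (reusing the hypothetical {some-server} processes from my earlier example, plus an assumed cog2geotiff-converter with a converted output), the converter becomes an explicit hop so the types line up as collection-parser [COG] -> cog2geotiff-converter [GeoTIFF] -> RFClassify:

{
   "process" : "https://maps.gnosis.earth/ogcapi/processes/RFClassify",
   "inputs" : {
      "data" : {
         "process" : "https://{some-server}/processes/cog2geotiff-converter",
         "inputs" : {
            "data" : {
               "process" : "https://{some-server}/processes/collection-parser",
               "inputs" : { "source-collection" : { "href" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a" } }
            }
         },
         "outputs" : { "converted" : { "format" : { "mediaType" : "image/tiff; application=geotiff" } } }
      }
   }
}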

Depending on servers to somehow automatically negotiate/convert the types between steps greatly increases the chances of the workflow suddenly failing.

jerstlouis commented 8 months ago

You would need additional parameters in collection-parser to indicate how to filter the collection and obtain the specific image-found output of interest.

The idea of collection input / output is that you can represent the whole unfiltered input / output collections, preserving the ability to request small parts of them using OGC APIs. It's a late-binding mechanism for these configuration options.

Because CWL (and openEO) do not have a notion of an OGC API collection as a first-class object, I don't think a workflow making use of them could be mapped directly to either.

However, a server-side implementation of Collection Input / Output could decide to map the workflow to an internal openEO or CWL workflow taking additional parameters for AoI/ToI/RoI/format/API (or, in the case of API or format, possibly selecting the appropriate helper processes for the task), which the Processes - Part 3 Collection Input implementation could map to. To respond to a client request against the not-fully-realized/on-demand Part 3 output collection, the Part 3 implementation would trigger that CWL or openEO workflow (which contains those extra processes, e.g., to load a particular collection and convert to the particular format that the client expects), filling in the AoI/ToI/RoI/format/API parameters of that workflow. This would be some of the "extra machinery" of the API, but internally it could still use CWL directly or a pure Processes - Part 1 approach without the Collection Output first-class object.

Your server could somehow negotiate the required conversion, but that workflow would 100% fail with other servers that do not, hence it is not portable.

The workflow validation step, which happens during registration, would already perform the negotiation, and the idea is to report the failure before any actual processing is done. The negotiation happening at every hop has the client (or the server acting as a client) look at the server's conformance classes / capabilities and ensure the server supports an API / format that works for the client side of the hop. It could also do further validation to make sure things work as expected. So if the workflow fails with another server, it will fail at registration time (when POSTing the workflow to /processes/{processId}/execution?response=collection).

Depending on servers to somehow automatically negotiate/convert the types between steps greatly increases the chances of the workflow suddenly failing.

My view is exactly the opposite. Not requiring a particular format at a specific step of a workflow greatly improves the chances that the client and server sides of the hop can find common ground between their respective capabilities. For example, if I enforce GeoTIFF and EDR at a particular hop, and either the client or the server does not support GeoTIFF, the workflow validation will fail. But if I leave that open, maybe they will find out that they both support JPEG-XL and OGC API - Tiles and can interoperate that way. Then I can take the same workflow and change one side of that hop, and this new hop is now able to operate with GeoTIFF and Coverages. Only the collection or process URL had to be changed in the workflow execution request; everything else stays exactly the same. As an end-user client, I don't have to bother figuring out which format / API each hop supports. I just discover compatible OGC API Collections and Processes and can easily assemble a workflow this way.
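
To make the contrast concrete (the result output identifier and the format qualifier follow the earlier hypothetical examples), the same hop could either pin the exchanged format:

   "data" : {
      "process" : "https://maps.gnosis.earth/ogcapi/processes/RFClassify",
      "inputs" : { "data" : { "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a" } },
      "outputs" : { "result" : { "format" : { "mediaType" : "image/tiff; application=geotiff" } } }
   }

or leave it entirely open for the two sides to negotiate:

   "data" : {
      "process" : "https://maps.gnosis.earth/ogcapi/processes/RFClassify",
      "inputs" : { "data" : { "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a" } }
   }

In the first form, validation fails as soon as either side of the hop lacks GeoTIFF support; in the second, the two sides are free to settle on whatever API and format they have in common.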

If a hop is not interoperable (no common ground on API / format), this feedback is received as a workflow validation failure at the registration step, before any actual processing is attempted, and the workflow can be fixed.