opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

Clarify input/output field modifiers of Part 3 collections #426

Open fmigneault opened 1 month ago

fmigneault commented 1 month ago

When looking at the definitions (and subsections) of the Input Field Modifiers and Output Field Modifiers requirement classes, the same terminology, field names (filter, properties, sortBy), and intention are reused for each. This makes it very hard to understand "where" each of those fields is expected.

Furthermore, the only available example (https://docs.ogc.org/DRAFTS/21-009.html#_coastal_erosion_susceptibility_example_workflow) mentions that the input/output modifiers are both used, which doesn't help disambiguate them.

Adding to the ambiguity, the Input Field Modifiers are used to request/filter a certain collection (possibly remote), whose resulting items/features/tiles/coverages are then used as input for the process. On the other hand (to my understanding), Output Field Modifiers would be used to perform further filtering/sorting/derivation of values on a collection resulting from the processing, to be made available for another step or as the final workflow result. Since each of these pre/post-filters could be interchanged in some cases, or can be seen (implemented) as sub-processes themselves with inputs/outputs, the operations and applicable requirement classes rapidly become (in their current state) confusing and indistinguishable.

Explicit examples (on their own) demonstrating how Input Field Modifiers and Output Field Modifiers must be submitted in an execution request would help understand the intention.


Something to validate (I'm assuming here): Are the Output Field Modifiers supposed to be provided under the outputs of the execution request?

jerstlouis commented 1 month ago

I fully agree this needs (a lot) more examples.

Adding to the complexity, in scenarios where the APIs involved support those parameters at both the input and output levels, the extra processing could be performed at either the "output" or the "input" level.

Are the Output Field Modifiers supposed to be provided under the outputs of the execution request?

We should review that, currently it's defined directly within the top-level "process" object, or a nested "process" or "collection" object. We need to work out examples with processes returning multiple outputs and how that would work. We had already discussed how that works generally with Part 3 - nested processes in one or more of the existing issues...

In the coastal erosion workflow example, the use of "properties" everywhere is an example of Output Field Modifiers, at the level of each nested sub-process before being fed to the "data" input, as well as at the level of the top-level PassThrough process to compute the final susceptibility value.

The "properties" at the same level as the "Slope" and "Aspect" processes could be considered as either "Input" or "Output" field modifiers.

This makes it very hard to understand "where" each of those fields are expected.

The filter, properties, sortBy properties are expected within the same object containing the "process", "collection", "href", and "value" properties.

When they are at the top-level process, they are necessarily "output field modifiers".

When they are within inputs to another process, they can be supported in the workflow as input and/or output field modifiers.

For a Collection, they would always be considered "Input Field Modifiers". For a nested process, they are both "Input and Output field modifiers at the same time".

If the remote API supports resolving them (e.g., the filter= parameter in Coverages - Part 2 or Features - Part 3, the properties= parameter in Coverages - Part 2 or Features - Part 6, or Processes - Part 3: Collection Output "Output Field Modifiers"), then the work can be done on the remote API side.

Otherwise, the Processes - Part 3 implementation for the local process invoking those remote collections / processes can perform the extra processing after retrieving the collection or process output data and before passing it to that local process -- here we could think of it either as modifying the output of the nested process, or modifying the input to the parent process.

The general idea is that the extra processing can be performed at the end that supports it and is most optimal.
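
As a minimal sketch of that placement (hypothetical URLs, a hypothetical input identifier "data", and hypothetical queryables prop1/prop2), the filter inside the object containing "collection" acts as an input field modifier, while the filter inside the object containing the top-level "process" acts as an output field modifier applied to that process's result:

{
   "process": "https://example.com/processes/SomeProcess",
   "inputs": {
     "data": {
       "collection": "https://example.com/collections/SomeCollection",
       "filter": "prop1 > 10"
     }
   },
   "filter": "prop2 IS NOT NULL"
}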

fmigneault commented 1 month ago

The filter, properties, sortBy properties are expected within the same object containing the "process", "collection", "href", and "value" properties.

For a nested process, they are both "Input and Output field modifiers at the same time".

I think that might be where most confusion comes from.

To my understanding (correct me if I'm wrong), the following example is what would be expected with Part 3 to filter the Collection Input by cloud cover, filter the Collection Output of NestedProcess by datetime, and apply some sorting to the Collection Output of MainProcess for reporting as the final output:

{
   "process": "https://server.com/processes/MainProcess",
   "inputs": {
     "process": "https://server.com/processes/NestedProcess",
     "inputs": {
       "collection": "https://server.com/collections/TheCollection",
       "filter": {
         "op": "lte",
         "args": [{"property": "eo:cloud_cover"}, 10]
       }
     },
     "filter": "T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))"
   },
   "sortBy": "eo:cloud_cover"
}

It somewhat makes sense for processes that have only 1 output, since we can infer that the filter and sortBy at the same level as process apply to it, and this output becomes the input of the parent process.

However, I think embedding the filters in outputs, although more verbose, would clarify the intention. It would also be useful in the case of multi-output processes, since it can serve as the output selection at the same time. Finally, I think it can add clarification for requesting the Collection Output using response (or format: collection?), since the response=collection as query parameter affects only the top-level process execution IMO, and the NestedProcess response format remains ambiguous.

The above workflow would become the following with the verbose form, where output and result match the exact output IDs from the corresponding process descriptions:

{
   "process": "https://server.com/processes/MainProcess",
   "inputs": {
     "process": "https://server.com/processes/NestedProcess",
     "inputs": {
       "collection": "https://server.com/collections/TheCollection",
       "filter": {
         "op": "lte",
         "args": [{"property": "eo:cloud_cover"}, 10]
       }
     },
     "outputs": {
        "output": {
          "filter": "T_AFTER(datetime,TIMESTAMP('2024-01-01T00:00:00Z'))",
          "response": "collection"
        }
     }
   },
   "outputs": {
      "result": {
        "sortBy": "eo:cloud_cover",
        "response": "collection"
      }
   }
}
jerstlouis commented 1 month ago

The example looks good. We should clarify requirements regarding CQL2-JSON vs. CQL2-Text since you're using both there -- I would assume that both should be allowed if the server declares support for both.

and the NestedProcess response format remains ambiguous.

This is really intentional, allowing the implementation of MainProcess to decide how to invoke the nested process (using Part 1: Core, or Part 3: Workflows collection output if it's available, and using Tiles, DGGS, Coverages as supported by the client/server on each end...).

While I understand that, from a "reproducibility" perspective, the more explicit the workflow is, the more likely the results are to be identical, from a "reusability" perspective, the more simply the workflow is expressed, the more likely it is to be reusable on another set of deployments which may support a slightly different set of APIs / requirement classes, without requiring any modification. And if everything is done right (an important "if"), the results should still be largely reproducible (within a very small, acceptable threshold).

It's also about the endpoints along the workflow knowing better than the client building the workflow what will be most optimal, and it allows the immediate server to which the client submits the workflow to act as an orchestrator and/or optimize the workflow.

fmigneault commented 1 month ago

clarify requirements regarding CQL2-JSON vs. CQL2-Text since you're using both there

Indeed. A filter-lang=cql2-json or filter-lang=cql2-text would be needed.
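
For illustration only (filter-lang is not currently defined as a member of the Part 3 execution request, and cloud_cover is a hypothetical queryable), such a hint could look something like:

{
   "collection": "https://server.com/collections/TheCollection",
   "filter": "cloud_cover <= 10",
   "filter-lang": "cql2-text"
}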

This is really intentional, allowing the implementation of MainProcess to decide how to invoke the nested process.

I'm not sure I like this idea (for that specific case). You nailed exactly my concern. From a reproducibility perspective, we have basically no idea what is going to happen. It could even be an issue if the referenced process lives in a remote OAP that does not support Part 3. Execution would "work" on the end of that remote process execution, but the received "output" would not be a collection, causing the filter requirements to fail the whole workflow chain. Explicitly indicating that a response=collection is needed for that step would allow the local OAP to validate the workflow beforehand, including checking whether the remote OAP supports Part 3. However, if there were no filter directive, I agree that allowing the auto content/format negotiation is preferable.

jerstlouis commented 1 month ago

A filter-lang=cql2-json or filter-lang=cql2-text would be needed.

I don't think that's absolutely necessary, unless we want to consider alternative languages, since it could easily be distinguished by being a JSON object vs. a JSON string.

It could even be an issue if the referenced process lives in a remote OAP that does not support Part 3. Execution would "work" on the end of that remote process execution, but the received "output" would not be a collection,

That is actually already considered in the specification and is exactly what the "Remote Core Processes" requirement class is all about. A Part 3 implementation that supports this is able to execute a remote process implementing Part 1 and integrate it within a Part 3 workflow. So if that Part 3 implementation supports "Remote Core Processes" + "Input/Output Field Modifiers", then it is able to execute the remote process, then apply the output modifier to the output of that process (for the final output, or as input to other processes in the workflow). If a filter is applied to an input of that process, then the Part 3 implementation would need to apply the filter input field modifier to the input before passing it within the execution request when executing the process the sync/async Part 1 way.

A Part 3 implementation supporting "Collection Output" is also able to make the overall output work like a collection, even though the Part 1 process that is part of the workflow does not support that, as long as the process has a way to somehow inject the bounding box (and ideally resolution and time as well) filters for the Part 1 process. This is the functionality which we currently partially support with our MOAWAdapter, though in theory this MOAWAdapter process should not be necessary and it should "just work". It is tricky mostly because there is no clear concept of identifying parameters for the bounding box / time / resolution of interest in Part 1, except for the special "bbox" input type, which could be assumed to serve that purpose. Even if the process has no "bbox", the server could still present it as a collection by processing the whole thing first, but this would not scale with a large dataset, and registering that workflow would either fail or need to wait until the whole thing is processed to know whether the processing succeeds or not, to obtain the extent information for the output of that Part 1 process, etc.
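
For reference, a Part 1 execution request using the special "bbox" input type looks roughly like the following (the input identifier and values are hypothetical); a "Collection Output" implementation could inject the area of interest through such an input when the process exposes one:

{
   "inputs": {
     "bbox": {
       "bbox": [-71.5, 46.6, -71.0, 47.0],
       "crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
     }
   }
}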

For implementations using nested remote processes that already support "Collection Output", it is not necessary for the Part 3 implementation to support "Remote Core Processes" if it supports "Remote Collection Input" instead. That is because it can treat the remote process just like a remote collection, with the exception of submitting the partial ad-hoc workflow to initially activate that remote virtual collection. This is the ideal way to chain these remote collections, and it is also great for caching in combination with OGC API - Tiles or OGC API - DGGS.

causing the filter requirements to fail the whole workflow chain.

The key thing with ad-hoc workflows is that when the client initially submits the ad-hoc workflow (and at every hop within the workflow there is a client and a server, so it is a recursive process), the workflow is immediately validated based on the available capabilities, and a validation result is returned indicating whether the workflow will succeed or not. If there is a missing capability, e.g. no support for "Remote Core Processes" on the client side and no "Collection Output" on the server side of a particular hop, then that workflow might fail.

However, an interesting thing is that if any hop higher up in the workflow chain has a little bit of orchestration functionality and itself supports "Remote Core Processes", it could detect such a mismatch ahead of time by querying the conformance declarations of the services involved deeper in the chain, and could re-organize the workflow so that it itself acts as a client for that Part 1 process execution, submitting the input to the parent process either through a virtual input collection, or by executing the parent process in the regular sync/async way with an "href" or embedded "value" within an execution request.

So that orchestrator process up above would save the day and the workflow could still validate successfully. I really think of this kind of flexibility, which again you might well point out as introducing reproducibility issues, as a feature rather than a bug! :) I strongly believe that the kind of increased interoperability and re-usability that will emerge out of this vastly outweighs the reproducibility concerns, which I think can easily be addressed on a case-by-case basis to ensure that regardless of which path is taken, the results are within a very small margin of difference, if not identical.

An example of a reproducibility difference is the use of data tiles or DGGS zone data queries (and the use of a particular 2D Tile Matrix Set or Discrete Global Grid Reference System). This involves a particular way to partition and up/down-sample data, so some small differences are to be expected. But both of these approaches bring significant performance/caching advantages which are well worth these issues, and if the data is sampled right, always ensuring that the up/down-sampling does not significantly degrade the correctness of the data, the final results should be well within the acceptable margins compared to using e.g. a Coverages subsetting request for the entire area.

Being able to use almost identical workflows with different combinations of deployments and servers, even if they support different OGC API data access mechanisms, 2DTMS, DGGRS, encoding formats etc., will actually help to validate and compare outputs of the same workflow with more implementations, datasets, AoIs etc., which I believe in the end will actually help reproducibility.

fmigneault commented 1 month ago

I don't think that's absolutely necessary, unless we want to consider alternative languages, since it could easily be distinguished by being a JSON object vs. a JSON string.

Indeed. This is exactly what I'm doing ;) (https://github.com/crim-ca/weaver/pull/685/files#diff-d25d3121a794cd4fb10b0d700f8df011035c957d4a19ef79d051b3c70bdefbc3R1501-R1502) However, I would add them in specification examples to make things more explicit.

That is actually already considered in the specification and is exactly what the "Remote Core Processes" requirement class is all about. [...]

To my understanding, Remote Core Processes only indicates that the nested process can be on another server. The "Input/Output Field Modifiers" allow additional filtering of the result. So, the server would need to support "Collection Output" explicitly as well. As it stands, "Input/Output Field Modifiers" does not imply "Collection Output" necessarily.

I can see that adding "filter" can hint at "Collection Output" being used, but that could also be used for other cases than response=collection. I'm thinking about extensibility, here, where there could be other kinds of response also taking advantage of filter.

Another issue I just realized is that Part 3 adds response=collection as a query parameter, instead of simply reusing the one already available in the body: https://docs.ogc.org/is/18-062r2/18-062r2.html#toc32. If the response property were allowed within an output definition for Collection Output, as well as within a Nested Process, it would make the execution request more consistent.

The key thing with ad-hoc workflow is [..] the workflows are immediately validated based on the available capabilities, and a validation result is returned whether the workflow will succeed or not.

I think it is a strong assumption that this can be accomplished. If, for example, there is a chain of nested processes (root <- proc1 <- proc2) where it is expected that each step passes a Collection input/output, there is no way for the root <- proc1 process to guarantee the results will be a Collection, since it does not know "yet" what proc2 will return. It can try to validate whether the I/O "align" in capabilities, response type, and media-types, but this could change once the actual execution happens (because servers do not always do what they advertise, or they resolve "defaults" differently due to the allowed flexibility).

Because there is a chance of ambiguity (for which the API could refuse to execute "just in case"), Part 3 must allow parameters such as response that let the client explicitly resolve the intention. To be clear, I'm not saying response: collection would be required (we want to keep the flexibility/auto-resolution capability), but auto-resolution cannot be assumed to always be possible. The client must have the option to indicate it.

jerstlouis commented 1 month ago

To my understanding, Remote Core Processes only indicates that the nested process can be on another server.

That is correct.

As it stands, "Input/Output Field Modifiers" does not imply "Collection Output" necessarily.

That is also correct.

The "Input/Output Field Modifiers" allow additional filtering of the result. So, the server would need to support "Collection Output" explicitly as well.

Why do you arrive at that conclusion? After invoking the remote Part 1 process, the Part 3 implementation supporting Input/Output field modifiers can perform the additional filtering/deriving/sorting operations itself.

If response property was allowed within an output definition for Collection Output as well as within Nested Process, it would make the execution request more consistent.

The idea was to not specify the execution mode (collection output, sync, async) in the execution request, for the same reason that a Part 1 execution request can be executed sync or async with the Prefer: header only. This allows processes servers at the intermediate hops to decide on their own how they want to invoke the deeper processes, based on what capabilities are supported there. In particular, this considers orchestration scenarios where the workflow could be re-organized.

I believe "response" (raw/document) is gone from the execution body in 1.1/2.0.

there is no way for the root <- proc1 process to guarantee the results will be a Collection, since it does not know "yet" what proc2 will return.

When a Part 3 / Collection Output implementation handles an initial workflow registration, it needs to validate that the nested processes will work as expected, and this is largely why the validation capability would be very helpful for this purpose. In our MOAWAdapter implementation, what we do at the moment is submit a small portion of the BBOX to do a quick processing test to know whether the execution will succeed and what it will return before we successfully return a collection with a level of confidence that things will work.

But in general, it is the Part 3 Collection Output implementation that creates a collection. Regardless of what is returned by the processes underneath, it presents the final output as a collection.

fmigneault commented 1 month ago

Part 3 implementation supporting Input/Output field modifiers can perform the additional filtering/deriving/sorting operations itself.

You're right. It simply requires the server to return "something" that is filterable. I guess my assumption was driven by the very natural relation between collection and filter while reading Part 3.

I can see an issue however regarding that situation.

If a server wants to support filter ONLY when a Collection Output is implied (since it would otherwise need to handle the process output as a Core value/href representation anyway), it doesn't really have any way to indicate that. In other words, if both conformance/requirement classes are provided for Collection Output and Output Field Modifiers, we cannot know if they are mutually inclusive, or whether modifiers apply anywhere. It might be relevant to have a requirement that only applies to collections, and another "generic" one for value/href only. Or, should that just be up to the server to immediately respond HTTP 400 or HTTP 501 with details about the unsupported filter for certain combinations? If so, that could be an additional mention in the requirements (which codes, what error to report, etc.).


not specify the execution mode (collection output, sync, async) in the execution request

I think this is mixing things here.

The execution mode sync/async is irrelevant IMO. The server interrogates the remote process however needed, and obtains the value directly or monitors the job and retrieves it afterwards. This is not important for chaining the steps. However, the response structure (i.e. Prefer: return=minimal|representation) matters a lot to chain the processes correctly.

I must say I find it extremely ironic that we went through all this issue about Prefer replacing response: raw|document, just to end up reintroducing response=collection for Collection Output as an alternate representation of the response.

I also strongly believe that it is not enough to simply pass collection at the top-level of the execution request, whether using ?response=collection query, a body response: collection or some alternative with Prefer header. There are a few reasons for that:

  1. A client needs to have the option to say that they want the "full workflow" to return response: collection for individual outputs, and that intermediate steps should use response: raw|document/Prefer: return=minimal|representation outputs to chain as intended. Note that I am not saying they have to provide them (to allow flexibility), but they need some way to do it. This is useful if the server cannot resolve an ambiguity automatically, and the client can "hint" it in the right way.

  2. The response: raw|document|collection within a Nested Process or Collection Output could make a big difference regarding parsing of the result. Since each representation could be represented as a different JSON, this impacts the filter modifier that must be provided for each case, and therefore having a "selection" of the response format can be critical for the intended filtering.

  3. If the workflow happens to have 2+ outputs, where some are used for the Collection Output and others are obtained by link (e.g., some additional metadata files), there is no way with only ?response=collection to indicate which output goes where. I do not see any reason for the Collection Output capability to be limited to 1-output processes (Requirement 19B), since the collection is indicated by Location anyway. The response could have additional Link headers for the other relevant outputs.


it needs to validate that the nested processes will work as expected

I'm not sure how to explain this differently, but I don't think it is always possible once a certain number of workflow steps is reached, especially if some steps imply data manipulations such as merging bands, aggregating items or conditional logic based on inputs, which can change the outcome of how the workflow should execute at runtime.

If those steps are not very limited in the workflow, you simply have no idea what the actual result will look like much later on until it is executed, because each step depends on what the previous steps produce. Therefore, you cannot "validate" the workflow as a whole. You can end up in situations where two processes that would seemingly be chain-able with a matching subset of media-type/format I/O no longer work once their actual step execution is reached, because the I/Os to chain were modified by earlier conditional logic.

Even when explicitly indicating the desired type, format, response, etc. for each step, validation cannot always be guaranteed. However, it tremendously helps to reduce erroneous resolution pathways.

jerstlouis commented 1 month ago

"something" that is filterable.

Correct, and feature collections and coverages are both filterable and derivable (properties of features correspond to the range fields in coverages, where individual cells can be filtered out).

If a server wants to support filter ONLY when a Collection Output is implied .. it doesn't really have any way to indicate that.

As it stands, support for input/output field modifiers always applies (not only for Collection Output -- it is an orthogonal capability), regardless of how the input was received. If the remote process or collection has some matching "filter" or "properties" parameter to do the work, the implementation is encouraged to use it, as it may speed things up by transferring less data overall, but there is no expectation that it needs to do so. It can be thought of as an optimization compared to always doing the work itself, as if executing a remote process or fetching a file from an href the Part 1 way.
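
As a sketch of that optimization (assuming a hypothetical remote collection with a cloud_cover queryable that implements the filter= parameter), an input such as:

{
   "collection": "https://server.com/collections/TheCollection",
   "filter": "cloud_cover <= 10"
}

could be pushed down by the Part 3 implementation as something like GET /collections/TheCollection/items?filter=cloud_cover%20%3C%3D%2010, or evaluated locally on the retrieved data when the remote server does not advertise that capability.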

The execution mode sync/async is irrelevant IMO

For me "collection output" is a third execution mode just like sync and asyc, and is also irrelevant (except it does make a lot of things easier when using it, such as filling in the AoI/ToI/RoI on-demand).

the response structure (i.e. Prefer: return=minimal|representation) matters a lot to chain the processes correctly.

Why does that matter? As long as you have a clear way to retrieve the output once things are ready...

just to end up reintroducing response=collection for Collection Output as an alternate representation of the response.

response=collection is really much more similar to the sync/async execution modes than the raw vs. document distinction.

This is useful if the server cannot resolve an ambiguity automatically, and the client can "hint" it in the right way.

For me this goes against the design. There is no reason why the clients/servers along each hop couldn't negotiate among themselves the best way to do things. The principle here (which you might disagree with) is that the implementations know best -- not the user. All the user should be doing is expressing their workflow in the simplest and most natural way possible, and leaving it up to the implementations to figure out the best way to do it (at each hop).

Since each representation could be represented as a different JSON, this impacts the filter modifier that must be provided for each case, and therefore having a "selection" of the response format can be critical for the intended filtering.

I am really not following how minimal vs. representation (raw vs. document) matters at all here. These things conceptually do not modify the information being returned at all. They only change how a Processes client goes about retrieving the information. If the client gets a link back, it has to perform an extra GET operation to get to the actual data. This really has no impact at all on filter or any of this. Also, since it is a preference, there is no obligation for the server to apply the preference anyway. Maybe you can help me understand what your thinking is here, I must be missing something...

If the workflow happens to have 2+ outputs,

For 2+ outputs we have response=landingPage where the process can return a full OGC API landing page that can have a /collections where each output is a separate collection.

Regarding using Link, it is also because response=collection returns a 303 redirect, which the client can then use to proceed as usual. E.g., the GDAL OGC API driver supports process execution this way right now.

Returning a single landing page Link for multiple outputs would be more similar to accessing an OGC API that has multiple collections.

you simply have no idea what the actual result will look like much later on until it is executed, because they depend on what will be produced on the previous steps.

The Part 3 workflows are really rooted in the concept of emergence:

In philosophy, systems theory, science, and art, emergence occurs when a complex entity has properties or behaviors that its parts do not have on their own, and emerge only when they interact in a wider whole.

where I really strongly believe that we will be able to achieve very powerful things with it, as long as we keep to the concept of each individual part having a well-defined interface, with the parts assembled together in a simple way.

The OGC APIs provide the connectivity between distributed data and processes, and we will have as a result a system-of-systems where all deployed OGC API implementations can work together as a single global distributed geospatial computing powerhouse.

The goal is instant integration, analytics and visualization of data and processes available from anywhere.

Each process has a process description, including the supported output formats, and each deployment declares its conformance to the different supported OGC APIs and requirement classes, so it should be possible at each hop for the implementation acting as a client to know exactly what it can get back from the server it will invoke.

There is always a "simple" way to execute a Part 3 workflow, where you simply pass along the nested process object to the remote process, and only do your part. Alternatively, a server higher up in the chain could decide to take on an orchestration role and re-organize things to improve efficiency.

In any case, validating should be possible: if everything supports "Collection Output", everything can be validated at the time of submitting the ?response=collection execution request.

The use of GeoDataClass would also greatly help in terms of knowing exactly what to expect in the response (a GeoDataClass implies a particular schema, i.e., which fields you will get back and will be able to filter or derive from, including the semantic definition information), which is something that the process descriptions might not otherwise cover, e.g. which bands you will find in a GeoTIFF.

fmigneault commented 1 month ago

support for input/output field modifiers always applies [...]

The issue with this is that using Collection Inputs/Outputs without filter support is almost useless (hence why the two are easily assumed to go together), because there is very often a need to subset the collection for processing or chaining purposes. I believe that filtering a collection could be done on its own fairly easily, without much impact on the rest of the capabilities. Having to deal with all modifier combinations for "normal" value/href inputs/outputs, however, is an entirely different kind of implementation, since there are many more implications depending on the formats.

If the remote process or collection has some matching "filter" or "properties" parameter to do the work, the implementation is encouraged to use it

For this I 100% agree, it is very easy to "hand off" the work to the remote location, but here we were referring to the case where the remote process or collection does not support it, and therefore it must be handled by the local process instead. Because that case is a possibility, it places a bigger implementation burden on the local process, since it must try to handle all combinations. And no server is ever going to handle all combinations.

For me "collection output" is a third execution mode just like sync and asyc response=collection is really much more similar to the sync/async execution modes than the raw vs. document distinction.

That doesn't make much sense to me. Since the process can run either in sync or async, and return a collection link in the response in both cases, it is not a distinct mode. I think it is easier to consider it as a special media-type for a link reference rather than a new mode. Since a link is the expected response from a collection response, I do not see why it should be considered any differently from value/href. It just has "extra flavor" with OGC APIs that allows you to query it further afterwards, but so could a Core href if the server wanted to support it.

I am really not following how minimal vs. representation (raw vs. document) matters at all here.

If a JSON representation {"output": {"href": "https://..."}} were returned instead of a typical GeoJSON FeatureCollection, a filter referring to specific properties expected in the GeoJSON would parse the result completely wrong. Since it is not explicit which server (local/remote) should perform the filter parsing (because both are allowed by input/output modifiers), the execution must try to make educated guesses. In some cases it might be obvious (great if so), but in others not. The JSON structure could be very similar in both cases, yet not have the exact filtering properties needed to resolve correctly.

If the workflow were executed using {"response": "collection", "filter": "..."} as the nested process output, that would make it very explicit that the nested process is expected to return a collection response and should be filtered as such. As previously mentioned, input/output modifiers do not necessarily imply a collection, so the other way around, where the filter is instead expected to be applied to the output/href, could be a valid use case as well. Omitting "response": "collection" in this case could be a sensible default, or "response": "raw|document" could be used explicitly as well. In this kind of situation, a response embedded in the nested output would help resolve where filter applies.
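
To illustrate with hypothetical data: a filter on an eo:cloud_cover property has an obvious target when the step yields the actual features, e.g.:

{
   "type": "FeatureCollection",
   "features": [
     {
       "type": "Feature",
       "geometry": {"type": "Point", "coordinates": [-71.2, 46.8]},
       "properties": {"eo:cloud_cover": 5, "datetime": "2024-03-01T00:00:00Z"}
     }
   ]
}

but not when the step yields a results document that merely links to them, such as the {"output": {"href": "https://..."}} example above, which is why an explicit response at that step would remove the guesswork.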

For 2+ outputs

The thing is that not all processes return outputs that make sense as a "Collection". Imagine a process that returns result and report. The result could be some annotated georeferenced points of interest, perfect for a collection output, whereas report would be a summary analysis spit out by ChatGPT about those points. The process cannot be called separately for each output, since they are dependent on each other. To take advantage of Part 3 Collection Output, I would need something like:

{
  "outputs": {
    "result": {"response": "collection", "filter": "<points I care about>"},
    "report": {"transmissionMode": "reference"}
  }
}

And now, I can get both an HTTP 303 Location for the result collection and a Link to the report simultaneously. If I simply used ?response=collection, that process could only tell me "sorry, I don't know how to encode a ChatGPT report into an OGC API collection".

The Part 3 workflows are really rooted in the concept of emergence:

Maybe, but in the end, I just want the process execution to succeed 😅 If it nags me that it cannot resolve how to parse the workflow, I must give it a hand. The reality of complex data structures and increasingly convoluted workflows is that it generates a lot of ambiguity, and sometimes you just need to define some things to shine a light on the intent (there wouldn't be so many workflow languages otherwise!).

I reiterate, parameters like response must be an option, not a requirement. If the operation is able to auto-resolve without it, then all good!

If everything supports "Collection Output", everything can be validated at the time of submitting the ?response=collection

It is a noble goal, but I highly doubt that is a fact.

jerstlouis commented 1 month ago

it places a bigger implementation burden on the local process, since it must try to handle all combinations. And no server is ever going to handle all combinations.

There is really only one combination, which is the ability to filter (or derive) data (of all the data types supported by the processes). There could potentially be some exceptions allowing a Not Implemented HTTP code to be returned for specific cases, e.g. if you only support filtering on feature collections and gridded coverages, but not point clouds. It also means supporting this for all input/output formats that are supported by the engine, but if some format conversion engine is in place, that should not be a big burden.

The ability to pass along a "filter=" or "properties=" to a remote endpoint to handle the filtering is there if the remote server supports it, but not all Processes implementations will support input/output field modifiers, and not all Features and Coverages implementations will support it. Therefore, implementations will need to support doing the filtering on their own anyway, so that it can be applied to local processes and collections, and to remote collections and processes that do not implement this filtering / deriving. So this local modifiers capability is necessary anyway, and can be used as a fallback for any scenario where it can't be done on the remote side.

That doesn't make much sense to me. Since the process can run either in sync or async, and return a collection link in the response in both cases, it is not a distinct mode.

The most important distinction with "Collection Output" is that it can support "on-demand" processing of a particular Area/Time/Resolution of interest.

So in the majority of cases where I expect this to be used, where localized processing for an ATRoI is possible, it would not be the same as a sync/async execution that produces the whole collection.

I do not see why it should be considered any differently from value/href. It just has "extra flavor" with OGC APIs that allows you to query it more after, but so could a Core href if the server wanted to support it.

An "href" to some data automatically implies a specific area / time / resolution of interest. The Collection concept leaves that open: "here's the input I want to use", implying to use the relevant parts based on what is currently being processed.

If a JSON representation {"output": {"href": "https://..."}} was returned instead of a typical GeoJSON FeatureCollection, a filter referring to specific properties expected in the GeoJSON would parse the result completely wrong.

I'm confused about what you're saying here. As we said earlier, the processing mechanics is aware of these distinctions and knows at which point it has the actual data. The filtering on the features is always applied to the actual data, whether it's in GeoJSON, Shapefile, GeoPackage... It would of course never be applied on the JSON "document" (results.yaml) response which is just links to the process results. Is that what you are concerned about?

Imagine a process that returns result, and report. The result could be some annotated georeferenced points of interest, perfect for a collection output, whereas report would be a summary analysis spit out by ChatGPT about those points.

If it makes sense to execute this process in a localized manner (which really is what makes for the perfect scenario for collection output), we should consider that the actual processing would be done several times for different ATRoIs. I imagine the issue here is that it is a "summary" report only for that particular subset? Would that really be useful? If it contained per-point information, then this information could be embedded as properties of the point features. If useful, summary information could also be added as additional metadata in the feature collection subsets, but that would not fit so well across different formats. The report would be different for every /items or /items/{itemId} request that you make for the collection, since it processes something different. (in general access by {itemId} is not very well suited for on-demand processing -- /items?bbox= or /coverage?subset= is a more typical example of on-demand processing).

If this is really not a localized process, and the purpose of the collection output is not so much the on-demand processing, but just the convenience of having an OGC API collection supporting OGC API - Features as a result, then this is slightly different than the main use case for "Collection Output". We have a similar scenario with our OSMERE routing engine, where the route is not an on-demand thing but must be fully calculated before we can return the resulting feature collection for the calculated route. If we still want to return this summary report and execute the process using Collection Output, one solution might be to provide the summary report as metadata linked from the collection description, rather than as a separate output.

The reality of complex data structures and increasingly convoluted workflows is that it generates a lot of ambiguity, and sometimes you just need to define some things to shine a light on the intent (there wouldn't be so many workflow languages otherwise!).

I understand that this may be the case, and I admit that I might be overly optimistic and idealistic. However, I do feel that the simplicity with which you can nest processes and collections with Part 3 workflows, the OGC API access mechanisms, and GeoDataClasses corresponding to a particular data schema, can avoid most of the ambiguity and convolution that arise from trying to define everything explicitly. A process takes inputs, produces some outputs, and you can plug the output of one process as an input to another process. It should just work, as long as the interface contracts are respected.

It is a noble goal, but I highly doubt that is a fact.

When you request a collection output from a process, it validates the execution request and, if everything is good, also submits the immediate sub-workflow to any external process, which will return a collection description if it itself validates. It then sees if there is a match in terms of OGC API data access mechanisms and supported formats, and makes sure that the returned collection descriptions are a match for the inputs they are used for, in terms of GeoDataClasses / schemas of the content. This all happens before any actual processing is done (at least for processes which can be localized to an ATRoI).

If no match is found anywhere within the workflow, the validation of the problematic hop(s) will immediately fail, causing the validation of the whole workflow to fail. If the workflow fails despite all the flexibility of using any supported formats and OGC API data access mechanisms to satisfy the request, there is nothing the user could do to make it work. The main reasons this would fail are that the servers are not compliant, or that the user picked incompatible processes and/or data sources. When using a workflow editor aware of Part 3 and certified implementations, this should be a very rare occurrence, due to resources being temporarily unavailable or the occasional bug to be filed and fixed.

Of course this is still quite theoretical at this point given the limited experimentation done with Part 3 so far, but I hope we can prove this year that it can, in fact, work like a charm at least most of the time! ;)

Thanks a lot for the deep dive into Part 3 and helping validate all this!

fmigneault commented 1 month ago

There are different ways to execute each process, and different views of what would be the most appropriate/efficient way to represent its outputs. What is considered best for one might be the worst for another. The question is not really whether it is ideal or not, but rather that there is sometimes a need to handle some atypical cases. When the process is "some unknown docker/script" that someone developed and just wants to run to test something out, sometimes the auto-resolution of the server is not what the user wants. Sometimes, the resolution expected by the user cannot be predicted correctly by the server.

In some situations, the reality of the workflow design procedure, such as the rapid development of AI to publish papers, makes it such that, if the user were told they need to redesign everything because it does not fit the auto-resolution pattern, they would simply give up and move on elsewhere, as they cannot be bothered or don't have the time. They might not care about portability or reuse; they just want the raw data out. This is why I'm pushing strongly to have options like response to "hint" a server toward what is expected.

One example I can think of is some processing workflow that needs STAC Item representations to extract specific assets. Because STAC is also OGC Features compliant, a remote collection could be resolved as either case. Depending on which server I send this workflow to, some implementations could prefer the STAC resolution, while others could prefer OGC Features. Neither is "wrong"; it's just a matter of preference/design for each server. Similarly, depending on their capabilities, any filter/properties specified could be handled either locally or remotely. If I try to run this workflow across servers to distribute the workload, I could end up with different behaviors depending on what the workflow relies on. Rather than defining server-specific workflows to behave as desired, I would much rather have the option to say response: raw|collection and format: geojson-stac-item (since application/geo+json is insufficient) to make my needs explicit. This kind of diverging behavior was reflected in OSPD, where I basically need 2 distinct "collection-selector" processes, because the interrogated catalogs do not respond the same way. Since the handling of collections and field modifiers becomes part of the execution request, rather than an explicit "collection-selector" sub-process in my workflow, I need to have a way to hint the resolution when it does not auto-resolve correctly.
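
A sketch of what such a hint could look like on a nested step (the URLs and input identifier are hypothetical, and the response/format values illustrate the kind of hints being proposed here rather than parameters currently defined by Part 3):

{
   "process": "https://server.com/processes/ExtractAssets",
   "inputs": {
     "items": {
       "collection": "https://server.com/collections/SomeCatalog",
       "filter": {"op": "lte", "args": [{"property": "eo:cloud_cover"}, 10]},
       "response": "raw",
       "format": "geojson-stac-item"
     }
   }
}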

In 99% of cases, I would expect the pre-filtering and auto-resolution to be the desired and most useful way to go. But for specific edge cases where it doesn't do what is desired, hacking workflows to make them "behave" is tedious, or requires introducing special logic in my generic processing engine to handle these uncommon cases.