opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

Part 4: Clarify /req/job-management/definition-get-op #455

Open m-mohr opened 1 month ago

m-mohr commented 1 month ago

What is the definition meant to return? The process? (e.g. CWL or openEO UDP?)

We include that directly in GET /jobs/:id - I guess that's fine and we can add an additional endpoint, but maybe this is more an optional endpoint for cases where a definition can't be embedded? And it can be explored via a link in GET /jobs/:id?

What is POST /jobs/:id?

_Originally posted by @m-mohr in https://github.com/opengeospatial/ogcapi-processes/pull/437#discussion_r1789875331_

gfenoy commented 1 month ago

The idea was initially to return the definition used to instantiate the job ("status": "created") with additional metadata to describe the mode used (sync/async), for example. Still, I think it would be better to store it in the process attribute (see #450), which would then be added to the statusInfo.yaml schema, and we can remove this endpoint.

What do you think?

m-mohr commented 1 month ago

I agree. Default should be the process property. We can define an optional link relation type for processes that can't be represented in JSON, but without a specific endpoint.

fmigneault commented 1 month ago

Because jobs are submitted using a process: URI property, I wouldn't be surprised if some servers were already embedding that process URI in the job status response. Therefore, the property added to statusInfo.yaml should not only consider the embedded JSON representation, but also the direct reference (using oneOf).

m-mohr commented 1 month ago

OAP jobs might be, openEO no. There it's process: object. And what happens if you submit MOAW or CWL? That's not necessarily a URI either, right?

fmigneault commented 1 month ago

Exactly. This is why a separate endpoint (or the same using another Accept header) is proposed. This way, we don't need to depend on a specific embedding of process within the job (though I'm not against having it embedded if the server provides it). Just need to allow the flexibility of what process contains.

m-mohr commented 1 month ago

Yeah, that's what I proposed. That doesn't necessarily need a separate pre-defined endpoint though. Not sure whether we agree or disagree right now 😅

fmigneault commented 1 month ago

If process: "https://.../processes/{processId}" can be returned in the job status, and that endpoint supports the Accept header to negotiate any application/cwl+json, application/ogcapppkg+json, etc. handled by the server, then yes, GET /jobs/{jobId}/definition is redundant.

fmigneault commented 1 month ago

If the job contains a non-deployed workflow (such as OAP Part 3 Nested Processes), then there is no such thing as a reference "https://.../processes/{processId}", since the workflow is an ad-hoc definition.

In that case, it MUST be embedded in the job status as process: {ad-hoc workflow}.

So yeah, we agree if the JSON schema allows both variants.
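
To make the "both variants" idea concrete, here is a minimal Python sketch. It assumes the process property is either a string (a reference to a deployed process) or a JSON object (an embedded ad-hoc definition). The schema fragment and the classify_process helper are hypothetical illustrations and are not taken from statusInfo.yaml.

```python
# Hypothetical schema fragment for the "process" property, written as a Python
# dict. It only illustrates the oneOf idea and is NOT the actual OGC schema.
PROCESS_PROPERTY_SCHEMA = {
    "oneOf": [
        {"type": "string", "format": "uri-reference"},  # reference to a deployed process
        {"type": "object"},                             # embedded (ad-hoc) definition
    ]
}


def classify_process(value):
    """Return 'reference' for a URI string and 'embedded' for a JSON object."""
    if isinstance(value, str):
        return "reference"
    if isinstance(value, dict):
        return "embedded"
    raise TypeError(f"unsupported 'process' value: {type(value).__name__}")
```

A client would branch on the result: dereference the URI (possibly with an Accept header) in the first case, and read the object directly in the second.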

gfenoy commented 1 month ago

I had the following in mind (only illustrated for OGC API - Processes).

For the sample creation request below:

POST /jobs
--
Headers
--
Prefer: respond-async;return=representation
Content-Type: application/json
Content-Schema: https://raw.githubusercontent.com/opengeospatial/ogcapi-processes/refs/heads/master/openapi/schemas/processes-workflows/execute-workflows.yaml
--

{
  "process": "/processes/AA/execution",
  "inputs": {
    "stac_items": [
      "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/S2B_10TFK_20210713_0_L2A",
      "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/S2A_10TFK_20220524_0_L2A"
    ],
    "aoi": "-121.399,39.834,-120.74,40.472",
    "epsg": "EPSG:4326",
    "bands": [
      "green",
      "nir"
    ]
  }
}

The expected response would look like:

{
  "id": "af419f90-97ab-11ef-81ac-0e6063d70ef5",
  "type": "process",
  "processID": "AA",
  "created": "2024-10-31T17:15:10.912Z",
  "status": "created",
  "message": "ZOO-Kernel created your job",
  "process": {
      "preferences": "respond-async;return=representation",
      "schema": "https://raw.githubusercontent.com/opengeospatial/ogcapi-processes/refs/heads/master/openapi/schemas/processes-workflows/execute-workflows.yaml",
      "process": "/processes/AA/execution",
      "inputs": {
        "stac_items": [
          "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/XX",
          "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/YY"
        ],
        "aoi": "-121.399,39.834,-120.74,40.472",
        "epsg": "EPSG:4326",
        "bands": [
            "green",
            "nir"
        ]
      }
  },
  "links": [
    {
      "title": "Execute endpoint",
      "rel": "http://www.opengis.net/def/rel/ogc/1.0/execute",
      "type": "application/json",
      "href": "https://server/ogc-api/jobs/af419f90-97ab-11ef-81ac-0e6063d70ef5/results"
    },
    {
      "title": "Job Management endpoints",
      "rel": "http://www.opengis.net/def/rel/ogc/4.0/job-management",
      "type": "application/json",
      "href": "https://server/ogc-api/jobs/af419f90-97ab-11ef-81ac-0e6063d70ef5"
    }
  ]
}

We can remove the /jobs/{jobId}/definition endpoint, as we have the definition in the process object.

Also, rather than adding the header properties directly in the process object, it would probably be easier to use a dedicated headers object added to the process object. This object can then contain whatever sounds relevant to send to an execute endpoint to get it to behave as expected.

  "process": {
      "header": {
         "Content-Type": "application/json",
         "Content-Schema": "https://raw.githubusercontent.com/opengeospatial/ogcapi-processes/refs/heads/master/openapi/schemas/processes-workflows/execute-workflows.yaml",
         "Prefer": "respond-async;return=representation"
      },
      "process": "/processes/AA/execution",
      "inputs": {
        "stac_items": [
          "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/XX",
          "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/YY"
        ],
        "aoi": "-121.399,39.834,-120.74,40.472",
        "epsg": "EPSG:4326",
        "bands": [
            "green",
            "nir"
        ]
      }
  },

It looks very similar to what we used in "Table 43 — Parts of InputReference data structure" of WPS 1.0.0 (OGC 05-007r7), which gives the following definition for the <Header> node:

Extra HTTP request headers needed by the service identified in ../Reference/@href. For example, an HTTP SOAP request requires a SOAPAction header. This permits the creation of a complete and valid POST request.

At that time, we also used the <Body> node for embedding the request body.

Re-using the <Body> node concept would look like this:

  "process": {
      "header": {
         "Content-Type": "application/json",
         "Content-Schema": "https://raw.githubusercontent.com/opengeospatial/ogcapi-processes/refs/heads/master/openapi/schemas/processes-workflows/execute-workflows.yaml",
         "Prefer": "respond-async;return=representation"
      },
      "body": {
        "process": "/processes/AA/execution",
        "inputs": {
          "stac_items": [
            "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/XX",
            "https://earth-search.aws.element84.com/v0/collections/sentinel-s2-l2a-cogs/items/YY"
          ],
          "aoi": "-121.399,39.834,-120.74,40.472",
          "epsg": "EPSG:4326",
          "bands": [
              "green",
              "nir"
          ]
        }
      }
  },

This <Header> node was lost in WPS 2.0.
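
As a rough illustration of why storing both parts is useful, the sketch below rebuilds the original POST /jobs request from a stored object with header and body members. The build_execute_request name and the base URL are assumptions made for this example. No request is sent; the object is only constructed.

```python
import json
import urllib.request


def build_execute_request(base_url, stored):
    """Rebuild the POST /jobs request from a stored {"header": ..., "body": ...} object."""
    data = json.dumps(stored["body"]).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs",
        data=data,
        headers=dict(stored.get("header", {})),  # replay the original request headers
        method="POST",
    )


# Shortened version of the structure proposed above (inputs trimmed for brevity).
stored = {
    "header": {
        "Content-Type": "application/json",
        "Prefer": "respond-async;return=representation",
    },
    "body": {
        "process": "/processes/AA/execution",
        "inputs": {"epsg": "EPSG:4326", "bands": ["green", "nir"]},
    },
}

# "https://server/ogc-api" is a placeholder base URL.
request = build_execute_request("https://server/ogc-api", stored)
```

Passing the resulting object to urllib.request.urlopen would resubmit the job, which is the "complete and valid POST request" the WPS 1.0.0 definition describes.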

fmigneault commented 1 month ago

I wouldn't include the schema, preferences, etc. headers under process. This is what the /jobs/{jobId}/inputs should return under headers (https://github.com/opengeospatial/ogcapi-processes/blob/master/openapi/schemas/processes-job-management/inputs.yaml) [note: headers was not added to this schema, but it was suggested previously].

This example also illustrates why process is somewhat ambiguous. In this case, it is not really a "process" (description) per se, but job-execution content. I believe the process should point to an actual process (i.e., the nested process URI in that case). Using a field named process that could contain a mix of process-description or job-execution content makes it very hard to interpret.

Embedding job-execution content under process is also very ugly if the contents are not JSON.

m-mohr commented 1 month ago

I don't really understand the issue here. My proposal was:

We have an optional "process" property of type object in the job description object and whenever it's a json object you can embed it.

It's optional though and if the process is NOT an object, you just add a link to the job description object, like:

  "links": [
    {
      "title": "Process definition",
      "rel": "process-definition", # just an example, we can change that to another rel type
      "type": "application/yaml",
      "href": "./my-process.cwl" # could be at /jobs/:id/definition, but also somewhere else in principle, e.g. /processes/:id or so - We don't need to pre-define an endpoint, just follow the link
    },
    ...

I think that should be able to capture all cases. I haven't seen any reason yet why this wouldn't work. Any thoughts?
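
A minimal Python sketch of how a client could handle both cases. It assumes the placeholder link relation "process-definition" from the snippet above, and the helper name locate_process_definition is made up for illustration.

```python
def locate_process_definition(job):
    """Return ('embedded', obj), ('link', href) or (None, None) for a job description."""
    process = job.get("process")
    if isinstance(process, dict):
        # JSON-representable process: it is embedded directly in the job description.
        return ("embedded", process)
    for link in job.get("links", []):
        # "process-definition" is only the example rel type used in this thread.
        if link.get("rel") == "process-definition":
            return ("link", link["href"])
    return (None, None)
```

With this approach the client never hard-codes an endpoint such as /jobs/:id/definition. It simply follows whatever href the server advertises.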

fmigneault commented 4 weeks ago

The link is fine.

I'm not really fond of process containing something that is not an OGC API - Process description or a URI pointing to one. It is very confusing when the same word refers to different kinds of content within the same API. Even in the context of openEO, shouldn't it be a process_graph or similar, and not just "process"?

m-mohr commented 4 weeks ago

No, process graphs alone are worthless; the process has additional metadata that may be needed in addition to the graph. We can't prevent the process from being something different.

fmigneault commented 4 weeks ago

So, if I understand correctly, openEO's definition is something along the lines of "process-graph + configs = process" ? In OAP, we have "process + inputs/headers -> job definition".

If my interpretation is correct, I can understand openEO's use of process, but this is an important clash in terminology for OGC API - Processes. If we use process to describe something that is not a "process" reference in the typical way it is used to submit jobs, we create confusion in the standard and understanding of the responses.

m-mohr commented 3 weeks ago

No, it's more the execution graph (which includes inputs/input references) and the process metadata. They are one unit, similar to CWL, I think. The thing is, in openEO, processes that a server defines and processes that a user defines share the same schema and as such are both processes. Processes are pretty self-contained, i.e. there are no separate inputs/headers, although there might be other related entities such as jobs, which have additional "config" such as title, plan and environment config (e.g. memory, cpu). Not 100% sure what you mean by config.

Example: The server provides (pre-defined) processes add, divide, multiply and subtract. A user chains these into a custom process called NDVI and submits it as a (user-defined) process for execution. API docs: https://openeo.org/documentation/1.0/developers/api/reference.html#section/Processes Example from the Python perspective: https://open-eo.github.io/openeo-python-client/udp.html

fmigneault commented 3 weeks ago

The thing is, in openEO, processes that a server defines and processes that a user defines share the same schema and as such are both processes.

That's good. It is the same in OAP.

Not 100% sure what you mean by config.

I meant exactly what you mentioned, such as the job title/plan/environment that slightly affects the process. The process itself is mostly agnostic to this "config", but could be affected by it (e.g., the number of CPUs allocated will impact processing speed, or maybe parallelization).


All in all, to my understanding, OAP and openEO both have similar behavior. Some form of "execution graph" is populated by actual input references (submitted by the user) and relevant processes. Therefore, this is exactly why I feel that a job using a process field containing that information is misleading (IF it contains the job input values), since this is not a "process" per se (neither in OAP nor openEO), but the entire "execution graph" that employs specific inputs and one or more processes, whatever those processes embed (server-defined, CWL, a docker, etc.).

CWL does NOT correspond to that "execution graph" either. It is a workflow definition (how the inputs/outputs should be chained), but the effective inputs submitted with the job input values are not yet specified at that point. Therefore, embedding the CWL in process WOULD be a process representation. If the "user-defined" openEO process corresponds to this as well (without job input values), then we agree on the process contents. This is not respected in the case of Part 3 Nested Processes, which do include the job input values.

m-mohr commented 3 weeks ago

Somewhat, but it's not quite as in OAP. In openEO the input values are part of the process execution graph; there are no separate input values which you could submit. And if your process has parameters, you need to encode them in another process where the inputs again are part of the process execution graph. ;-) Might be a bit confusing for you without a concrete example, I guess?

The config is not part of that. That's part of the job, e.g.

Job:

Process (this is one atomic unit and shall never be split into pieces):

If you define something like the following, you can only store it as UDP, not execute it as job.

Process (this is one atomic unit and shall never be split into pieces):

To execute it, you'd again have something like

(Disclaimer: Simplified example)

fmigneault commented 3 weeks ago

These descriptions are clear.

What I'm still not sure about is this: inside the /jobs/{jobId} response, which one of these is going to be contained in process?

If it is similar to:

{
   "id": "{jobId}",
   "status": "running",
   "process": { 
     "my_other_process": {
       "process_id": "my_process",
       "arguments": { "X": -5 },
       "result": true
     }
   }
}

This is what I find "misleading", since process would contain an execution "graph" (combining my_process with its specific X=-5 input). So, why not simply call the field graph and avoid the confusion with the overused term process?

If, instead, process in the job status response contains only the UDP my_process description with its absolute(add(X, 2)) definition, parameters: X, etc., then I have no issue with using process. It is equivalent to having a deployed process that would contain that definition, and referring to it by URI.

m-mohr commented 3 weeks ago

It would usually contain absolute(add(5, 2)). parameters in openEO just contains the schemas; the values are part of the graph, always fully resolved for the job.

Example from the API spec: https://api.openeo.org/#tag/Data-Processing/operation/describe-job
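
To illustrate what "fully resolved" means here, the following Python sketch substitutes a parameter value into a toy nested expression for absolute(add(X, 2)). The nested op/args layout is a simplification invented for this example and is not the real openEO process graph encoding. The value X = -5 is taken from the JSON example earlier in the thread.

```python
def resolve(node, arguments):
    """Replace {"from_parameter": name} placeholders with concrete argument values."""
    if isinstance(node, dict):
        if set(node) == {"from_parameter"}:
            return arguments[node["from_parameter"]]
        return {key: resolve(value, arguments) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, arguments) for item in node]
    return node  # literals (numbers, strings) are returned unchanged


# Toy form of the user-defined process: absolute(add(X, 2)) with parameter X.
template = {
    "op": "absolute",
    "args": [{"op": "add", "args": [{"from_parameter": "X"}, 2]}],
}

# What a job would carry: the same graph with X already replaced by its value.
resolved = resolve(template, {"X": -5})
```

The template corresponds to what can be stored as a UDP, while resolved corresponds to what the job would hold, with no parameter placeholders left.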

Might be a good idea to hop on a short call to clarify this in full detail. I feel like it's much more difficult to get to the details in text compared to going through some examples in a screenshare...