opengeospatial / ogcapi-processes

https://ogcapi.ogc.org/processes

EO Application packages in openEO #428

Open m-mohr opened 1 month ago

m-mohr commented 1 month ago

We are looking into ways to integrate EO Application packages (CWL) into openEO. Our latest ideas are formulated here: https://github.com/Open-EO/openeo-processes/issues/507#issuecomment-2236310553

To be as close as possible to OGC API - Processes and the OGC Best Practice for Earth Observation Application Package, we have a couple of questions:

  1. In a CWL file from OSPD that @fmigneault provided, I could see types and formats being defined for inputs and outputs, for example type: File[] and format ogc:geotiff.
    • Is there a list of these formats?
    • Is there a format pre-defined for STAC? If not, can we define one?
    • Would the type better be defined as a File, File[] or Directory?
  2. Is there a pre-defined link relation type to link to an OGC API - Processes API (i.e. its landing page)?
  3. If an EO Application package is deployed through Part 2, is there a way to tell from the OGC API Process Description whether a process originates from a previously deployed CWL or not?
    • Is there any substantial difference between a pre-deployed CWL and a "regular" OGC API - Processes from a user point of view that works with the OGC API?
  4. Is there a recommended way to name Application Packages? What's the agreed / commonly used terminology? Should we call a UDF runtime in openEO better EOAP, CWL, ...? It would sit on the same level as e.g. a Python or R UDF runtime...

PS: Sorry, this might not be the right place to post this issue, but I'm also not sure which place would be better suited.

fmigneault commented 1 month ago

Is there a list of these formats?

The format field uses a $namespace reference pointing to some naming authority. Typically, this will be one of

Therefore, they resolve to some media-type identifier. However, they can be used for other references as well in CWL, not exclusively media-types.
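
To illustrate how such a prefixed format resolves, here is a minimal Python sketch that expands `prefix:name` with a `$namespaces`-style mapping. The helper `expand_format` is hypothetical, and the two namespace URIs are only ones commonly seen in CWL documents (an assumption, not necessarily the list meant above).

```python
# Hypothetical helper: expand a prefixed CWL `format` value into a full URI.
# The namespace URIs are assumptions for illustration, not an authoritative list.
NAMESPACES = {
    "iana": "https://www.iana.org/assignments/media-types/",
    "edam": "http://edamontology.org/",
}


def expand_format(fmt: str, namespaces: dict[str, str]) -> str:
    """Return the full URI for a `prefix:name` format, or the value unchanged."""
    prefix, sep, name = fmt.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + name
    # Unknown prefix (or already a full URI): leave it untouched.
    return fmt


print(expand_format("iana:application/geo+json", NAMESPACES))
```

An unknown prefix is returned as is, so a value that is already a full URI passes through unchanged.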

Is there a format pre-defined for STAC? If not, can we define one?

I don't think there is one (haven't looked thoroughly though), but if there isn't, yes we should. In the meantime, the iana:application/geo+json is still valid for STAC Item. If a more specific STAC media-type is defined for https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#media-type-for-stac-item, it could be integrated as well.

Would the type better be defined as a File, File[] or Directory?

I think File makes more sense if referring to a single URI. For example, requesting <stac-api>/collections, <stac-api>/collections/{col} or <stac-api>/collections/{col}/items all return a "single JSON document".

The File[] can make sense if the process takes multiple <stac-api>/collections/{col}/items/{item} references. That could be one way to handle collection: <stac-api>/collections/{col} as input, converted into the corresponding STAC Items array it contains.

For the same reason as above, Directory could work also to resolve a collection. Directory implies a nested structure that will be recursively crawled by CWL, so that could be one way to handle the "local catalog" STAC. However, I have found that Directory often lacks metadata, because it doesn't describe very well what the children should be (they can be "Any" File/Directory).
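
As a rough sketch of the File vs File[] distinction, the Python below builds a CWL job input value from either one STAC URI or a list of item URIs. The function name `stac_job_input` and the `stac.example.com` URLs are made up for illustration; only the `class: File` / `location` shape is standard CWL.

```python
def stac_job_input(refs):
    """Build a CWL job input value from one STAC URI or a list of them.

    A single string maps to a CWL File; a list maps to File[].
    """
    def as_file(uri):
        return {"class": "File", "location": uri}

    if isinstance(refs, str):
        return as_file(refs)
    return [as_file(uri) for uri in refs]


# Single document (e.g. a collection) -> File
single = stac_job_input("https://stac.example.com/collections/col")
# Several items -> File[]
many = stac_job_input([
    "https://stac.example.com/collections/col/items/a",
    "https://stac.example.com/collections/col/items/b",
])
```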

Is there a pre-defined link relation type to link to an OGC API - Processes API (i.e. its landing page)?

Not for the landing page. These are available: https://github.com/opengeospatial/ogcapi-processes/blob/master/core/sections/clause_5_conventions.adoc#link-relations

If an EO Application package is deployed through Part 2, is there a way to tell from the OGC API Process Description whether a process originates from a previously deployed CWL or not?

Not explicitly. There is deploymentProfile: http://www.opengis.net/profiles/eoc/dockerizedApplication (sometimes deploymentProfileName instead...), but it only covers the Docker case (https://github.com/opengeospatial/ogcapi-processes/blob/master/extensions/deploy_replace_undeploy/standard/requirements/ogcapppkg/REQ_profile-docker.adoc).

I personally use these additional ones (unofficial):

However, in my case, those are all derived CWL definitions.

I believe that better types should be defined/added. Since CWL can be a docker, workflow, etc. as well, I don't think http://www.opengis.net/profiles/eoc/cwl would be sufficient either. Maybe http://www.opengis.net/profiles/eoc/cwl+docker or similar might be required.

An alternative way to find the Application Package type would be with the content media-type returned by /processes/{processId}/package (https://github.com/opengeospatial/ogcapi-processes/blob/master/extensions/deploy_replace_undeploy/standard/requirements/cwl/package/REQ_response-body.adoc). This is fairly new, so not implemented by many yet.
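
A client could act on that media-type roughly as follows. This is only a sketch: `package_kind` is a hypothetical helper, and it classifies the `Content-Type` header value a server would return for /processes/{processId}/package using the media-types mentioned in this thread.

```python
CWL_MEDIA_TYPES = {"application/cwl", "application/cwl+json", "application/cwl+yaml"}
OGCAPPPKG_MEDIA_TYPE = "application/ogcapppkg+json"


def package_kind(content_type: str) -> str:
    """Classify an Application Package from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in CWL_MEDIA_TYPES:
        return "cwl"
    if media_type == OGCAPPPKG_MEDIA_TYPE:
        return "ogcapppkg"
    return "unknown"


print(package_kind("application/cwl+yaml; charset=utf-8"))
```

Servers that do not implement the package endpoint yet would simply never reach this code path, so a client still needs a fallback.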

Is there any substantial difference between a pre-deployed CWL and a "regular" OGC API - Processes from a user point of view that works with the OGC API?

Once deployed, the user should technically never need to look at the Application Package. All interactions should be through OAP, and there should be no distinction between a Core or a dynamically deployed process.

Is there a recommended way to name Application Packages? What's the agreed / commonly used terminology? Should we call a UDF runtime in openEO better EOAP, CWL, ...? It would sit on the same level as e.g. a Python or R UDF runtime...

Not sure if I can answer completely regarding openEO, but there are at least 2 important mentions regarding naming:

  1. "OGC Application Packages" has its own schema and media-type application/ogcapppkg+json. It is something on its own. It can be submitted with various by-value/reference combinations (https://github.com/opengeospatial/ogcapi-processes/blob/master/openapi/schemas/processes-dru/ogcapppkg.yaml), but the important "schema" of what it looks like is: https://github.com/opengeospatial/ogcapi-processes/blob/master/openapi/schemas/processes-dru/executionUnit.yaml

    This must be distinguished from generic "Application Package", which is anything that can be deployed/translated into an OGC API - Processes: Core, including application/cwl, application/cwl+json, application/cwl+yaml and any other alternate "package/runtime" representation (such as openEO also for that matter).

  2. A CWL can embed a Python, R, JavaScript, etc. runtime as CommandLineTool. I think it is up to openEO to decide how to distinguish a "native" UDF runtime vs one wrapped in CWL, but it is something that should be indicated "somewhere/somehow" to avoid ambiguity.

m-mohr commented 1 month ago

Thank you, Francis.

I don't think there is one (haven't looked thoroughly though), but if there isn't, yes we should.

So could we define a namespace for STAC (URL tbd) and just create:

If required, they could resolve to the mentioned media types, but they are pretty ambiguous so maybe it's better to not do it? It's unlikely that we'll add specific media types for STAC anytime soon. Just providing application/json in case you expect a STAC collection is not very helpful in the schema.

I think File makes more sense if referring to a single URI.

Agreed.

For the same reason as above, Directory could work also to resolve a collection.

A collection is also a file. If you specify just a folder, the entry point is not clear. So I agree that Directory is not ideal and File is preferable.

Not for the landing page. These are available: https://github.com/opengeospatial/ogcapi-processes/blob/master/core/sections/clause_5_conventions.adoc#link-relations

Could we define one? Or should we define our own? As we need to read the conformance classes it's not enough to link to the processes.

Once deployed, the user should technically never need to look at the Application Package. All interactions should be through OAP, and there should be no distinction between a Core or a dynamically deployed process.

So do I assume correctly that there's something in place in servers that converts a CWL definition to an OGC API - Process Definition? They are somewhat different, aren't they?

  1. "OGC Application Packages" (https://github.com/opengeospatial/ogcapi-processes/blob/master/extensions/deploy_replace_undeploy/standard/requirements/cwl/package/REQ_response-body.adoc#rc_ogcapppkg) has its own schema and media-type application/ogcapppkg+json. It is something on its own.

[...]

  2. A CWL can embed a Python, R, JavaScript, etc. runtime as CommandLineTool.

I need to dig into this more; I have to learn more about how to distinguish them and how they differ. But is pure CWL that is not an OGC Application Package really relevant here? Would it be valid to assume that every CWL that you deploy via OGC API - Processes should be an OGC Application Package?

I think it is up to openEO to decide how to distinguish a "native" UDF runtime vs one wrapped in CWL, but it is something that should be indicated "somewhere/somehow" to avoid ambiguity.

A UDF runtime wouldn't really be wrapped in CWL; CWL is a runtime on its own. Each runtime has an id which the user needs to choose from, so I guess there's no issue with ambiguity.

fmigneault commented 1 month ago

define a namespace for STAC

Yes. That would be great! I also suggest adding those definitions (stac-item, stac-collection, etc.) to https://github.com/opengeospatial/ogcapi-processes/issues/395#issuecomment-2243014072.

This way, an OGC input

input:
  schema:
    type: string
    format: stac-item
    contentMediaType: application/geo+json
    contentSchema: "https://geojson.org/schema/GeoJSON.json"

could easily be mapped with CWL as:

input:
  type: File
  format: "stac:item"
$namespaces:
  stac: "https://<STAC-URI-TBD>/"
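
For what it's worth, the mapping between the two snippets above could be sketched in Python like this. `oap_input_to_cwl` and the lookup table are hypothetical, the sketch only handles the file-reference case shown here, and the `stac` namespace URI is still to be decided.

```python
# Assumed lookup from an OAP `format` value to a prefixed CWL format.
# Only the pair from the snippets above is listed.
OAP_TO_CWL_FORMAT = {"stac-item": "stac:item"}


def oap_input_to_cwl(oap_input: dict) -> dict:
    """Map an OAP input description to a CWL File input (file-reference case only)."""
    schema = oap_input.get("schema", {})
    cwl_input = {"type": "File"}
    cwl_format = OAP_TO_CWL_FORMAT.get(schema.get("format"))
    if cwl_format is not None:
        cwl_input["format"] = cwl_format
    return cwl_input


oap_input = {
    "schema": {
        "type": "string",
        "format": "stac-item",
        "contentMediaType": "application/geo+json",
        "contentSchema": "https://geojson.org/schema/GeoJSON.json",
    }
}
print(oap_input_to_cwl(oap_input))
```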

[... landing page ...] Could we define one?

Yes. Maybe open a separate issue? There is a corresponding concept in Part 3 (Landing Page Response), so it should be a relevant addition.

there's something in place in servers that converts a CWL definition to an OGC API - Process Definition

In CRIM's implementation, yes. OAP I/O are converted to/from CWL. This is done here: https://github.com/crim-ca/weaver/blob/master/weaver/processes/convert.py

That conversion applies especially in the case where application/cwl+json is directly sent to POST /processes. However, an implementation could also use the application/ogcapppkg+json with executionUnit construct, and explicitly provide the OAP I/O to avoid the need to convert. After that, whether the CWL definition "aligns" with the OAP I/O schema depends on how the transition is done in the backend.

But is pure CWL that is not an OGC Application Package really relevant here? Would it be valid to assume that every CWL that you deploy via OGC API - Processes should be an OGC Application Package?

I'm not sure if I got 100% the question, but I think "yes"? (what would be the pure CWL/non OGC AppPkg?)

In my case (at least), whether I deploy a CWL class: CommandLineTool or class: Workflow, it results in an OGC API Process with the Application Package being that CWL. Whether it was an "atomic" operation (eg: some docker process, python script, etc.) or a complicated workflow does not matter. The result is the single OGC API Process containing it. The only tweak I add to class: Workflow is that, if any step indicates run: SomeProcess, it must exist beforehand. However, this is done only to allow reusing SomeProcess in other workflows without having to redefine it each time. It is not a requirement, and the CWL Workflow could contain all of its nested step processing as a whole / on its own.
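
The "must exist beforehand" check on workflow steps could look roughly like the Python below. This is a sketch under my own assumptions, not the actual server code: `missing_step_processes` is a hypothetical name, and `deployed` stands for whatever set of process ids the server already knows.

```python
def missing_step_processes(workflow: dict, deployed: set[str]) -> list[str]:
    """List `run` references of a CWL Workflow that are not deployed yet.

    Only string references are checked; steps that embed their tool inline
    (a mapping under `run`) are self-contained and skipped.
    """
    steps = workflow.get("steps", {})
    # CWL allows `steps` as a mapping keyed by step id or as a list of steps.
    step_defs = steps.values() if isinstance(steps, dict) else steps
    missing = []
    for step in step_defs:
        run = step.get("run")
        if isinstance(run, str) and run not in deployed:
            missing.append(run)
    return missing


workflow = {
    "class": "Workflow",
    "steps": {
        "first": {"run": "SomeProcess"},
        "second": {"run": {"class": "CommandLineTool", "baseCommand": "echo"}},
    },
}
print(missing_step_processes(workflow, deployed={"OtherProcess"}))
```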

m-mohr commented 1 month ago

Does the $namespace need to resolve to something specific? Otherwise, we could probably just use https://stacspec.org as URI.

And then define the following subtypes:

Would a union type also make sense, i.e. you can read any STAC input?

I don't quite follow your example. Is that a URL style input or is it providing the STAC inline as JSON string?

Yes. Maybe open a separate issue?

Done: #433

That conversion applies especially in the case where application/cwl+json is directly sent to POST /processes. However, an implementation could also use the application/ogcapppkg+json with executionUnit construct, and explicitly provide the OAP I/O to avoid the need to convert. After that, whether the CWL definition "aligns" with the OAP I/O schema depends on how the transition is done in the backend.

That sounds rather bad for interoperability. Then all clients need to implement the conversion, too....

I'm not sure if I got 100% the question, but I think "yes"? (what would be the pure CWL/non OGC AppPkg?)

Not sure, that's why I'm asking. :-) Can I like google any CWL file made by a random person that is not even aware of OGC or EOAP and deploy it via OGC API - Processes? Is that relevant? Or should we restrict to EOAP?

fmigneault commented 1 month ago

I don't think it is mandatory to resolve, but preferably it does. It can point to any ontology or definition, as long as it doesn't change too much over time. It helps if the format resolves to something tangible, like a definition document, when expanded to its full-form URI.

For the union, usually that would be a type listing all options. For example:

inputs:
  input:
    type: File
    format: 
      - stac:item
      - stac:collection
      - stac:catalog

However, be aware that only the input format can be a list. The output must declare a single explicit format. CWL's logic is that the output of a workflow should be well-defined, so mismatching types are not allowed at execution (eg: process1 cannot suddenly produce a type unsupported by process2 if it was pre-validated for a specific output format). See https://github.com/common-workflow-language/common-workflow-language/issues/901

In that case, stac:any could be a way to handle it. However, is there any advantage of doing so over iana:application/json? CWL's format intention is to be explicit (as if the media-type + JSON-schema were both provided), so it slightly loses its meaning if "any" definitions start to be defined.
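
A quick pre-check for that rule might look like this in Python. It is a sketch only: `outputs_with_format_list` is a hypothetical helper and it assumes the mapping form of `outputs` (keyed by output name), not the list form.

```python
def outputs_with_format_list(cwl: dict) -> list[str]:
    """Return names of outputs whose `format` is a list instead of a single value."""
    offenders = []
    # Assumes `outputs` is a mapping keyed by output name.
    for name, definition in cwl.get("outputs", {}).items():
        if isinstance(definition, dict) and isinstance(definition.get("format"), list):
            offenders.append(name)
    return offenders


tool = {
    "class": "CommandLineTool",
    "inputs": {"input": {"type": "File", "format": ["stac:item", "stac:collection"]}},
    "outputs": {
        "good": {"type": "File", "format": "stac:item"},
        "bad": {"type": "File", "format": ["stac:item", "stac:catalog"]},
    },
}
print(outputs_with_format_list(tool))
```

The list on the input is left alone on purpose, since that side is allowed.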

That sounds rather bad for interoperability. Then all clients need to implement the conversion.

Conversion of the schema definition of the I/O is not required if using the application/ogcapppkg+json. In that case, you can provide the OAP I/O schema the usual way in processDescription, and the CWL document as is in executionUnit, and just make sure the I/O names match for runtime. During execution, the literal values are passed directly, while File/Directory types are placed in some supported storage and referenced by URI to the CWL runner.
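
The "make sure the I/O names match" part is easy to verify before deploying. Here is a minimal Python sketch, assuming both documents use the mapping form keyed by I/O name; `io_name_mismatches` is a hypothetical helper and takes the two documents separately, so it says nothing about how they are wrapped in the deploy body.

```python
def io_name_mismatches(process_description: dict, cwl: dict) -> dict:
    """Compare I/O names declared in an OAP process description and a CWL document.

    Returns, per section, the names present on only one side.
    """
    result = {}
    for key in ("inputs", "outputs"):
        oap_names = set(process_description.get(key, {}))
        cwl_names = set(cwl.get(key, {}))
        # Symmetric difference: declared in one document but not the other.
        result[key] = sorted(oap_names ^ cwl_names)
    return result


process_description = {
    "inputs": {"input": {"schema": {"type": "string", "format": "stac-item"}}},
    "outputs": {"result": {"schema": {"type": "string"}}},
}
cwl = {
    "class": "CommandLineTool",
    "inputs": {"input": {"type": "File"}},
    "outputs": {"output": {"type": "File"}},
}
print(io_name_mismatches(process_description, cwl))
```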

Can I like google any CWL file made by a random person that is not even aware of OGC or EOAP and deploy it via OGC API - Processes? Is that relevant? Or should we restrict to EOAP?

Yes, you can deploy any CWL. Whether you allow certain capabilities (eg: specific combinations of CWL class, requirements and hints definitions) is up to your server.
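
As an example of such a server-side policy, a deploy handler could screen the CWL with an allow-list before accepting it. The Python below is a sketch only: the allowed sets are arbitrary example choices, not a recommendation, and `rejected_features` is a hypothetical helper.

```python
# Example policy only; each server would pick its own sets.
ALLOWED_CLASSES = {"CommandLineTool", "Workflow"}
ALLOWED_REQUIREMENTS = {"DockerRequirement", "ResourceRequirement"}


def rejected_features(cwl: dict) -> list[str]:
    """List the parts of a CWL document that the example policy does not allow."""
    problems = []
    if cwl.get("class") not in ALLOWED_CLASSES:
        problems.append(f"class:{cwl.get('class')}")
    for section in ("requirements", "hints"):
        entries = cwl.get(section, {})
        # Either a mapping keyed by requirement class, or a list of {"class": ...}.
        names = entries if isinstance(entries, dict) else [e.get("class") for e in entries]
        for name in names:
            if name not in ALLOWED_REQUIREMENTS:
                problems.append(f"{section}:{name}")
    return problems


candidate = {
    "class": "ExpressionTool",
    "requirements": {"InlineJavascriptRequirement": {}},
    "hints": [{"class": "DockerRequirement"}],
}
print(rejected_features(candidate))
```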