stac-extensions / ml-model

An Item and Collection extension to describe machine learning (ML) models that operate on Earth observation data.

Add inference metadata [WIP] #3

Closed · duckontheweb closed this 2 years ago

duckontheweb commented 2 years ago

This PR is a work in progress to add metadata describing how a user would run the model to generate inferences.

For now, I've focused only on supporting Docker images for inferencing, since that simplifies the problem a bit. The extension defines an additional Asset role (ml-model:inferencing-image) that can be used to refer to a Docker image to be used to run the model to generate inferences. Assets with this role are then extended with additional properties that describe how you would run a container based on this image. Some of these fields are loosely based on the DLM Extension by @sfoucher.

To Do:

Open Questions:

cc: @ymoisan @HamedAlemo

m-mohr commented 2 years ago

I don't have much experience with ML or Docker, so I may not be the best reviewer here, but I'm wondering whether it would make sense to link to another file (a Dockerfile?) instead of putting all this filesystem/Docker-related stuff into STAC?

duckontheweb commented 2 years ago

> I don't have much experience with ML or Docker, so I may not be the best reviewer here, but I'm wondering whether it would make sense to link to another file (a Dockerfile?) instead of putting all this filesystem/Docker-related stuff into STAC?

I like the idea of using an existing format/spec instead of defining all of this in STAC. I don't think the Dockerfile will capture what we want here, because that would require users to build the image themselves. In that case they would need to have the Docker build context, build arguments, etc. in order to successfully build the image, and they would still need all of the metadata defined here in order to actually run the container.

Docker open-sourced the Docker Compose file format as the Compose spec, which should capture most of the information we need. Maybe having an inferencing Asset that points to a Compose file is preferable to the current setup (this also gets around the question of how to describe the media type of a Docker image). If we went this route, we probably still want to keep the ml-model:input-volume and ml-model:output-volume fields and have them refer to named data volumes so that users know where to put their input data and how to retrieve output data.
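
To make this concrete, here is a rough sketch of what such a Compose file might look like. Everything in it (the service name, image reference, command, and volume names) is a placeholder for illustration, not part of any proposal:

```yaml
# Hypothetical Compose file for running inference; every name here is a
# placeholder. The named volumes would be the ones referenced by the
# ml-model:input-volume and ml-model:output-volume fields.
services:
  inference:
    image: example.com/some-model:1.0.0          # placeholder image reference
    command: ["python", "run_inference.py",
              "--input", "/data/input",
              "--output", "/data/output"]
    volumes:
      - model-input:/data/input                  # users stage input data here
      - model-output:/data/output                # predictions land here

volumes:
  model-input:                                   # i.e. ml-model:input-volume
  model-output:                                  # i.e. ml-model:output-volume
```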

ymoisan commented 2 years ago

> I don't think the Dockerfile will capture what we want here, because that would require users to build the image themselves. In that case they would need to have the Docker build context, build arguments, etc.

Agreed. One thing I think we need from an operational point of view is actual images that can be run on demand. Also, I try not to manage Dockerfiles myself, but rather use tools like repo2docker (or s2i, which I think I'd prefer, since I don't need a notebook at the end of the process) to go from a repo to a Docker image. It's not perfect: among other things, there is apparently no way to specify an alternate base image to build on top of (currently a stock Ubuntu 18.04 or 20.04), so you need to hardcode e.g. an NVIDIA Ubuntu image for the specific CUDA/cuDNN version you use. But I'll go to great lengths just to avoid maintaining a Dockerfile.

**assets vs links**

My layman's assessment (please chime in, @m-mohr) would be that an asset is something that belongs to an item. The item at hand here is the model, so I would be tempted to reserve assets for model artifacts, like a .pth or ONNX file. Runtimes chosen by implementors are their choice, so I would use links to those runtimes (Docker images and, eventually, other types of runtimes like Singularity) that we may want to consider as "tested" or "example" runtimes.

**mime type**

The picture seems quite blurry to me at this point. It looks like what you've chosen is the current baseline, but I have not come across a formal IETF media type for a container image the way there is one for GeoPackages.

duckontheweb commented 2 years ago

> **assets vs links**
>
> My layman's assessment (please chime in, @m-mohr) would be that an asset is something that belongs to an item. The item at hand here is the model, so I would be tempted to reserve assets for model artifacts, like a .pth or ONNX file. Runtimes chosen by implementors are their choice, so I would use links to those runtimes (Docker images and, eventually, other types of runtimes like Singularity) that we may want to consider as "tested" or "example" runtimes.

Good points, thanks @ymoisan. The Asset Object definition describes assets as "data associated with the Item that can be downloaded or streamed", whereas the Link Object section describes links as expressing "a relationship with another entity." I tend to agree with you that a Docker image might be better represented as a Link, but if we decide to use Compose files to describe inferencing runtimes, those might be better represented as Assets. We can have multiple Assets for a given model, so it would be fine to have Assets for a Compose file, an ONNX file, and a .pth file all in the same Item. We would just want distinct roles for each of those (so maybe ml-model:inferencing is not specific enough in this case).

m-mohr commented 2 years ago

I haven't had much time yet to go through your posts in detail, but it sounds like you are on the right track.

In general, my only intention here was to say that if there's an external way to provide the required information in a somewhat standardized way, we should use that instead of defining our own "proprietary" fields for it. I'm not very fluent with Docker, so the Dockerfile example was obviously wrong, but it seems @duckontheweb has identified a potential alternative, which is nice.

The issue with media types (MIME types are a thing of the past, according to IANA) comes up more often than it should, and as such you may need to invent your own, unfortunately. We did the same with COGs. Alternatively, you could also tie a relation type (link) or role (asset) to a specific sort of file (e.g., say that rel "cat-content" should always link to an animated GIF showing a cat ;-) ).

duckontheweb commented 2 years ago

cc: @mpelchat04

guidorice commented 2 years ago

Docker Compose bind volumes work great and are super convenient (but support varies by host platform). On Linux, my limited experience is that filesystem ownership of a bind volume is a tricky subject, and it's not always obvious how to get the container and the host to agree on who owns the files in the volume mount. On Mac, however, it works pretty seamlessly. Not sure about Windows.
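
One workaround I've seen on Linux (just a sketch, assuming the image runs fine as a non-root user; the service and image names are placeholders) is to run the container as the invoking host user so files written to the bind mount keep your ownership:

```yaml
# Sketch: run the container as the invoking host user so that files
# written to the bind mount below stay owned by that user on Linux.
# Assumes UID/GID are exported in the shell first, e.g.:
#   export UID GID="$(id -g)"
services:
  inference:
    image: example.com/some-model:1.0.0   # placeholder
    user: "${UID}:${GID}"                 # match host file ownership
    volumes:
      - ./data:/data                      # bind mount shared with the host
```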

fmigneault commented 2 years ago

(For anyone not knowing who I am, I'm also a participant for DLM Extension development)

> I like the idea of using an existing format/spec instead of defining all of this in STAC.

> Runtimes chosen by implementors are their choice, so I would use links to those runtimes (Docker images and, eventually, other types of runtimes like Singularity)

> In general, my only intention here was to say that if there's an external way to provide the required information in a somewhat standardized way, we should use that instead of defining our own "proprietary" fields for it.

I agree with above points. STAC should definitely leave it up to other specifications to define the runtime environment to simplify definitions (let's not reinvent the wheel).

I would argue that Docker, Singularity, and Docker Compose definitions are still too generic to properly capture all parameters, namely input and output definitions. The example Compose file in the README provides volume mount points for input/output passed to a run script, but this assumes only those details are necessary. Some scripts could require more options, like a --cpu vs --gpu runtime flag, or an additional config file, etc. There are considerations such as argument position, prefixes, etc., and further runtime details such as devices, RAM, etc. that are important to include and are specific to each script/application. None of this can be inferred directly by reading the command/entrypoint field of a Compose YAML; see the sketch below.
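
To illustrate with a sketch (the service and image names are made up): the Compose spec can at least declare device and memory reservations, but the command stays an opaque string whose argument semantics a client cannot recover:

```yaml
services:
  inference:
    image: example.com/some-model:1.0.0   # placeholder
    # Nothing below tells a client which tokens are data inputs and which
    # are options; compare with the explicit inputBinding in the CWL example.
    command: ["./run.py", "--gpu", "--config", "/cfg/model.yaml", "/data/in"]
    deploy:
      resources:
        reservations:
          memory: 8G                      # RAM requirement
          devices:                        # GPU request (Compose spec)
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]
```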

I believe a more specific specification, such as Common Workflow Language (CWL) (or an equivalent), would be more appropriate. It allows more flexibility, as in the following definition (a complex anti-spoofing model under the hood):

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  DockerRequirement:
    dockerPull: registry.gitlab.com/crim.ca/patrimoines/anti-spoofing/anti-spoofing-e2e:0.1.0
arguments:
  - "--working-dir"
  - "/tmp/kaldi_intermediate_files"
  - "--out-path"
  - "$(runtime.outdir)/out.txt"
inputs:
  - type: File
    inputBinding:
      position: 1
      prefix: "--data-path"
    format: audio/wav
    id: audio_speech_file
outputs:
  - type: File
    outputBinding:
      glob: "$(runtime.outdir)/*.txt"
    format: text/plain
    id: scores
```
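
For reference, and assuming a local copy of this definition saved as tool.cwl plus a sample sample.wav (both hypothetical names), a CWL runner could execute it with something like `cwltool tool.cwl --audio_speech_file sample.wav`.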

Note that this way, requirements beyond simple data volumes can be guaranteed, such as a specific media type for a given input, as well as an explicit command definition.

I'm not saying STAC should use CWL directly, but similar constraints must be considered. Most definitely, STAC should allow some versatility in the reference specification, whether a Compose file or something else.

sfoucher commented 2 years ago

I think this is a good update; pointing to a docker-compose .yml file is a good approach and avoids having to insert too much information into the Item that is not relevant to a search. I still think we need a way to describe how to use the model properly and allow the user to verify that the model is working correctly, kind of like the equivalent of a checksum on an archive. We could maybe add optional links to an input sample and an output sample; the user can then verify that the instantiated model produces the same output.

duckontheweb commented 2 years ago

> I believe a more specific specification, such as Common Workflow Language (CWL) (or an equivalent), would be more appropriate. It allows more flexibility, as in the following definition (a complex anti-spoofing model under the hood). I'm not saying STAC should use CWL directly, but similar constraints must be considered. Most definitely, STAC should allow some versatility in the reference specification, whether a Compose file or something else.

I agree that CWL does seem like a better fit for what we are trying to describe here, although I have to admit I'm not as familiar with it as I am with Docker Compose files. Should we also consider Workflow Description Language (WDL) as an option (again, I'm not very familiar with this standard in practice)?

I think that having some versatility in the required format will make it more likely that model publishers will use this extension, but we also don't want to make the situation too complicated for someone building client software around this spec by requiring that they support many different runtime frameworks. Supporting one or two formats that are commonly used and meet our needs would be a good place to start.

fmigneault commented 2 years ago

> Workflow Description Language (WDL) as an option (again, I'm not very familiar with this standard in practice)?

Yes, this is another equivalent and viable solution.

duckontheweb commented 2 years ago

It seems like there is agreement that both Common Workflow Language (CWL) and Workflow Description Language (WDL) would meet our needs, and that Compose files may be able to meet the needs of at least some use cases.

In the interest of continuing to move this forward, I am going to merge this PR with support for Compose files, but I will also open separate PRs to add support for CWL and WDL so that we can add any best practices, examples, and/or additional metadata to support those formats.