stac-extensions / ml-model

An Item and Collection extension to describe machine learning (ML) models that operate on Earth observation data.

revise scope of the ml model extension to focus on model card and inference #15

Closed: rbavery closed this issue 9 months ago

rbavery commented 9 months ago

I'm starting a draft to refactor the extension to focus on making model inference easy, portable, and discoverable. This narrows the extension's scope from trying to serve inference, training, and retraining needs all at once. Hopefully the spec will be simple and concrete enough that many providers and users of GeoML models will adopt it and build tooling on it, making it easier to find and use models.

The big updates I think are needed:

  1. reduce the scope of the extension to describing and linking to inference requirements, plus a model card describing other model characteristics.
  2. add input and output signatures for different modeling tasks, so that a model can be run by inspecting the metadata alone and setting up minimal dependencies (torch, torchvision); see the signature sketch after this list.
  3. encourage the use of compiled model formats (torch.export, ONNX), with examples, to make GeoML models more portable; see the export sketch after the signature example.
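As a rough illustration of point 2, here is what machine-readable input/output signatures might look like. Every field name and value below is hypothetical, not part of any released spec:

```python
import torch

# Hypothetical input/output signature for a binary water-segmentation model.
# None of these field names are standardized; they illustrate the kind of
# metadata that would let a client run the model without reading its docs.
signature = {
    "ml-model:input": {
        "bands": ["B04", "B03", "B02", "B08"],  # Sentinel-2 red, green, blue, NIR
        "shape": [-1, 4, 256, 256],             # batch, channels, height, width (-1 = any)
        "dtype": "float32",
    },
    "ml-model:output": {
        "task": "semantic-segmentation",
        "shape": [-1, 2, 256, 256],
        "classes": [
            {"value": 0, "name": "background"},
            {"value": 1, "name": "water"},
        ],
    },
}

# From the signature alone, a client can allocate a correctly shaped input
# tensor and knows how to interpret the output, with torch as the only dependency.
spec = signature["ml-model:input"]
shape = [1 if d == -1 else d for d in spec["shape"]]
x = torch.zeros(shape, dtype=getattr(torch, spec["dtype"]))
print(x.shape)  # torch.Size([1, 4, 256, 256])
```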
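And for point 3, a minimal sketch of the compiled-format workflow, assuming torch, torchvision, and onnxruntime are installed; the model choice, file name, and shapes are arbitrary stand-ins:

```python
import numpy as np
import torch
import torchvision
import onnxruntime as ort

# Export a model once to a compiled, framework-neutral format. The .onnx file
# plus the signature metadata would be the distributable artifact linked from
# the STAC Item.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.zeros(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
)

# Inference then needs only onnxruntime, not the full training environment.
session = ort.InferenceSession("model.onnx")
logits = session.run(None, {"image": np.zeros((1, 3, 224, 224), dtype=np.float32)})[0]
print(logits.shape)  # (1, 1000)
```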

rationale:

I think the extension as it currently stands is difficult to build on top of because it doesn't improve workflows for using ML models (in either training or inference). The spec doesn't enforce standards for how model artifacts must be represented. Model creators might distribute the model runtime environment in many different ways (requirements.txt, setup.py, pyproject.toml, a Python package, conda, Docker, or no documentation at all), so trying to capture and enforce this info in the spec seems too complex and unlikely to serve a large enough user base. I think it would help more people if GeoML models had a detailed model card covering performance and limitations, and optionally pointed to a GitHub repo with further details on the training runtime environment.

For example, torchgeo captures model information, but its model weights are tied only to papers and source GitHub repositories. There is no quick description of which dataset each weight relates to, what format the model artifacts are in, or what model inputs and outputs the weights expect: https://torchgeo.readthedocs.io/en/stable/api/models.html#resnet
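To make the gap concrete, here is roughly what loading pretrained torchgeo weights looks like; this assumes torchgeo is installed, and the specific weight enum value comes from torchgeo's docs and may change:

```python
# Assumed torchgeo API: a torchvision-style weights enum plus a model factory.
from torchgeo.models import ResNet18_Weights, resnet18

weights = ResNet18_Weights.SENTINEL2_ALL_MOCO
model = resnet18(weights=weights)

# What you get: a small .meta dict (e.g. input channel count, paper/repo links).
# What you don't get: a machine-readable description of expected bands,
# normalization, output semantics, or artifact format -- the kind of metadata
# a discovery-focused extension could standardize.
print(weights.meta)
```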

I think the biggest "standards gap" is a metadata description for the discovery and distribution of models, aimed at model users rather than at model developers looking to train or retrain existing models.

rbavery commented 9 months ago

Closing this in favor of https://github.com/crim-ca/dlm-extension

https://github.com/stac-extensions/ml-model/issues/13#issuecomment-1841891875