Unpack "training" to account for different kinds of training roles

stac-extensions / ml-aoi

An Item and Collection extension to provide labeled training data for machine learning models.

Apache License 2.0


In the MLM, each model has a geospatial footprint. Right now it is pretty loosely defined what this footprint represents, so I expect it to be confusing to use for search and discovery of models.

For example, someone might want to search for models that have been validated/tested in their area of interest. Others might be interested in searching for models that have been trained in their area of interest.

This spec does a better job of describing the meaning of the AOI geographies that can be associated with models.

Could we update it to account for the different relation types a dataset can have with a model? I think these include:

pre-training - a training set used in self-supervised learning (e.g., a masked autoencoder), where labels are not needed
supervised fine-tuning - a training set with supervised labels
post-training/instruction-tuning - a training set used to shift model responses toward a certain style, format, safety requirement, etc. The goal is not to add new knowledge to the model but to improve its responses for a given input modality, such as natural language or dropped points (SAM)
validation
test
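
As a rough illustration only, the relation types listed above could be written down as an enumeration. The class name `DatasetRelation` and the string values are hypothetical; they mirror the list in this issue (treating validation and test as separate entries, which is an assumption) and are not defined by ML-AOI or MLM.

```python
from enum import Enum


class DatasetRelation(str, Enum):
    """Hypothetical dataset-to-model relation types.

    The names and values below only mirror the list in this issue.
    They are NOT part of the ML-AOI or MLM specifications.
    """

    PRE_TRAINING = "pre-training"
    SUPERVISED_FINETUNING = "supervised-finetuning"
    POST_TRAINING = "post-training"
    VALIDATION = "validation"
    TEST = "test"


# Parse a relation from its string form, e.g. a value read from metadata.
relation = DatasetRelation("pre-training")
all_relations = [r.value for r in DatasetRelation]
```

Keeping these values in a separate enumeration, rather than adding them to an existing one, leaves the current `ml-aoi:split` and `ml-aoi:role` values untouched.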

This isn't a near-term priority for me, but I think it will be important for describing more complex data-model relationships for some models like Clay or SAM2.

At first glance, it seems a new property similar to ml-aoi:split and ml-aoi:role could be used to describe "applied uses". I think the two existing properties serve specific goals, so I wouldn't extend their enums. I wonder whether such "applied uses" even make sense to define here, since a dataset cannot enforce a use for all models. It should instead be the model (i.e., MLM) that indicates how it applied a given ML-AOI collection.

That being said, I am not quite sure how search of MLM would be improved with ml-aoi in this case. MLM should definitely include links with ml-aoi:split + rel: derived_from to provide source datasets, but those cannot be searched or filtered in STAC API. They are only provided as additional metadata references. This points to a more fundamental requirement on core STAC: enabling search and filtering on links, which is not currently supported.
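
Since links cannot be filtered server-side, one workaround is to filter already-fetched items on the client. The following is a minimal sketch, assuming MLM items are plain dictionaries whose links carry `rel: derived_from` and an `ml-aoi:split` field as suggested above. The helper name `models_derived_from`, the item IDs, and the `example.com` URLs are all made up for illustration.

```python
from typing import Any, Dict, Iterable, List

Item = Dict[str, Any]


def models_derived_from(
    items: Iterable[Item], split: str, rel: str = "derived_from"
) -> List[Item]:
    """Return items that have at least one link with the given rel and split.

    Hypothetical helper: this filters on the client because link fields
    cannot be searched or filtered through STAC API.
    """
    matches = []
    for item in items:
        for link in item.get("links", []):
            if link.get("rel") == rel and link.get("ml-aoi:split") == split:
                matches.append(item)
                break  # one matching link is enough for this item
    return matches


# Made-up items; only the "links" structure matters for this sketch.
example_items = [
    {
        "id": "model-a",
        "links": [
            {
                "rel": "derived_from",
                "href": "https://example.com/collections/aoi-1",
                "ml-aoi:split": "train",
            }
        ],
    },
    {
        "id": "model-b",
        "links": [
            {
                "rel": "derived_from",
                "href": "https://example.com/collections/aoi-2",
                "ml-aoi:split": "test",
            }
        ],
    },
    {
        "id": "model-c",
        "links": [{"rel": "self", "href": "https://example.com/items/model-c"}],
    },
]

trained = models_derived_from(example_items, split="train")
```

This only works on items that have already been retrieved, so it does not replace a spatial search over the referenced datasets.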


Unpack "training" to account for different kinds of training roles #13