Open rbavery opened 3 days ago
At first glance, it seems a new property similar to `ml-aoi:split` and `ml-aoi:role` could be used to describe "applied uses". I think the two properties already available serve specific goals, so I wouldn't extend their enums. I wonder if such "applied uses" even make sense to be defined here, since a dataset cannot enforce a use for all models. It should instead be the model (i.e., MLM) that indicates how it applied a given ML-AOI collection.
That being said, I am not quite sure how search of MLM would be improved with `ml-aoi` in this case. MLM should definitely include links with `ml-aoi:split` + `rel: derived_from` to provide source datasets, but those cannot be searched or filtered in STAC API. They are just provided as additional metadata references. This points to a more fundamental requirement in core STAC: search and filtering of links, which is not supported.
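To make the limitation concrete, here is a minimal sketch of what such links could look like on an MLM item, and how a client would have to filter them itself after fetching the item. The item ID, the URLs, and the `source_datasets` helper are made up for illustration; only `rel: derived_from` and `ml-aoi:split` come from the discussion above.

```python
from typing import Any

# Hypothetical MLM item, trimmed to the fields relevant here.
# The hrefs and the item id are placeholders, not real resources.
mlm_item: dict[str, Any] = {
    "type": "Feature",
    "id": "example-model",
    "links": [
        {
            "rel": "derived_from",
            "href": "https://example.com/collections/train-aoi",
            "ml-aoi:split": "train",
        },
        {
            "rel": "derived_from",
            "href": "https://example.com/collections/test-aoi",
            "ml-aoi:split": "test",
        },
        {"rel": "self", "href": "https://example.com/items/example-model"},
    ],
}


def source_datasets(item: dict[str, Any], split: str) -> list[str]:
    """Return hrefs of derived_from links tagged with the given split.

    This runs client-side on an already-fetched item, because links
    cannot be used as search or filter criteria in the API itself.
    """
    return [
        link["href"]
        for link in item.get("links", [])
        if link.get("rel") == "derived_from" and link.get("ml-aoi:split") == split
    ]


print(source_datasets(mlm_item, "train"))
```

The filtering happens only after the item has been retrieved, which is exactly why this does not help discovery: a user cannot ask the API for "models with a test dataset in my area" up front.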
In the MLM, each model has a geospatial footprint. Right now it is pretty loose what this represents, so I expect it to be confusing to use for search and discovery of models.
For example, someone might want to search for models that have been validated/tested in their area of interest. Others might be interested in searching for models that have been trained in their area of interest.
This spec does a better job at describing the meaning of AOI geographies that can be associated with models.
Could we update it to account for the different relation types a dataset can have with a model? I think these include:
- pre-training: a training set that is used in self-supervised learning (e.g., a masked autoencoder), where labels are not needed
- supervised finetuning: a training set with supervised labels
- post-training/instruction-tuning: a training set used to shift model responses toward a certain style, format, safety requirements, etc. The goal is not to add new knowledge to the model but to improve its responses given a certain input modality, like natural language or dropping points (SAM)
- validation
- test
This isn't a near-term priority for me, but I think it will be important for describing more complex data-model relationships for some models like Clay or SAM2.
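As a rough sketch of how the relation types listed above could support search, assume each model record carries a list of datasets, each with a relation value and a bounding box. Nothing here is part of MLM or ML-AOI today: the `DatasetRelation` enum, its string values, the `datasets`/`relation`/`bbox` record layout, and both helper functions are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Any, Sequence


class DatasetRelation(str, Enum):
    """Hypothetical relation types between a dataset and a model."""

    PRETRAINING = "pre-training"
    SUPERVISED_FINETUNING = "supervised-finetuning"
    POST_TRAINING = "post-training"
    VALIDATION = "validation"
    TEST = "test"


def bboxes_intersect(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if two [west, south, east, north] boxes overlap or touch.

    Boxes crossing the antimeridian are not handled in this sketch.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def models_with_relation(
    models: list[dict[str, Any]],
    relation: DatasetRelation,
    aoi_bbox: Sequence[float],
) -> list[str]:
    """Return ids of models having a dataset of `relation` that intersects the AOI."""
    return [
        model["id"]
        for model in models
        if any(
            d["relation"] == relation.value and bboxes_intersect(d["bbox"], aoi_bbox)
            for d in model.get("datasets", [])
        )
    ]


# Made-up records: model-a was pre-trained globally and tested locally,
# model-b was tested somewhere else.
models = [
    {
        "id": "model-a",
        "datasets": [
            {"relation": "pre-training", "bbox": [-180, -90, 180, 90]},
            {"relation": "test", "bbox": [0, 0, 10, 10]},
        ],
    },
    {"id": "model-b", "datasets": [{"relation": "test", "bbox": [20, 20, 30, 30]}]},
]

print(models_with_relation(models, DatasetRelation.TEST, [5, 5, 6, 6]))
```

With per-relation footprints like this, "tested in my area" and "trained in my area" become distinct queries instead of both mapping onto a single loosely defined model geometry.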