radiantearth / geo-ml-model-catalog

Geospatial ML Model Catalog Spec
Apache License 2.0

Should GMLMC be a STAC extension? #30

Open duckontheweb opened 2 years ago

duckontheweb commented 2 years ago

From @ymoisan in response to #25:

That's quite a thorough assessment @m-mohr :-). One question @duckontheweb had on Slack is whether GMLMC should be a STAC extension. The number of times you mention STAC in your comment suggests we could consider implementing a model catalog as a STAC extension. Samuel at CRIM came up with a DLM STAC extension (DLM = Deep Learning Model). Do you think that makes sense?

I have no strong stance on the subject. I figured since a model catalog will eventually point to model "items" and those items are linked with training data -- the geographical and temporal envelopes of which seem to qualify model items as spatio-temporal assets -- then the catalog could link to STAC (DLM or other extension) model items and therefore be a STAC Catalog. Eager to see what others think.

duckontheweb commented 2 years ago

From @m-mohr in response to above comments:

I was referring to STAC quite a lot because I've co-authored it and it is stable, so there's a lot to learn from it and its history. That doesn't necessarily imply it should be "merged". I'm not sure whether it should be a STAC extension or not, but aligning the specs always makes sense to potentially re-use code etc. I think I need to dig deeper into this to figure out whether it would make sense to make it an extension. Are models "spatio-temporal assets"? Would models be STAC Items, STAC Catalogs or Collections? I'm not sure about either question yet and have pros and cons for all options. The biggest advantage would be that there is much tooling out there for STAC and it could help with adoption. But it's indeed a key point for discussion, I think. And it also deserves a separate issue...

How does the DLM STAC extension relate to GMLMC? (By the way, this abbreviation is always confusing me...)

duckontheweb commented 2 years ago

From @ymoisan in response to above comments:

Are models "spatio-temporal assets"? (Would models be STAC Items ?)

That's the issue :-). You could argue that, because models are trained with images, which are obviously spatio-temporal assets, their application domain (when it comes to inference) probably depends on the spatio-temporal envelopes of the training assets, and therefore models could "inherit" their spatio-temporal properties from their training data set. Or, if a model generalizes really well, its spatio-temporal properties may only be a hint for users.

How does the DLM STAC extension relate to GMLMC?

Since the catalog is an endpoint to a group of models, we need a way to describe those models so we can eventually make queries to an API, e.g. something like

"Here's an image from sensor S of area X at time T; what models could I use to extract buildings and roads from that image?".

Does it make sense for models to be described as STAC [DLM] items which are then referenced in catalogs/collections?
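To make the example query above concrete, here is a minimal sketch of how it could be answered if models were described as STAC-like Items. It assumes the Item's `bbox` and `start_datetime`/`end_datetime` describe where and when the model applies, which is itself an open question in this thread. The `example:sensors` and `example:classes` properties, the item ids, and the `find_models` helper are all hypothetical and are not taken from the DLM extension or from GMLMC.

```python
from datetime import datetime, timezone


def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse(ts):
    """Parse an RFC 3339 timestamp ending in 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def find_models(items, bbox, when, sensor, targets):
    """Return ids of model items matching area, time, sensor and target classes."""
    matches = []
    for item in items:
        props = item["properties"]
        in_time = parse(props["start_datetime"]) <= when <= parse(props["end_datetime"])
        if (
            bbox_intersects(item["bbox"], bbox)
            and in_time
            and sensor in props["example:sensors"]          # hypothetical field
            and set(targets) <= set(props["example:classes"])  # hypothetical field
        ):
            matches.append(item["id"])
    return matches


# Two made-up model items, for illustration only.
MODEL_ITEMS = [
    {
        "id": "buildings-roads-v1",
        "bbox": [-80.0, 43.0, -70.0, 48.0],
        "properties": {
            "start_datetime": "2018-01-01T00:00:00Z",
            "end_datetime": "2020-12-31T23:59:59Z",
            "example:sensors": ["sentinel-2"],
            "example:classes": ["building", "road", "water"],
        },
    },
    {
        "id": "landcover-eu-v2",
        "bbox": [5.0, 45.0, 15.0, 55.0],
        "properties": {
            "start_datetime": "2019-01-01T00:00:00Z",
            "end_datetime": "2021-12-31T23:59:59Z",
            "example:sensors": ["sentinel-2"],
            "example:classes": ["forest", "crop"],
        },
    },
]

result = find_models(
    MODEL_ITEMS,
    bbox=[-75.8, 45.3, -75.6, 45.5],                  # area X
    when=datetime(2019, 7, 1, tzinfo=timezone.utc),   # time T
    sensor="sentinel-2",                              # sensor S
    targets=["building", "road"],
)
print(result)
```

Only the spatial and temporal filters correspond to fields a STAC Item already has; the sensor and class filters would need to come from whatever extension or spec describes the model.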

duckontheweb commented 2 years ago

Are models "spatio-temporal assets"? (Would models be STAC Items ?)

That's the issue :-). You could argue that, because models are trained with images, which are obviously spatio-temporal assets, their application domain (when it comes to inference) probably depends on the spatio-temporal envelopes of the training assets, and therefore models could "inherit" their spatio-temporal properties from their training data set. Or, if a model generalizes really well, its spatio-temporal properties may only be a hint for users.

I also do not have a strong opinion on this and would love to hear more discussion, but I tend to lean towards not making this a STAC extension. My feeling is that while some of the things associated with a model (e.g. training data) represent spatiotemporal assets, the model itself is not a spatiotemporal asset.

Additionally, there does not seem to be one single, obvious interpretation of the spatial and temporal properties of an ML model represented as an Item. The suggestion here is that it could represent the data on which the model was trained; however, when we posed this question to others they felt that it should represent the area over which the model should be used (which may or may not be the same as the area over which it was trained). It seems like it might be clearer to structure the GMLMC as a separate spec that makes reference to and aligns with STAC, but is not a STAC extension.

The downside of this, of course, is that you lose the ability to just add the model as a STAC Item in your STAC API and allow people to search for it using the typical search queries...
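The ambiguity between the two interpretations can be illustrated with a small sketch. It assumes a standalone record that keeps the training extent and the recommended-use extent as two separate fields, rather than overloading a single Item `bbox`. The `ModelRecord` class, its field names, and the sample values are hypothetical and do not come from the GMLMC spec.

```python
from dataclasses import dataclass
from typing import List

BBox = List[float]  # [west, south, east, north]


@dataclass
class ModelRecord:
    """Hypothetical model description with two distinct spatial extents."""
    id: str
    training_bbox: BBox      # where the training data was collected
    recommended_bbox: BBox   # where the authors suggest using the model


def contains(outer: BBox, inner: BBox) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


record = ModelRecord(
    id="example-model",
    training_bbox=[-80.0, 43.0, -70.0, 48.0],
    recommended_bbox=[-95.0, 40.0, -60.0, 55.0],
)

area_of_interest = [-90.0, 45.0, -89.0, 46.0]

# The same area gives different answers depending on which extent is consulted.
print(contains(record.training_bbox, area_of_interest))
print(contains(record.recommended_bbox, area_of_interest))
```

A single Item geometry would force a choice between these two answers, which is the lack of an obvious interpretation described above.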

duckontheweb commented 2 years ago

cc: @sfoucher

ymoisan commented 2 years ago

For reference: https://github.com/sfoucher/dlm-extension

sfoucher commented 2 years ago

My feeling is that while some of the things associated with a model (e.g. training data) represent spatiotemporal assets, the model itself is not a spatiotemporal asset.

I beg to differ. Let's say you are training a model only on images acquired in summer: this model will fail miserably on images acquired in winter, and this information must appear somewhere so that the user is aware of this limitation. In my opinion, a model strongly captures the properties of the underlying training set.
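As a rough sketch of how such a limitation could be surfaced to users, the check below assumes the model metadata records the calendar months covered by the training imagery. The `outside_training_season` helper and the `SUMMER_MONTHS` set are hypothetical; the months chosen assume a Northern Hemisphere summer.

```python
from datetime import date

# Assumption: training imagery was acquired in June through August only.
SUMMER_MONTHS = {6, 7, 8}


def outside_training_season(training_months, acquired):
    """True if an image's acquisition month was not seen during training."""
    return acquired.month not in training_months


winter_image = date(2021, 1, 15)
summer_image = date(2021, 7, 15)

for acquired in (winter_image, summer_image):
    if outside_training_season(SUMMER_MONTHS, acquired):
        print(f"{acquired}: outside the months covered by the training data")
    else:
        print(f"{acquired}: within the months covered by the training data")
```

A month set captures seasonality that a single start/end interval cannot, since a model trained on several summers would otherwise appear to cover the winters in between.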

ymoisan commented 2 years ago

From the README, Purpose section

... At a high level, this specification should provide sufficient information to enable search and discovery of geospatial ML models and answer the following questions: ... Is this model applicable to the geographic region I am interested in?

That suggests spatial dependency, so we may consider models to be at least "SAC". Training a model to detect vegetation will depend greatly on the phenological stage of the vegetation cover (e.g. leaf off or on), which adds a temporal dependency. For that particular case, that makes model items true STAC items to me.

Just as STAC Catalogs/Collections are pointing to STAC Items, it seems logical to me for a model catalog to point to model items of sorts. Basically all information pertaining to models would be stored in model items. So it's more than "takes advantage of the STAC specification to describe training data associated with a model" (README again). Eager to see what others think.