radiantearth / geo-ml-model-catalog

Geospatial ML Model Catalog Spec
Apache License 2.0
52 stars 8 forks source link

First impressions #25

Open m-mohr opened 3 years ago

m-mohr commented 3 years ago

I've read the spec for the first time and just would like to capture the first thoughts:

  1. version: Proposal to rename to gmlmc_version. Once you mix different specifications into a single one it helps to avoid conflicts.
  2. Why do two fields have a model_ prefix and the other don't have it. Seems a bit inconsistent. I see a justification for model_type if another "type" gets introduced. I could imaging that a general "type": "Model" could be nice to distinguish files with similar semantics, e.g. as in STAC.
  3. In STAC you can provide links, which I think is useful for this kind of data too. Would suggest to add the same link structure as in STAC. For example, it would be useful for more details about the license field (e.g. if set to properietary or various).
  4. Do I understand correctly that model_type needs to be a string listed in the section Model Type? How can three be required at the same time if it's a single string that is allowed?
  5. contacts has email required. I'd go the STAC way here to allow (and require?) a URL to be able to link to a more detailed (contac?) page where you usually can get more details. Requiring e-mail could is not so nice because some people may not like to provide one in an easily machine readable way to avoid spamming etc. Why is organization different from name? Usually it seems a general name property is enough to fullfill both personal and organizational contributions. Also, names (and emails) get quickly outdated with people switching companies etc. Roles from STAC could also be useful (host, creator, licensor, etc). So proposal is name (required), url (maybe required), roles (maybe required) and email (not required).
  6. Now that https://citation-file-format.github.io/ is available, I'm wondering whether we should not invent our own citation field and instead link to a cff file. (This also applies to STAC, where I'm wondering whether to deprecate the sci extension).
  7. The fields for "The data type of the parameter" should be an enum of values like data_type in STACs raster extension. Otherwise you get a lot of different values for the same data types.
  8. programming_language in Environment should try to also have a list of values pre-defined. In openEO we recommend this list: https://www.scriptol.com/programming/list-programming-languages.php
  9. dependencies as a string seems hard to parse. I think a more machine-readable way would be good. Another idea would be to link to common dependency files like package.json in JS, composer.json in PHP or requirements.txt(?) in Python
  10. A indicative list of values for the OS would be good, too.
  11. Wondering if number_of_cores should provide a minimum value and a recommended value?
  12. In other specs it was useful to allow CommonMark for all description fields.
  13. It's a bit confusing that the Model Runtime spec says "The spec currently only supports models serialized in the ONNX format.", but then there's also Docker?
  14. The Model Runtimes use link as a field for URLs. I'd make it url throughout the spec for consistency (or allow links link in STAC).
  15. code_examples: aligning with STAC would be nice: https://github.com/radiantearth/stac-spec/issues/992 (and eventually write an "extension" for it in STAC, too).
  16. Model Runtime Fields: If type already needs to be set to docker or onnx, why provide a format field with the same values in properties?
  17. Usage Conditions has only spatial and temporal extents, but if you work on multi-dimensional data cubes that's probably not enough: Would it make sense here to re-use a modified version of the data cube extension?
duckontheweb commented 3 years ago

Thanks @m-mohr , this is all really good feedback. I think I will break most of these out into separate issues so we can have more targeted discussions around them and start getting some PRs in to fix/change things.

ymoisan commented 3 years ago

That's quite a thorough assessment @m-mohr :-). One question @duckontheweb had on Slack has to do with whether or not GMLMC should be a STAC extension or not. The number of times you mention STAC in your comment kind of suggests we could consider the implementation of a model catalog to be a STAC extension. Samuel at CRIM came up with a DLM STAC extension (DLM = Deep Learing Model). You think that makes sense ?

I have no strong stance on the subject. I figured since a model catalog will eventually point to model "items" and those items are linked with training data -- the geographical and temporal envelopes of which seem to qualify model items as spatio-temporal assets -- then the catalog could link to STAC (DLM or other extension) model items and therefore be a STAC Catalog. Eager to see what others think.

m-mohr commented 3 years ago

Yeah, I wasn't sure whether I should "spam" you with 17 issues or put it just in one. Sorry for the inconvenience.

I was refering to STAC quite a lot because I've co-authored it and it is stable, so there's a lot to learn from it and it's history. That doesn't necessarily imply it should be "merged". I'm not sure whether it should be a STAC extension or not, but aligning the specs always makes sense to potentially re-use code etc. I think I need to dig deeper into this to figure out whether it would make sense to make it an extension. Are models "spato-temporal assets"? Would models be STAC Items, STAC Catalogs or Collections? I'm not sure on both yet and have pros and cons for all options. The biggest advantage would be that there is much tooling out there for STAC and it could help with adoption. But it's indeed a key point for discussion, I think. And also deserves a separate issue...

How does the DLM STAC extension relate to GMLMC? (By the way, this abbreviation is always confusing me...)

ymoisan commented 3 years ago

> Are models "spatio-temporal assets"? (Would models be STAC Items ?) That's the issue :-). You could think by virtue of being trained with images, which are obviously spatio-temporal assets, then their application domain (when it comes to inference) is probably dependent on the spatio-temporal envelopes of the set of training spatio-temporal assets and therefore models could "inherit" their spatio-temporal properties from their training data set. Or if your model generalizes really well then maybe their spatio-temporal properties only becomes a hint for users.

> How does the DLM STAC extension relate to GMLMC? Since the catalog is an endpoint to (a group) of models, then we need a way to describe those models so we can eventually make queries to an API, e.g. something like

"Here's an image from sensor S of area X at time T; what models could I use to extract buildings and roads from that image?".

Does it make sense that models be described as STAC [DLM] items which are then referenced in catalogs/collections ?

duckontheweb commented 3 years ago

Are models "spatio-temporal assets"? (Would models be STAC Items ?) That's the issue :-). You could think by virtue of being trained with images, which are obviously spatio-temporal assets, then their application domain (when it comes to inference) is probably dependent on the spatio-temporal envelopes of the set of training spatio-temporal assets and therefore models could "inherit" their spatio-temporal properties from their training data set. Or if your model generalizes really well then maybe their spatio-temporal properties only becomes a hint for users.

How does the DLM STAC extension relate to GMLMC? Since the catalog is an endpoint to (a group) of models, then we need a way to describe those models so we can eventually make queries to an API, e.g. something like

"Here's an image from sensor S of area X at time T; what models could I use to extract buildings and roads from that image?".

Does it make sense that models be described as STAC [DLM] items which are then referenced in catalogs/collections ?

I think this is an important enough discussion that I'd like to move it to its own thread. I'll start a new issue and past in comments from this one.

duckontheweb commented 3 years ago

Moved the STAC Extension discussion to #30

duckontheweb commented 3 years ago

4. Do I understand correctly that model_type needs to be a string listed in the section Model Type? How can three be required at the same time if it's a single string that is allowed?

@m-mohr Actually, the top-level model_type field is an object with the fields described in the Model Type section. Some of those fields (e.g. learning_approach, prediction_type) have suggested/encouraged values, but others are just free text. Maybe there is a way to make this more clear in the docs?

m-mohr commented 3 years ago

@m-mohr Actually, the top-level model_type field is an object with the fields described in the Model Type section. Some of those fields (e.g. learning_approach, prediction_type) have suggested/encouraged values, but others are just free text. Maybe there is a way to make this more clear in the docs?

Ah, I see. In STAC we append "Object" to the types, so "Model Type Object" to make it more clear.

duckontheweb commented 3 years ago

That makes sense. I’ll go through and try to be consistent about that throughout the spec