mintproject / Mint-ModelCatalog-Ontology

Model Catalog Ontology
What is a model? #34

Closed khider closed 4 years ago

khider commented 5 years ago

We need a definition of model vs. data transformation.

From @dnfeldman:

"In terms of making the distinction between a transformation and a model, I don’t think it should be based on what needs to be displayed vs not because 1) that decision boundary will be subjective and 2) at some point, someone will want to take a peek inside anyways because Joshua & Co are against black boxes (despite preaching simplicity).

The word ‘model’, as I’ve discovered, means totally different things to different people, so it’s definitely important to come up with an unambiguous definition. For that we usually turn to math. And that’s where my “reversible = transformation; irreversible = model” is coming from. Sure, some modelers might object that PIHM is not in the same category of models as e.g. Linear Regression, but at least that distinction is intuitive and easy to explain to anyone.

And in terms of what needs to be displayed - there are several approaches:

I’m sure there are other ways in which we can reduce the visual clutter and cognitive overhead, but I guess what I’m saying is that the decision of what to display vs what to hide should come from the intrinsic properties of the things we are displaying instead of extrinsic (and subjective) term definitions."

Answer:

dnfeldman commented 5 years ago

Perhaps I should clarify a bit what I mean by 'reversibility'. To me, a model/transformation (call it operation Op) takes input data + configuration parameters and produces some output. A reversible Op is one where, if you know the configuration parameters and the output, you can recover the input data exactly.

So for linear regression, if your Op takes input data x, y, configuration parameter {} (blank) and outputs values for a, b, and let's say some sort of error measurement r, then this Op is irreversible. But if your Op takes input data x, configuration parameter {a: 3, b: 10} and produces y = 3 * x + 10, then it is reversible.
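To make the two cases concrete, here is a minimal Python sketch. The function names (`apply_line`, `invert_line`, `fit_line`) and the sample data are made up for illustration, and r is assumed to be a sum of squared residuals, since the comment above only says "some sort of error measurement". Note that the second Op is only reversible when a is non-zero.

```python
def apply_line(x, a, b):
    """Reversible Op: input data x, configuration {a, b}, output y = a * x + b."""
    return [a * xi + b for xi in x]


def invert_line(y, a, b):
    """Recover x from the output and the configuration (requires a != 0)."""
    return [(yi - b) / a for yi in y]


def fit_line(x, y):
    """Irreversible Op: input data (x, y), empty configuration, output (a, b, r).

    Ordinary least squares; r is the sum of squared residuals (an assumption).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return a, b, r


# Reversible: knowing {a: 3, b: 10} and the output gives x back.
x = [1, 2, 3]
y = apply_line(x, a=3, b=10)
recovered = invert_line(y, a=3, b=10)

# Irreversible: two different datasets produce the same (a, b, r),
# so the output alone cannot tell us which dataset went in.
fit_1 = fit_line([0, 1, 2], [10, 13, 16])
fit_2 = fit_line([0, 2, 4], [10, 16, 22])
```

Here `recovered` matches the original `x`, while `fit_1` and `fit_2` are identical even though the inputs differ, which is what makes the fitting Op irreversible in this sense.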

I suppose in that sense this is somewhat similar to Yolanda's definition?

dgarijo commented 5 years ago

Relative humidity can be calculated from precipitation and temperature, and it is reversible.

Infiltration rate may be approximated by a user according to precipitation (e.g., inf = 1/4*precipitation).

Both are models, and yet, they are reversible.

We can provide a definition, but if there is no clear consensus I think we should let the modeler who adds the model in the model catalog decide. If they consider that what they are doing is a data transformation, then we can label it as such.

dnfeldman commented 5 years ago

Hmm, so if the same operation (e.g., y = 1/4 * x) can be both a model and a transformation, depending on how the input/output variables are named, why do we need to make a distinction between the two in the first place? Can we call everything a model and avoid making the end-user learn extra terminology?

khider commented 5 years ago

Everything is reversible to some extent. If a closed-form solution exists, as in the case of a linear regression, then it’s very easy to reverse. If the solution is a numerical approximation, then it’s harder.

Complex models tend to be numerical, hence “irreversible”.

The reason we care is whether the user would need to provide the data transformation. My understanding is that we were supposed to do this automatically. I think we all realize that it’s more complex, and maybe we should call all these things models.

But we also need to decide what level of complexity to expose in the UI. So how do we say that PIHM is important to show but FLDAS to PIHM isn’t?

dnfeldman commented 5 years ago

"Everything is reversible to some extent. If a closed-form solution exists, as in the case of a linear regression, then it’s very easy to reverse. If the solution is a numerical approximation, then it’s harder.

Complex models tend to be numerical, hence “irreversible”."

I think we need to be a bit more careful here - our models/transformations don't always operate on numbers. And even when they do, we are not always dealing with "physics-based algebra" (for lack of a better term). For example, PIHM_GIS can be thought of as joining several datasets together (DEM, weather, soil, land cover, etc.) to generate a triangular mesh. Using relational algebra, this can be expressed in a closed form, but the operation is definitely not invertible. Other models (like fuzzy search, for example) don't even operate on numbers. All of this is to strengthen your point above (that it's possible to argue that anything can be a model)...
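As a toy illustration of a closed-form but non-invertible operation, here is a sketch of an inner join in Python. It is not how PIHM_GIS works; the table contents, cell names, and the `inner_join` helper are all hypothetical.

```python
def inner_join(left, right):
    """Join two {key: value} tables on the keys they share."""
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}


# Hypothetical inputs: elevation per cell, and two different soil tables.
elevation = {"cell_1": 120, "cell_2": 95}
soil_a = {"cell_1": "loam"}
soil_b = {"cell_1": "loam", "cell_3": "clay"}

joined_a = inner_join(elevation, soil_a)
joined_b = inner_join(elevation, soil_b)
```

Both joins yield the same single row for `cell_1`, so the output cannot tell us whether the soil table contained `cell_3` or that the elevation table contained `cell_2`. The operation is easy to write down but loses information about its inputs.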

So then, instead of thinking about this in terms of models vs transformations (or complex vs simple), what if we take advantage of the DAG structure of workflows to hide 'unimportant' edges? For example, if a parent node P has only one child C, which only has P as a parent, we can collapse that edge into a single node. This situation would happen if an operation has a prerequisite. For instance, PIHM_GIS is only ever run as a prerequisite for PIHM, so run PIHM_GIS -> run PIHM can be absorbed into a single run PIHM node.
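A rough Python sketch of that collapsing rule, assuming the workflow is given as a set of (parent, child) edges. The function name `collapse_prerequisites`, the choice to keep the child's name for the merged node, and the two downstream nodes in the example are all assumptions made for illustration.

```python
def collapse_prerequisites(edges):
    """Collapse every edge P -> C where C is P's only child and P is C's only parent.

    edges: iterable of (parent, child) pairs describing a DAG.
    Returns (collapsed_edges, absorbed), where absorbed maps each surviving
    node to the list of prerequisite nodes that were folded into it.
    """
    edges = set(edges)
    absorbed = {}
    changed = True
    while changed:
        changed = False
        # Rebuild the adjacency maps from the current edge set.
        children, parents = {}, {}
        for p, c in edges:
            children.setdefault(p, set()).add(c)
            parents.setdefault(c, set()).add(p)
        for p, kids in children.items():
            if len(kids) != 1:
                continue
            (c,) = kids
            if parents[c] != {p}:
                continue
            # Fold P into C: drop the edge and redirect P's incoming edges to C.
            edges.discard((p, c))
            edges = {(gp, c) if node == p else (gp, node) for gp, node in edges}
            absorbed[c] = absorbed.pop(p, []) + [p] + absorbed.get(c, [])
            changed = True
            break  # adjacency maps are stale; rebuild before continuing
    return edges, absorbed


# PIHM_GIS is only a prerequisite of PIHM; the downstream nodes are hypothetical.
workflow = {
    ("PIHM_GIS", "PIHM"),
    ("PIHM", "downstream_a"),
    ("PIHM", "downstream_b"),
}
collapsed, absorbed = collapse_prerequisites(workflow)
```

In this example the PIHM_GIS -> PIHM edge disappears and `absorbed` records that PIHM now stands for both steps, so a UI could still let someone expand the node and peek inside. PIHM itself is not merged further because it has two children.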