What is a model? - Githubissues

khider commented 5 years ago

We need a definition of model vs data transformation

From @dnfeldman:

"In terms of making the distinction between a transformation and a model, I don’t think it should be based on what needs to be displayed vs not because 1) that decision boundary will be subjective and 2) at some point, someone will want to take a peek inside anyways because Joshua & Co are against black boxes (despite preaching simplicity).

The word ‘model’, as I’ve discovered, means totally different things to different people, so it’s definitely important to come up with an unambiguous definition. For that we usually turn to math. And that’s where my “reversible = transformation; irreversible = model” is coming from. Sure, some modelers might object that PIHM is not in the same category of models as e.g. Linear Regression, but at least that distinction is intuitive and easy to explain to anyone.

And in terms of what needs to be displayed - there are several approaches:

- Rely on the UI to help hide complexity (e.g., have functionality that allows workflow nodes to be grouped and collapsed)
- Automatically group all prerequisites for a given model (so if PIHM requires running evaporation or something first, it should be automatically included as part of the “Run PIHM” node)

I’m sure there are other ways in which we can reduce the visual clutter and cognitive overhead, but I guess what I’m saying is that the decision of what to display vs what to hide should come from the intrinsic properties of the things we are displaying instead of extrinsic (and subjective) term definitions."

Answer:

The definition of reversibility vs irreversibility is not incompatible with what a user would want to see. If something is irreversible, I usually want to know about it. I agree that the definition shouldn't be based solely on what the user wants to see but I think we need to take into consideration that this is something they really care about.
Inverse modeling is also a branch of modeling. As its name indicates, it tries to solve the inverse problem. So for a linear regression, the forward model is y = ax + b and the inverse solution is x = (y - b)/a. Is linear regression therefore really a model? It's reversible (in an inverse modeling sense). A lot of models are not easily reversible (e.g., PIHM), but I'm a little hesitant to use irreversibility as the definition.
I like the definition from @yolandagil from a year ago: models create new variables; data transformations do not.

dnfeldman commented 5 years ago

Perhaps I should clarify a bit what I mean by 'reversibility'. To me, a model/transformation (call it operation Op) takes input data + configuration parameters and produces some output. A reversible Op is one where, if you know the configuration parameters and the output, you can recover the input data exactly.

So for linear regression, if your Op takes input data x, y, configuration parameters {} (blank) and outputs values for a, b, and let's say some sort of error measurement r, then this Op is irreversible. But if your Op takes input data x, configuration parameters {a: 3, b: 10} and produces y = 3 * x + 10, then it is reversible.
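A minimal sketch of that distinction in Python, assuming an Op is nothing more than a function of input data and configuration parameters. The names apply_line, invert_line, and fit_line are made up for illustration and are not part of any MINT code.

```python
def apply_line(x, config):
    """Op with input data x and configuration {a, b}; outputs y = a*x + b."""
    return [config["a"] * xi + config["b"] for xi in x]


def invert_line(y, config):
    """Recover the input data from the output and the configuration (needs a != 0)."""
    return [(yi - config["b"]) / config["a"] for yi in y]


def fit_line(x, y):
    """Op with input data (x, y) and no configuration; outputs a, b and squared error r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return a, b, r


config = {"a": 3, "b": 10}
x = [0.0, 1.0, 2.0]
y = apply_line(x, config)
assert invert_line(y, config) == x  # reversible: input data recovered exactly

# Irreversible: two different datasets produce the same (a, b, r),
# so the output alone cannot tell us which dataset went in.
fit_1 = fit_line([0.0, 1.0, 2.0], [10.0, 13.0, 16.0])
fit_2 = fit_line([0.0, 2.0, 4.0], [10.0, 16.0, 22.0])
assert fit_1 == fit_2
```

The same line y = 3 * x + 10 shows up in both cases; what changes is which pieces count as input data and which as configuration.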

I suppose in that sense this is somewhat similar to Yolanda's definition?

dgarijo commented 5 years ago

Relative humidity can be calculated from precipitation and temperature, and it is reversible.

Infiltration rate may be approximated by a user according to precipitation (e.g., inf = 1/4*precipitation).

Both are models, and yet, they are reversible.

We can provide a definition, but if there is no clear consensus I think we should let the modeler who adds the model in the model catalog decide. If they consider that what they are doing is a data transformation, then we can label it as such.

dnfeldman commented 5 years ago

Hmm, so if the same operation (e.g., y = 1/4 * x) can be both a model and a transformation, depending on how the input/output variables are named, why do we need to make a distinction between the two in the first place? Can we call everything a model and avoid making the end-user learn extra terminology?

khider commented 5 years ago

Everything is reversible to some extent. If a closed-form solution exists, as in the case of a linear regression, then it’s very easy to reverse. If the solution is a numerical approximation, then it’s harder.

Complex models tend to be numerical, hence “irreversible”.

The reason we care is whether the user would need to provide data transformations. My understanding is that we were supposed to do this automatically. I think we all realize that it’s more complex, and maybe we should call all these things models.

But we also need to decide what level of complexity to expose in the UI. So how do we say that PIHM is important to show but FLDAS to PIHM isn’t?

dnfeldman commented 5 years ago

> Everything is reversible to some extent. If a closed-form solution exists, as in the case of a linear regression, then it’s very easy to reverse. If the solution is a numerical approximation, then it’s harder.
>
> Complex models tend to be numerical, hence “irreversible”.

I think we need to be a bit more careful here - our models/transformations don't always operate on numbers. And even when they do, we are not always dealing with "physics-based algebra" (for lack of a better term). For example, PIHM_GIS can be thought of as joining several datasets together (DEM, weather, soil, land cover, etc.) to generate a triangular mesh. Using relational algebra, this can be expressed in a closed form, but the operation is definitely not invertible. Other models (like fuzzy search, for example) don't even operate on numbers. All of this is to strengthen your point above (that it's possible to argue that anything can be a model)...
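To illustrate the join point with a toy example (this is not what PIHM_GIS actually does; the tables and the inner_join helper are hypothetical): an inner join has a simple closed form, yet two different inputs can produce the same output, so the output cannot be inverted.

```python
def inner_join(left, right):
    """Relational inner join of two {key: value} tables on their keys."""
    return {k: (left[k], right[k]) for k in left if k in right}


# Hypothetical per-cell attribute tables.
soil = {"cell_1": "clay", "cell_2": "sand"}
land_cover_a = {"cell_1": "forest", "cell_3": "crop"}
land_cover_b = {"cell_1": "forest", "cell_4": "urban"}

joined_a = inner_join(soil, land_cover_a)
joined_b = inner_join(soil, land_cover_b)

# Rows without a match are dropped, so both inputs collapse to the same result
# and the original land cover table cannot be recovered from it.
assert joined_a == joined_b
```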

So then, instead of thinking about this in terms of models vs transformations (or complex vs simple), what if we take advantage of the DAG structure of workflows to hide 'unimportant' edges? For example, if a parent node P has only one child C, which only has P as a parent, we can collapse that edge into a single node. This situation would happen if an operation has a prerequisite. For instance, PIHM_GIS is only ever run as a prerequisite for PIHM, so run PIHM_GIS -> run PIHM can be absorbed into the run PIHM node.
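A rough sketch of that collapsing rule in Python, assuming the workflow is given as a list of (parent, child) edges. The function name collapse_prerequisites and the downstream nodes "run model A" and "run model B" are hypothetical and only there to make the example concrete.

```python
from collections import defaultdict


def collapse_prerequisites(edges):
    """Merge every edge P -> C where C is P's only child and P is C's only parent.

    Returns (remaining_edges, groups); groups maps each surviving node that
    appears in an edge to the original nodes it stands for, in execution order.
    """
    edges = set(edges)
    groups = {node: [node] for edge in edges for node in edge}
    while True:
        children = defaultdict(set)
        parents = defaultdict(set)
        for p, c in edges:
            children[p].add(c)
            parents[c].add(p)
        mergeable = next(
            ((p, c) for p, c in sorted(edges)
             if len(children[p]) == 1 and len(parents[c]) == 1),
            None,
        )
        if mergeable is None:
            return sorted(edges), groups
        p, c = mergeable
        # Absorb P into C: C takes over P's members and P's incoming edges.
        groups[c] = groups.pop(p) + groups[c]
        edges.remove((p, c))
        edges = {(a, c if b == p else b) for a, b in edges}


workflow = [
    ("run PIHM_GIS", "run PIHM"),
    ("run PIHM", "run model A"),  # hypothetical downstream node
    ("run PIHM", "run model B"),  # hypothetical downstream node
]
remaining, groups = collapse_prerequisites(workflow)
print(remaining)
print(groups["run PIHM"])
```

Here run PIHM_GIS disappears from the displayed graph and is listed as a member of the run PIHM node, while the two outgoing edges of run PIHM stay visible because that node has more than one child.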

mintproject / Mint-ModelCatalog-Ontology

What is a model? #34