open-connectome-classes / StatConn-Spring-2015-Info

introductory material

Graph equivalents of Overfitting and Underfitting #77

mrjiaruiwang opened this issue 9 years ago

mrjiaruiwang commented 9 years ago

We are trying to construct models of biological phenomena, and as with any model, I'm sure there is some form of "over" and "under" fitting. I'm just unsure what those would look like here, or how we could calculate or measure the degree of over- or under-fitting in our graphical models. By "fitting" I mean generating from the data some kind of conclusion about it. I know there is a test called cross-validation error; could that be adapted for graphs? One test I can think of off the top of my head: delete one node at random from the resulting map and regenerate the conclusion. If this new conclusion differs too much from the old one (before the random deletion), we are probably over-fitting. My motivation, I guess, is that in our real brains we lose neurons every day, yet our day-to-day functionality seems to be relatively untouched.
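
A minimal sketch of that node-deletion test, assuming the "conclusion" can be summarized by a numeric graph statistic (the statistic, function names, and graph sizes here are hypothetical choices, not anything prescribed in class):

```python
import numpy as np

def edge_density(A):
    """Fraction of possible (undirected, loop-free) edges that are present."""
    n = A.shape[0]
    return A[np.triu_indices(n, k=1)].mean()

def deletion_stability(A, stat=edge_density, n_trials=100, seed=None):
    """Average absolute change in `stat` after deleting one random vertex."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    baseline = stat(A)
    changes = []
    for _ in range(n_trials):
        keep = np.delete(np.arange(n), rng.integers(n))   # drop one vertex at random
        changes.append(abs(stat(A[np.ix_(keep, keep)]) - baseline))
    return np.mean(changes)

# Example: a 30-vertex Erdos-Renyi graph, which should be stable under single deletions.
rng = np.random.default_rng(0)
A = np.triu((rng.random((30, 30)) < 0.2).astype(int), 1)
A = A + A.T
print(deletion_stability(A, n_trials=200, seed=0))
```

Large values of this stability score would indicate that the "conclusion" hinges on individual vertices, in the spirit of the over-fitting concern above.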

jtmatterer commented 9 years ago

First off, by "graphical models" I'm assuming you mean probabilistic models on graphs, not the graphical models used in some areas of Bayesian statistics. In the case of a stochastic blockmodel, we've seen examples of both overfitting and underfitting in class. If each vertex is its own block, I would certainly describe that as overfitting the data; conversely, in most applications an Erdos-Renyi model would not capture much of the information the graph provides.
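
To make that contrast concrete, here is a rough parameter count for the undirected models mentioned (a back-of-the-envelope sketch; block memberships are treated as known, and the vertex count is just an illustrative choice):

```python
# Rough parameter counts for the undirected models discussed above.
def n_params_er():
    return 1                    # a single shared edge probability p

def n_params_sbm(k):
    return k * (k + 1) // 2     # entries of a symmetric block-probability matrix B

n = 70                          # illustrative number of vertices
print(n_params_er())            # 1    -> typically too coarse (underfitting)
print(n_params_sbm(3))          # 6    -> a middle ground
print(n_params_sbm(n))          # 2485 -> one block per vertex: roughly one
                                #         parameter per potential edge (overfitting)
```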

I'm not sure I understand the second half of your question.

maxcollard commented 9 years ago

There are two issues at play here. The first is over/underfitting, which results from a mismatch between the number of model parameters and the number of observed data points. Say we work in the space of graphs with 30 vertices. If you fit a categorical model (which has 2^(n^2) - 1 parameters) from N=10 subjects, then you're obviously overfitting profusely. Similarly, if you use an Erdos-Renyi model (which has 1 parameter) fit from N=2500 subjects, you're obviously losing information. Over/underfitting can be understood in much the same way as in other settings: the number of parameters to fit vs. the number of data points. A canonical test for this is to see whether your model changes dramatically when you leave out an entire subject (or group of subjects).
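
A minimal sketch of that leave-one-subject-out check, assuming the data arrive as a stack of adjacency matrices and using the simplest possible fit, a shared Erdos-Renyi edge probability (the array shapes and simulated data are illustrative, not from the course):

```python
import numpy as np

def fit_er(graphs):
    """MLE of a shared edge probability across subjects (undirected, no self-loops)."""
    n = graphs.shape[1]
    iu = np.triu_indices(n, k=1)
    return np.mean([A[iu].mean() for A in graphs])

def leave_one_out_changes(graphs, fit=fit_er):
    """Change in the fitted parameter when each subject is held out in turn."""
    full = fit(graphs)
    return np.array([abs(fit(np.delete(graphs, i, axis=0)) - full)
                     for i in range(graphs.shape[0])])

# Simulated example: N=10 subjects, 30 vertices, true edge probability 0.2.
rng = np.random.default_rng(1)
raw = (rng.random((10, 30, 30)) < 0.2).astype(int)
graphs = np.array([np.triu(A, 1) + np.triu(A, 1).T for A in raw])
print(leave_one_out_changes(graphs).max())  # small values suggest a stable fit
```

The same skeleton works for richer models; only `fit_er` (and the distance between fitted parameters) would need to change.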

The second issue, I think, is a little more profound than just overfitting: in all of our examples so far, we have simply assumed a consistent set of vertices and that our labeling of those vertices is accurate. It is certainly legitimate to ask: how do we know that our vertex assignments are at all accurate? For example, in fMRI, how do we know which voxels to group together and treat as "one node"? Jerry is, I think, asking for a quantitative measure of how much noise/inaccuracy in the vertex assignments a model can tolerate while still accurately representing the phenomena. That's a hard question.
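
One way to start quantifying that (a sketch only, not an established method; the functions and simulated two-block graph below are hypothetical): randomly perturb a fraction of the vertex-to-block assignments and measure how far the fitted block probabilities move.

```python
import numpy as np

def fit_block_probs(A, z, k):
    """Empirical edge probability between each pair of blocks (undirected graph)."""
    B = np.zeros((k, k))
    for a in range(k):
        for b in range(a, k):
            rows, cols = np.where(z == a)[0], np.where(z == b)[0]
            sub = A[np.ix_(rows, cols)]
            if a == b:
                vals = sub[np.triu_indices(rows.size, k=1)]  # avoid double-counting
            else:
                vals = sub.ravel()
            B[a, b] = B[b, a] = vals.mean() if vals.size else 0.0
    return B

def assignment_sensitivity(A, z, k, frac=0.1, n_trials=50, seed=None):
    """Mean change in fitted block probabilities when `frac` of labels are shuffled."""
    rng = np.random.default_rng(seed)
    base = fit_block_probs(A, z, k)
    diffs = []
    for _ in range(n_trials):
        z_noisy = z.copy()
        idx = rng.choice(z.size, size=max(1, int(frac * z.size)), replace=False)
        z_noisy[idx] = rng.integers(k, size=idx.size)   # reassign some vertices at random
        diffs.append(np.abs(fit_block_probs(A, z_noisy, k) - base).mean())
    return np.mean(diffs)

# Usage on a simulated two-block graph (30 vertices, within-block p=0.5, between p=0.1):
rng = np.random.default_rng(2)
z = np.repeat([0, 1], 15)
P = np.where(z[:, None] == z[None, :], 0.5, 0.1)
A = np.triu((rng.random((30, 30)) < P).astype(int), 1)
A = A + A.T
print(assignment_sensitivity(A, z, k=2, frac=0.1, seed=2))
```

A model whose fitted parameters barely move under this perturbation tolerates some inaccuracy in how vertices are grouped, which is one reading of the question above.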