open-connectome-classes / StatConn-Spring-2015-Info


Problem of Overfitting Data #104

Open akim1 opened 9 years ago

akim1 commented 9 years ago

For the problem of clustering, is there a rigorous framework for determining at what point your clustering algorithm is simply fitting noise or artifacts rather than meaningful structure? For instance, in the extreme case, assigning 10 clusters to a graph of 10 nodes won't yield much insight. Or is this something that is normally built into the penalty function?
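To make the "penalty function" idea concrete, here is a minimal sketch (my own illustration, not something from the course material) using BIC, which adds a complexity penalty proportional to the number of parameters times log(n), so extra clusters only help if the improvement in fit outweighs the penalty. It assumes scikit-learn, and the synthetic data with 3 true clusters is purely for demonstration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# synthetic data with 3 true clusters (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# BIC = -2*log-likelihood + (#parameters)*log(n); lower is better,
# so adding clusters is only rewarded when the fit genuinely improves
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 11)}
print("k with lowest BIC:", min(bics, key=bics.get))  # typically 3 here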

ajulian3 commented 9 years ago

Couldn't another algorithm be developed to remove, using the training data alone, all of the points that don't display meaningful structure? Specifically, since we are looking for a predominant pattern in the neural network data, we shouldn't use the unnecessary training points.

SandyaS72 commented 9 years ago

I'm guessing that if you looked at the distance between clusters versus within clusters, and there were real structure at some smaller number of clusters, you would see a big jump in quality when you reach that number (starting from 1 and increasing). Past that point, performance would change very little or might even decrease. If there is no jump until you reach a very high number of clusters, where you suspect you are just fitting noise, that in itself says something about the structure of your data: it might be, as you suggested, that some points need to be removed, or it could be that your data just doesn't have a strong cluster structure.
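A minimal sketch of that "look for the jump" heuristic, assuming scikit-learn and synthetic data (everything below is illustrative, not part of the original comment): run k-means for increasing k and watch where the within-cluster dispersion drops sharply.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# The "jump": the largest drop in inertia marks the k where real structure
# is captured; beyond it, additional clusters mostly absorb noise.
drops = np.diff(inertias)
print("largest improvement when moving past k =", int(np.argmin(drops)) + 1)
```

Silhouette scores or the gap statistic are more formal versions of the same within-cluster versus between-cluster comparison.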

mblohr commented 9 years ago

Some approaches to avoid overfitting in this case include setting k by cross-validation and removing noise, i.e., dropping each point whose surrounding points all belong to a different class.
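A rough sketch of both suggestions, assuming scikit-learn; the data, the choice of a Gaussian mixture as the model whose held-out likelihood is cross-validated, and the 5-neighbor rule for "surrounding points" are all my assumptions, not part of the comment.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold
from sklearn.neighbors import NearestNeighbors

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# (1) set k by cross-validation: average held-out log-likelihood of a mixture model
def cv_score(X, k, n_splits=5):
    scores = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        gm = GaussianMixture(n_components=k, random_state=0).fit(X[train_idx])
        scores.append(gm.score(X[test_idx]))  # mean held-out log-likelihood
    return np.mean(scores)

best_k = max(range(1, 8), key=lambda k: cv_score(X, k))
print("k chosen by cross-validation:", best_k)

# (2) remove noise: drop each point whose nearest neighbors all have a different label
_, idx = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)  # idx[:, 0] is the point itself
noisy = np.array([np.all(y[idx[i, 1:]] != y[i]) for i in range(len(X))])
X_clean, y_clean = X[~noisy], y[~noisy]
print("points flagged as noise:", int(noisy.sum()))
```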