thibautjombart / adegenet

adegenet: a R package for the multivariate analysis of genetic markers

Question about BIC #161

Closed dnlbunting closed 7 years ago

dnlbunting commented 7 years ago

Hello,

I’m trying to understand how the BIC is calculated in find.clusters, essentially this line in find.clust.R:

myStat <- N*log(c(WSS.ori,WSS)/N) + log(N) *c(1,nbClust)

As discussed in this Stack Overflow question (http://stackoverflow.com/questions/15839774/how-to-calculate-bic-for-k-means-clustering-in-r), the BIC for k-means clustering is given by:

BIC = WSS/N + log(N)*nbClust*d

So it looks to me like you are using log(WSS/N) rather than WSS/N. This means that at large N, as WSS -> 0, the log(WSS) term goes to -inf, which I think is not the expected behaviour.

Can you explain what I'm missing here?
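To make the comparison concrete, here is a minimal sketch that evaluates both forms on toy data. The simulated data, the use of base R `kmeans`, and the variable `d` are assumptions for illustration only; the first expression is the line quoted from find.clust.R, and the second is the form as written in this post.

```r
# Toy data: two well-separated groups in 2 dimensions (illustrative only).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
N <- nrow(x)

# Sum of squares around the grand mean, i.e. the WSS for a single cluster.
WSS.ori <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# Within-cluster sum of squares for k = 2..6, using base R kmeans.
nbClust <- 2:6
WSS <- sapply(nbClust, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Form quoted from find.clust.R: log of the mean squared residual.
bic_log <- N * log(c(WSS.ori, WSS) / N) + log(N) * c(1, nbClust)

# Form as written in this post; d is the number of dimensions (assumed here).
d <- ncol(x)
bic_so <- c(WSS.ori, WSS) / N + log(N) * c(1, nbClust) * d

print(round(rbind(k = c(1, nbClust), bic_log, bic_so), 2))
```

The two criteria weigh the fit term very differently: in the log form it is multiplied by N, while in the other form it is a plain mean squared distance, so they need not agree on the number of clusters.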

thibautjombart commented 7 years ago

I'm not sure about the stack overflow definition. The definition used in the DAPC paper is the same as: https://en.wikipedia.org/wiki/Bayesian_information_criterion

(see 'Gaussian special case')

The statement that WSS decreases with N is erroneous. WSS is a sum of N terms, therefore it increases with N. There is no theoretical value towards which (WSS/N) tends when N is large.

Makes sense?
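For reference, the Gaussian special case mentioned above can be written out as follows. This is a sketch under the assumption of independent Gaussian residuals with a common variance, with k standing for the number of free parameters (taken to be the number of clusters, as in the quoted line of code):

```latex
\begin{align*}
\ln \hat{L} &= -\frac{N}{2}\ln\!\left(2\pi\hat{\sigma}^2\right) - \frac{N}{2},
  \qquad \hat{\sigma}^2 = \frac{\mathrm{WSS}}{N} \\
\mathrm{BIC} &= k\ln N - 2\ln\hat{L}
  = N\ln\!\left(\frac{\mathrm{WSS}}{N}\right) + k\ln N + N\left(1 + \ln 2\pi\right)
\end{align*}
```

The last term does not depend on k, so dropping it leaves N log(WSS/N) + k log(N), which matches the expression N*log(WSS/N) + log(N)*nbClust in the code.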

dnlbunting commented 7 years ago

After reading more, it makes less sense. I couldn't find a definitive answer; different people seem to use slightly different forms for BIC. I guess it is because kmeans is a heuristic algorithm with no proper likelihood...

With regard to the WSS, surely when the number of clusters equals the number of data points, each cluster has zero WSS? My BIC graph looks like this:

[Image: bic]
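The limiting case described here can be checked directly without running k-means. The sketch below uses a hypothetical helper, `wss_from_labels`, and simulated data (both are assumptions for illustration) to compute the WSS when every observation is its own cluster, then plugs it into the log form of the criterion.

```r
# Hypothetical helper: within-cluster sum of squares for a given
# assignment of rows of x to clusters.
wss_from_labels <- function(x, labels) {
  groups <- split(as.data.frame(x), labels)
  sum(sapply(groups, function(g) {
    sum(scale(as.matrix(g), center = TRUE, scale = FALSE)^2)
  }))
}

set.seed(1)
N <- 20
x <- matrix(rnorm(2 * N), ncol = 2)

# One cluster containing everything: WSS is the total sum of squares.
wss_one <- wss_from_labels(x, rep(1, N))
bic_at_1 <- N * log(wss_one / N) + log(N) * 1

# Every observation in its own cluster: each point equals its centroid.
wss_all <- wss_from_labels(x, seq_len(N))
bic_at_N <- N * log(wss_all / N) + log(N) * N

print(c(wss_one = wss_one, wss_all = wss_all))
print(c(bic_at_1 = bic_at_1, bic_at_N = bic_at_N))
```

Here it is the number of clusters approaching N that drives the WSS to zero, with the number of observations N held fixed.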

thibautjombart commented 7 years ago

I'd be keen to see the other sources, please feel free to link to them here.

I don't think it has to do with heuristics or a proper likelihood though. Heuristics are commonplace when it comes to finding ML solutions.

I think you're confusing 'N' in WSS for the number of clusters, but it is the number of observations.

Your BIC graph is a different question, and a slightly odd one indeed. Hard to tell what is going on without looking at the data, but it would be worth increasing the number of runs of k-means (n.start) to gain a bit of stability. How many alleles and individuals are there in the data?
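The effect of using more starts can be illustrated with base R's `kmeans`, whose `nstart` argument keeps the best of several random initialisations. This is only a sketch on simulated data; it assumes that `n.start` in find.clusters plays the same role as `nstart` here, and the helper `run_kmeans` is hypothetical.

```r
# Toy data: three groups in 2 dimensions (illustrative only).
set.seed(42)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 3), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))

# Hypothetical helper: total within-cluster sum of squares for one
# k-means call with a given seed and number of random starts.
run_kmeans <- function(seed, nstart) {
  set.seed(seed)
  kmeans(x, centers = 5, nstart = nstart, iter.max = 100)$tot.withinss
}

seeds <- 1:20
wss_single <- sapply(seeds, run_kmeans, nstart = 1)
wss_multi  <- sapply(seeds, run_kmeans, nstart = 25)

# Spread of the WSS across seeds: a narrower range means a more stable curve.
print(range(wss_single))
print(range(wss_multi))
```

With a single start, the WSS (and therefore the BIC) at a given number of clusters can vary from run to run; keeping the best of many starts usually narrows that spread.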