gregversteeg / LinearCorex

Fast, linear version of CorEx for covariance estimation, dimensionality reduction, and subspace clustering with very under-sampled, high-dimensional data
Apache License 2.0

Finding the optimal number of hidden factors #4

Open RobertoNegro opened 5 years ago

RobertoNegro commented 5 years ago

Hello, my team and I are trying to understand how to choose the optimal number of hidden factors. From what we've found, the goal is to maximize the TC (Total Correlation). But after some trials with different settings, the value of the tc property always increases as the number of hidden factors increases. We also have doubts about the TCs property, since we're not sure what it means: in our runs, the median of the TCs decreases rapidly as the number of hidden factors increases, and we're not sure how to interpret that.

So, basically, the main problem is: how can we choose the optimal number of hidden factors?

Thank you, Roberto

[Figure: TC. X axis: number of hidden factors; Y axis: TC value]

[Figure: TCs median. X axis: number of hidden factors; Y axis: median of the TCs]

gregversteeg commented 4 years ago

Hi Roberto, sorry for taking so long to respond. You should look in the code for a method called "pick_n_hidden". It basically tries different numbers of factors and picks an optimal value.

Your experiment is absolutely correct though: the lower bound on TC only keeps going up! However, if you enforce that each variable has only one latent factor as a parent, this doesn't happen (the resulting value is accessible as corex.moments["TC_no_overlap"]). In that case, adding factors causes the TC (without overlaps) to go up, then plateau, and possibly go back down.
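
For anyone who wants to run the sweep by hand, here is a minimal sketch of the idea: score each candidate number of factors and keep the smallest one that reaches the plateau. This is not the library's pick_n_hidden; the names sweep_n_hidden, toy_score, and tol are hypothetical, and the toy scorer only imitates the rise/plateau/dip shape described above so that the snippet runs without data. The commented-out scorer shows how the TC without overlaps might be plugged in, assuming the usual Corex(n_hidden=...) constructor and fit method.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_n_hidden(score_for: Callable[[int], float],
                   candidates: Iterable[int],
                   tol: float = 0.0) -> Tuple[int, Dict[int, float]]:
    """Score every candidate and return the smallest number of factors
    whose score is within `tol` of the best score, plus all scores."""
    scores = {m: float(score_for(m)) for m in candidates}
    if not scores:
        raise ValueError("candidates must not be empty")
    best = max(scores.values())
    # Prefer the smallest model among those that reach the plateau.
    chosen = min(m for m, s in scores.items() if s >= best - tol)
    return chosen, scores


# Stand-in scorer (hypothetical): rises until 4 factors, stays flat,
# then dips slightly after 6, like the no-overlap TC described above.
def toy_score(m: int) -> float:
    return min(m, 4) - 0.1 * max(m - 6, 0)


# With LinearCorex the scorer might look like this (assumed usage, not run here):
#   import linearcorex as lc
#   def corex_score(m):
#       model = lc.Corex(n_hidden=m)
#       model.fit(X)  # X: samples-by-variables array
#       return model.moments["TC_no_overlap"]

chosen, scores = sweep_n_hidden(toy_score, range(1, 10))
print("chosen number of factors:", chosen)
```

Setting tol to a small positive value makes the choice less sensitive to run-to-run noise in the score, at the cost of sometimes picking a slightly smaller model.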