gregversteeg / bio_corex

A flexible version of CorEx developed for bio-data challenges that handles missing data, continuous/discrete variables, multi-CPU, overlapping structure, and includes visualizations
Apache License 2.0

Heterogeneous data types #14

Open buhrmann opened 5 years ago

buhrmann commented 5 years ago

Hi, in one of your papers it is mentioned that in principle CorEx works with heterogeneous data types, but it seems that the current implementation only works for all continuous or all discrete data matrices. If that's correct, do you plan to support mixed continuous and categorical types in the future?

gregversteeg commented 5 years ago

Hi, sorry for the delay in responding to this. One of the great things about the information-theoretic formulation is that it does make sense to put information about a continuous variable and information about a discrete variable on the same footing. However, you're right that the current implementations don't allow mixing, and I don't have plans to implement that.

If your main interest is mixing continuous and binary variables, then I recommend running CorEx in continuous mode (with the -c option on the command line) and encoding each binary variable with any two values (e.g., 0/1 or -1/+1). The way the marginal probabilities are modeled in this case (as mixtures of Gaussians around each binary value) should be equivalent to modeling them as binary. However, if your categorical variables take more than two values, say X_i = "cat", "dog", "bird", and you encode those as X_i = 0, 1, 2, then you lose some of the meaning of the categorical formulation: under the Gaussian mixture model that we use for the marginals, "2" is closer to "1" than it is to "0", but this is not really true for the original categorical variables.
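
To make the encoding point concrete, here is a minimal NumPy sketch. The toy data, the variable names, and the helper `code_distance` are all hypothetical and only for illustration. It builds a mixed continuous/binary matrix of the kind described above, then shows how integer codes impose an ordering on a three-valued categorical. The last part uses one-hot encoding, which is not something suggested in this thread; it is shown only as a commonly used way to turn a categorical into binary columns, and whether it works well with CorEx is an untested assumption.

```python
import numpy as np

# Hypothetical toy data: one continuous column and one binary column.
rng = np.random.default_rng(0)
continuous = rng.normal(size=6)
binary_labels = np.array(["no", "yes", "yes", "no", "yes", "no"])

# Encode the binary variable with any two values (here 0/1).
binary_encoded = (binary_labels == "yes").astype(float)

# Mixed matrix of shape (n_samples, n_variables); this is the kind of
# input one would hand to CorEx in continuous mode.
X = np.column_stack([continuous, binary_encoded])

# Integer codes for a three-valued categorical impose an ordering.
codes = {"cat": 0, "dog": 1, "bird": 2}

def code_distance(a, b):
    """Distance between two categories under the integer encoding."""
    return abs(codes[a] - codes[b])

# "bird" (2) ends up closer to "dog" (1) than to "cat" (0).
d_bird_dog = code_distance("bird", "dog")
d_bird_cat = code_distance("bird", "cat")

# Assumption (not from the thread): one-hot encoding gives one binary
# column per category, so every pair of categories is equally far apart.
one_hot = np.eye(len(codes))
pairwise = [
    np.linalg.norm(one_hot[i] - one_hot[j])
    for i in range(len(codes))
    for j in range(i + 1, len(codes))
]
```

With the integer codes, the two distances differ (1 versus 2), which is the artificial ordering described above. With the one-hot rows, every pairwise distance is the same, so no category is treated as closer to another.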