Open axch opened 9 years ago
Being a little unclear on the math, I'm not sure what the use case looks like. Probability of dependence estimates the probability that mutual information between columns is nonzero? Similarity measures the mutual information between observed rows? What does this effectively mean for individual observed cells?
The architecture is that we have an infinite exchangeable set of tuples of random variables {(A_r, B_r, C_r)}_r, and we approximate the posterior distribution given certain assignments A_0 = a_0, C_1 = c_1, &c. Currently we can only approximate mutual information for two random variables A_i, B_i in the same row i for which no values have been assigned. This issue is to allow approximating it for two random variables from a row that has been observed.
The architecture more specifically for Crosscat is that there are additional categorical variables {(L_r, M_r, N_r)}_r which we cannot observe, and we cannot even observe how many of them there are. Each Crosscat state is a sample from the distribution on the number of latent variables (views) and on their assignments (categories). Each model estimator evaluates a Monte Carlo integral (1) over samples of Crosscat states of some function of a single Crosscat state.
In this case, approximating mutual information of variables of an entirely unobserved row from a single Crosscat state means evaluating a Monte Carlo integral (2) over samples of category assignments of some mutual information estimator (itself a Monte Carlo integral (3) over samples of the posterior predictive distribution on the variables given the category assignments).
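To make the nesting concrete, here is one way to write the three integrals. The notation is mine and only a sketch: θ_s denotes a Crosscat state, z_k a sampled set of category assignments for the row, and the log-ratio form of (3) is an assumed, standard Monte Carlo mutual information estimator, not necessarily the one in the code.

```latex
% (1) average over S sampled Crosscat states \theta_s
\hat{I}(A;B) = \frac{1}{S} \sum_{s=1}^{S} f(\theta_s)

% (2) within one state, average over K sampled category assignments z_k
f(\theta_s) \approx \frac{1}{K} \sum_{k=1}^{K} g(\theta_s, z_k),
  \qquad z_k \sim p(z \mid \theta_s)

% (3) within one state and one assignment, average over N predictive samples
g(\theta, z) \approx \frac{1}{N} \sum_{n=1}^{N}
  \log \frac{p(a_n, b_n \mid z, \theta)}{p(a_n \mid z, \theta)\, p(b_n \mid z, \theta)},
  \qquad (a_n, b_n) \sim p(a, b \mid z, \theta)
```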
What @axch proposes is to implement approximation of mutual information of variables of an observed row from a single Crosscat state. In that implementation, the Monte Carlo integral (2) over samples of category assignments is replaced by a single evaluation of (3), given the fixed category assignments of that observed row in that Crosscat state.
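A toy sketch of the difference, not the Crosscat or bayeslite code. It assumes a single state in which each category makes (A, B) a standard bivariate normal with its own correlation `rho`; the function names, `weights`, and `rhos` are all hypothetical. The unobserved-row path averages (3) over sampled categories, while the observed-row path evaluates (3) once at the row's known category.

```python
import numpy as np


def mi_given_category(rho, n_samples, rng):
    """Integral (3): Monte Carlo estimate of MI(A; B) within one category.

    Toy assumption: given the category, (A, B) is a standard bivariate
    normal with correlation rho (requires |rho| < 1).
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    ab = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
    a, b = ab[:, 0], ab[:, 1]
    one_minus_r2 = 1.0 - rho ** 2
    # log p(a, b) for the joint, and log p(a) + log p(b) for the marginals.
    log_joint = (-np.log(2 * np.pi) - 0.5 * np.log(one_minus_r2)
                 - (a ** 2 - 2 * rho * a * b + b ** 2) / (2 * one_minus_r2))
    log_marginals = -np.log(2 * np.pi) - 0.5 * (a ** 2 + b ** 2)
    return float(np.mean(log_joint - log_marginals))


def mi_unobserved_row(weights, rhos, n_categories, n_samples, rng):
    """Integral (2): average (3) over sampled category assignments."""
    ks = rng.choice(len(weights), size=n_categories, p=weights)
    return float(np.mean([mi_given_category(rhos[k], n_samples, rng)
                          for k in ks]))


def mi_observed_row(category, rhos, n_samples, rng):
    """Proposed path: one evaluation of (3) at the row's fixed category."""
    return mi_given_category(rhos[category], n_samples, rng)
```

For example, with `rng = np.random.default_rng(0)` and `rhos = [0.0, 0.8]`, `mi_observed_row(1, rhos, 20000, rng)` makes one call to `mi_given_category`, whereas `mi_unobserved_row([0.5, 0.5], rhos, 20, 5000, rng)` makes twenty. In this toy the within-category value can be checked against the closed form -0.5 * log(1 - rho^2).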
This should be much faster than for unobserved rows, because the cluster assignment in each model is assumed known. The current MI code path I am aware of only computes MI of columns for unobserved (new) rows.