probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

Uniformity feature: Support mutual information for cells of observed rows #250

Open axch opened 9 years ago

axch commented 9 years ago

This should be much faster than for unobserved rows, because the cluster assignment in each model is assumed known. The current MI code path I am aware of only computes MI of columns for unobserved (new) rows.

gregory-marton commented 8 years ago

Being a little unclear on the math, I'm not sure what the use case looks like. Probability of dependence estimates the probability that mutual information between columns is nonzero? Similarity measures the mutual information between observed rows? What does this effectively mean for individual observed cells?

riastradh-probcomp commented 8 years ago

The architecture is that we have an infinite exchangeable set of tuples of random variables {(A_r, B_r, C_r)}_r, and we approximate the posterior distribution given certain assignments A_0 = a_0, C_1 = c_1, &c. Currently we can only approximate mutual information for two random variables A_i, B_i in the same row i for which no values have been assigned. This issue is to allow approximating it for two random variables from a row that has been observed.

The architecture more specifically for Crosscat is that there are additional categorical variables {(L_r, M_r, N_r)}_r which we cannot observe, and we cannot even observe how many of them there are. Each Crosscat state is a sample from the distribution on latent variable numbers (views) and assignments (categories). Each model estimator evaluates a Monte Carlo integral (1) over samples of Crosscat states of some function of a single Crosscat state.

In this case, approximating mutual information of variables of an entirely unobserved row from a single Crosscat state means evaluating a Monte Carlo integral (2) over samples of category assignments of some mutual information estimator (itself a Monte Carlo integral (3) over samples of the posterior predictive distribution on the variables given the category assignments).

What @axch proposes is to implement approximation of mutual information of variables of an observed row from a single Crosscat state. In that implementation the Monte Carlo integral (2) is replaced by a single evaluation of (3), given the fixed category assignments of that observed row in that Crosscat state, rather than a Monte Carlo integral over samples of category assignments of evaluations of (3).
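
To make the difference between (2) and (3) concrete, here is a minimal sketch of the two code paths for a single state. It is not bayeslite or Crosscat code: the function names are hypothetical, and the per-category model (a standard bivariate normal with correlation `rho`) is a toy assumption chosen only so that the inner estimator has a known closed-form answer to compare against.

```python
import math
import random


def mi_given_category(rho, n_samples, rng):
    """Integral (3): Monte Carlo estimate of I(A; B) for one fixed category.

    Toy assumption: within a category, (A, B) is a standard bivariate
    normal with correlation rho (|rho| < 1). This is NOT Crosscat's
    component model; it only stands in for "the posterior predictive
    given the category assignments".
    """
    s = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(n_samples):
        # Draw (a, b) from the joint distribution for this category.
        a = rng.gauss(0.0, 1.0)
        b = rho * a + s * rng.gauss(0.0, 1.0)
        log_joint = (-math.log(2.0 * math.pi) - math.log(s)
                     - (a * a - 2.0 * rho * a * b + b * b) / (2.0 * s * s))
        log_marginals = -math.log(2.0 * math.pi) - 0.5 * (a * a + b * b)
        # Average of log p(a, b) - log p(a) - log p(b) estimates the MI.
        total += log_joint - log_marginals
    return total / n_samples


def mi_unobserved_row(weights, rhos, n_assignments, n_samples, rng):
    """Integral (2): average (3) over sampled category assignments."""
    categories = rng.choices(range(len(rhos)), weights=weights,
                             k=n_assignments)
    estimates = [mi_given_category(rhos[k], n_samples, rng)
                 for k in categories]
    return sum(estimates) / n_assignments


def mi_observed_row(rhos, category, n_samples, rng):
    """Proposed path: one evaluation of (3) at the row's known category."""
    return mi_given_category(rhos[category], n_samples, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    weights, rhos = [0.5, 0.5], [0.0, 0.8]  # hypothetical two-category state
    print("unobserved row:", mi_unobserved_row(weights, rhos, 400, 1000, rng))
    print("observed row:  ", mi_observed_row(rhos, 1, 1000, rng))
```

In this sketch the observed-row path draws `n_samples` points once, while the unobserved-row path draws `n_assignments * n_samples`, which is where the speedup expected in the opening comment would come from. Integral (1) would then average either function over the sampled states.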