probcomp / cgpm

Library of composable generative population models which serve as the modeling and inference backend of BayesDB.
Apache License 2.0

engine.transition and engine.transition_lovecat output qualitatively different crosscat states after the same number of iterations #158

Closed leocasarsa closed 7 years ago

leocasarsa commented 8 years ago

All renderings are saved in: https://probcomp-2.csail.mit.edu:8883/tree/out

cc stands for cgpm-crosscat.

fsaad commented 8 years ago

In addition to the rendering above, can you please provide minimal working examples of the analysis scripts used to generate them? I wonder whether the same qualitative behavior (roughly one view, one large noisy cluster) remains when varying the primitive CGPM used to model the variables between Bernoulli, Categorical, and Normal. There might be an issue in the implementation of the collapsed samplers for one (or both) of the former two.

fsaad commented 8 years ago

Investigating further, it seems to be related to the difference in the hyperparameter grids for Categorical (and Bernoulli) between cgpm-crosscat and lovecat.

In lovecat, the alpha_grid is log-spaced from 1 to len(dataset). https://github.com/probcomp/crosscat/blob/master/cpp_code/src/utils.cpp#L421

In cgpm, the alpha_grid is log-spaced from 1 / len(dataset) to len(dataset). https://github.com/probcomp/cgpm/blob/master/src/primitives/categorical.py#L110 (Categorical) https://github.com/probcomp/cgpm/blob/master/src/primitives/bernoulli.py#L113-L116 (Bernoulli)
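A minimal sketch of the difference (the grid size of 30 and dataset size of 50 are hypothetical, chosen only for illustration): lovecat's grid never goes below 1, while cgpm's grid extends well below 1, admitting the small hyperparameter values discussed below.

```python
import numpy as np

n = 50          # hypothetical len(dataset)
grid_size = 30  # hypothetical number of grid points

# lovecat: log-spaced from 1 to n -- every candidate alpha is >= 1
lovecat_grid = np.exp(np.linspace(np.log(1.0), np.log(n), grid_size))

# cgpm: log-spaced from 1/n to n -- includes candidates far below 1
cgpm_grid = np.exp(np.linspace(np.log(1.0 / n), np.log(n), grid_size))

print(lovecat_grid.min())  # 1.0
print(cgpm_grid.min())     # 0.02
```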

For the Beta prior, small values of alpha and beta make the density U-shaped (a "smile"), concentrating mass near the extreme parameter values of 0 and 1.
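This can be checked directly with the unnormalized Beta density (the hyperparameter values 0.05 and 5.0 are arbitrary examples, not values from either codebase):

```python
import numpy as np

def beta_pdf_unnormalized(x, a, b):
    # unnormalized Beta(a, b) density: x^(a-1) * (1-x)^(b-1)
    return x**(a - 1) * (1 - x)**(b - 1)

x = np.linspace(0.01, 0.99, 99)  # x[49] == 0.5, the midpoint

small = beta_pdf_unnormalized(x, 0.05, 0.05)  # U-shaped "smile"
large = beta_pdf_unnormalized(x, 5.0, 5.0)    # unimodal around 0.5

# with small hyperparameters, the endpoints dominate the middle
print(small[0] > small[49])  # True
print(large[49] > large[0])  # True
```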

leocasarsa commented 8 years ago

Re the minimal working examples, sorry for skipping that message. I can provide you with the code in an hour if you still need it.

fsaad commented 8 years ago

I found the scripts at: https://probcomp-2.csail.mit.edu:8883/notebooks/animals_cc_experiment.ipynb and am reproducing the test cases locally with potential fixes applied.

fsaad commented 7 years ago

Migrating in from Slack

It turns out the original experiments run by @leocasarsa used a Bernoulli component model for cgpm and a Categorical for lovecat (which does not have a Bernoulli component model). The difference in the states ultimately came down to this fact; running inference on the animals dataset with a dirichlet-categorical rather than a beta-bernoulli produces posterior states indistinguishable from lovecat's.

It's due to the way that CrossCat implements the dirichlet-categorical: it forces a symmetric Dirichlet, so the beta-bernoulli sampler (which allows the Beta hyperparameters alpha and beta to be arbitrary, i.e. not equal) is not a special case of the dirichlet-categorical, and the plots produced above compared two samplers with essentially different priors.
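To make the mismatch concrete, here is a sketch (not code from either repository) using the standard closed-form marginal likelihoods: a symmetric Dirichlet over two categories is exactly a Beta(a, a) prior, so the two collapsed samplers agree there, but a Beta prior with alpha != beta has no symmetric-Dirichlet counterpart. The counts and hyperparameter values are arbitrary illustrations.

```python
from math import lgamma

def log_beta_fn(a, b):
    # log of the Beta function, via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_bernoulli_log_marginal(heads, tails, alpha, beta):
    # log marginal likelihood of Bernoulli data under a Beta(alpha, beta) prior
    return log_beta_fn(alpha + heads, beta + tails) - log_beta_fn(alpha, beta)

def dirichlet_categorical_log_marginal(counts, a):
    # log marginal likelihood of categorical data under a symmetric
    # Dirichlet(a, ..., a) prior, as CrossCat's implementation forces
    n, K = sum(counts), len(counts)
    return (lgamma(K * a) - lgamma(K * a + n)
            + sum(lgamma(a + c) - lgamma(a) for c in counts))

heads, tails, a = 7, 3, 0.5

# symmetric case: Dirichlet(a, a) over two categories == Beta(a, a)
sym_beta = beta_bernoulli_log_marginal(heads, tails, a, a)
sym_dir = dirichlet_categorical_log_marginal([heads, tails], a)
print(abs(sym_beta - sym_dir) < 1e-9)  # True: identical priors

# asymmetric case: Beta(2.0, 0.5) cannot be expressed as any
# symmetric Dirichlet, so the marginal likelihood differs
asym_beta = beta_bernoulli_log_marginal(heads, tails, 2.0, 0.5)
print(abs(asym_beta - sym_beta) > 1e-6)  # True: different priors
```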

Rerunning all the experiments on the animals dataset using (i) normal and (ii) categorical component models produces qualitatively similar posterior samples using cgpm and lovecat in both cases.

Here is the dependence probability matrix (left cgpm, right lovecat) using categorical component models with 900 iterations of analysis, and the same row/column ordering:

[image: dependence probability matrices, cgpm (left) and lovecat (right)]

This test case has been committed to https://github.com/probcomp/cgpm/blob/master/tests/graphical/animals.py

The artifacts produced are too large for Github, but we should save the .engine files and plots using Git LFS or something similar.

It would also be worth developing some intuition about why the asymmetric-Dirichlet prior for the categorical results in dpmm-like posterior samples, as opposed to the cross-cutting partitions generated by the symmetric Dirichlet.