probcomp / PClean

A domain-specific probabilistic programming language for scalable Bayesian data cleaning
216 stars 32 forks source link

Explore adding split/merge moves during inference #6

Open alex-lew opened 3 years ago

alex-lew commented 3 years ago

Here are a bunch of not particularly organized notes I had lying around about this...

Existing approaches to Split Merge in the literature:

A split-merge algorithm is made up of several components:

Some examples of existing approaches:

Choice of component(s) to split or merge in MCMC

[[Existing approaches to split-merge in the literature]] typically use one of the following strategies for deciding whether to split or merge, and if so, which components to target:

Schemes based on "chaperones" or "anchors" must contend with the question of, in their Allocation strategy for datapoints after a split in split-merge MCMC, whether to require that the two chaperones be proposed as belonging to distinct components. Problem: We want to choose anchor / chaperone objects that "close but not that close”; these are the ones that the existing algorithm likely has trouble with. If there were a measure of distance that only depended on the non-submodel nodes, this would be valid (but would cost O(n^2) to evaluate all the distances). User-defined criteria for "closer inspection" could be an interesting way around this.

Allocation strategy for datapoints after a split in split-merge MCMC

There are at least two strategies in the literature:

Sequential allocation of datapoints after a split in split-merge MCMC
Intermediate scans for allocation of datapoints in split-merge split proposals

When proposing a split in a split-merge MCMC algorithm, it may be necessary to propose a data association: for each observation previously associated with the latent entity being split, which new entitiy should it be associated with? One strategy was described by this paper: Dirichlet-Process Split-Merge.

In this paper, a split is proposed when two randomly chosen observations lie in the same component. The basic "randomized split” algorithm then forces the two observations into two distinct components in the proposal, before assigning all the other elements. This algorithm has the obvious downside that good splits will be rare to propose.

Fixed launch state argument. Jain and Neal begin by describing a strategy in which, after i and j are put in distinct components, the rest of the observations in S are allocated to the two components using some predetermined (fixed) strategy, yielding c^{launch}. Then a Restricted Gibbs scan is performed. Interestingly, this restricted Gibbs scan has its probabilities computed as part of the proposal distribution. This suggests that it needn't really be a Gibbs scan, but can just be a scan of “relatively good proposals." (In PClean, I think all these Gibbs proposals would happen based on 'surrogate' enumeration; this would all be part of the "smart" proposal.) For reasons I don't understand, the Gibbs sampling cannot modify the assignments of data points i and j. This is discussed as a weakness in Section 5, but I don't see why it's necessary. Perhaps it’s in order to “break the symmetry”: a particular split should only be possible to arrive at in one way. We use "the component that i is in" or "the component that j is in" as the 'names' of the two new components—useful in constructing a reverse move.

Random launch state. A "uniform" random launch state is permissible because it can be thought of as part of a state-independent "move selection." The paper argues that it is OK to replace the distribution of the launch state with several scans of restricted Gibbs—it claims this is also a "random launch state." (They claim that in computing the reverse move for a merge, it is necessary to do these n-1 scans -- i.e., actually sample them -- in order to generate a launch state.)

The algorithm proceeds via random initialization, followed by restricted Gibbs scans. Then a final restricted Gibbs scan is done as "part of the proposal."

In a PClean context, re-marginalizing for each proposed combination will be a necessity for this to work. It would be useful to think through whether this is possible to do efficiently. Can the results of enumeration be incrementally updated?

There does appear to be a non-conjugate version, but it’s actually only for conditionally conjugate models:

Selecting "anchor" or "chaperone" reference slots

Implementation questions

Can we measure how well average ExternalLikelihood detects good splits?

Can we invent a generic similarity metric, based on reference slot distribution?