Explore adding split/merge moves during inference

Here are a bunch of not particularly organized notes I had lying around about this...

Existing approaches to Split Merge in the literature:

A split-merge algorithm is made up of several components:

[[Choice of component(s) to split or merge in MCMC]]
[[Allocation strategy for datapoints after a split in split-merge MCMC]]
[[Strategy for setting new component parameters when proposing a split or merge in MCMC]]

Some examples of existing approaches:

Jain and Neal: Split-Merge for DPs (Paper)
- Choose components by uniformly selecting two chaperones.
- Allocate datapoints via [[Intermediate scans for allocation of datapoints in split-merge split proposals]].
- Conjugacy is assumed, so new component parameters are marginalized out. (A version exists for conditionally conjugate priors, which handles parameters during the intermediate scans.)
The Chaperones Algorithm
- Choose components by selecting two chaperones, perhaps based on a similarity measure.
- Allocate datapoints via sequential allocation—including of the two chaperones. (They may wind up in the same or different clusters, no matter how they started.)
- Conjugacy assumed (Dirichlet-Categorical, and Dirichlet parameters are marginalized out).
Richardson and Green: Reversible-Jump Split-Merge
- Choose components by flipping a coin to decide between split and merge, then uniformly choosing 1 or 2 distinct components.
- Because parameters are explicit, allocation decisions are mutually independent, conditioned on parameters. When proposing a split, allocation is performed using weighted Bernoulli draws with probabilities computed according to the parameters of each component.
- Parameters are proposed by altering existing components' parameters: e.g., averaging during a merge, and adding or subtracting a random quantity to or from each component during a split.
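
A minimal sketch of the Richardson and Green style allocation step, assuming a 1-D Gaussian mixture. The function names and the (weight, mean, std) tuple layout are made up for illustration and are not taken from the paper or from PClean. The point is that, with explicit parameters, each datapoint is allocated by an independent weighted Bernoulli draw, and the log proposal probability is just the sum of the per-point terms.

```python
import math
import random


def normal_logpdf(x, mean, std):
    """Log density of a 1-D normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def allocate_after_split(xs, comp_a, comp_b, rng):
    """Allocate each x to one of two proposed components.

    comp_a and comp_b are hypothetical (weight, mean, std) tuples.
    Returns (labels, log_q), where labels[k] is 0 for comp_a and 1 for
    comp_b, and log_q is the log probability of the sampled allocation.
    """
    labels = []
    log_q = 0.0
    for x in xs:
        la = math.log(comp_a[0]) + normal_logpdf(x, comp_a[1], comp_a[2])
        lb = math.log(comp_b[0]) + normal_logpdf(x, comp_b[1], comp_b[2])
        m = max(la, lb)  # subtract the max for numerical stability
        p_a = math.exp(la - m) / (math.exp(la - m) + math.exp(lb - m))
        to_a = rng.random() < p_a
        labels.append(0 if to_a else 1)
        log_q += math.log(p_a if to_a else 1.0 - p_a)
    return labels, log_q
```

The returned log_q is what would enter the reversible-jump acceptance ratio alongside the parameter-proposal terms, which this sketch leaves out.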

Choice of component(s) to split or merge in MCMC

[[Existing approaches to split-merge in the literature]] typically use one of the following strategies for deciding whether to split or merge, and if so, which components to target:

Choose two members ("anchors" or "chaperones") uniformly at random. If the two members are in the same component, we propose a split. If the two members are in different components, we propose a merge.
Choose two members ("anchors" or "chaperones") from a custom distribution based on similarity. This distribution must not depend on the current partition of members into components, nor may it depend on component parameters (except perhaps via summary statistics that the split/merge proposal is guaranteed to conserve).
Choose one or two components, based on the result of a coin flip. This is the typical reversible-jump split-merge algorithm. One may employ smart, state-dependent strategies for deciding which component(s) to select, but it may be necessary to use a [[Smart-Dumb Dumb-Smart strategy]] to maintain a reasonable acceptance rate.
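
The first strategy above can be sketched as follows. This is only an illustration: `assignments` is a hypothetical list mapping each member index to its current component label.

```python
import random


def choose_move(assignments, rng):
    """Pick two distinct members uniformly and decide the move type.

    Returns ("split", i, j) if the anchors share a component and
    ("merge", i, j) otherwise.
    """
    i, j = rng.sample(range(len(assignments)), 2)
    if assignments[i] == assignments[j]:
        return ("split", i, j)
    return ("merge", i, j)
```

Because the anchors are drawn without looking at the partition, the anchor-selection probability is the same in the forward and reverse moves and cancels in the acceptance ratio.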

Schemes based on "chaperones" or "anchors" must decide, as part of their Allocation strategy for datapoints after a split in split-merge MCMC, whether to require that the two chaperones be proposed as belonging to distinct components. Problem: We want to choose anchor / chaperone objects that are "close but not that close"; these are the ones that the existing algorithm likely has trouble with. If there were a measure of distance that only depended on the non-submodel nodes, this would be valid (but would cost O(n^2) to evaluate all the distances). User-defined criteria for "closer inspection" could be an interesting way around this.

Allocation strategy for datapoints after a split in split-merge MCMC

There are at least two strategies in the literature:

Sequential allocation of datapoints after a split in split-merge MCMC

Intermediate scans for allocation of datapoints in split-merge split proposals

When proposing a split in a split-merge MCMC algorithm, it may be necessary to propose a data association: for each observation previously associated with the latent entity being split, which new entity should it be associated with? One strategy was described by this paper: Dirichlet-Process Split-Merge.

In this paper, a split is proposed when two randomly chosen observations lie in the same component. The basic "randomized split" algorithm then forces the two observations into two distinct components in the proposal, before assigning all the other elements uniformly at random. This algorithm has the obvious downside that good splits will rarely be proposed.
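
A rough sketch of the randomized split, with hypothetical names (`members` is the list of observations in the component being split, `i` and `j` are the two anchors). Each of the remaining observations is assigned to one of the two new components by a fair coin flip, so the proposal probability is 2^-(n-2) for a component with n members.

```python
import math
import random


def propose_random_split(members, i, j, rng):
    """Force anchors i and j apart, then assign the rest by fair coin flips.

    Returns (labels, log_q): labels maps each member to 0 (the component
    of i) or 1 (the component of j); log_q is the log proposal probability.
    """
    labels = {i: 0, j: 1}
    for k in members:
        if k not in labels:
            labels[k] = rng.randrange(2)
    log_q = -(len(members) - 2) * math.log(2.0)
    return labels, log_q
```

With n members there are 2^(n-2) equally likely splits, which is one way to see why a good split is rarely proposed once n is large.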

Fixed launch state argument. Jain and Neal begin by describing a strategy in which, after i and j are put in distinct components, the rest of the observations in S (the other observations that share a component with i or j) are allocated to the two components using some predetermined (fixed) strategy, yielding c^{launch}. Then a restricted Gibbs scan is performed. Interestingly, this restricted Gibbs scan has its probabilities computed as part of the proposal distribution. This suggests that it needn't really be a Gibbs scan, but can just be a scan of "relatively good proposals." (In PClean, I think all these Gibbs proposals would happen based on 'surrogate' enumeration; this would all be part of the "smart" proposal.) For reasons I don't understand, the Gibbs sampling cannot modify the assignments of data points i and j. This is discussed as a weakness in Section 5, but I don't see why it's necessary. Perhaps it's in order to "break the symmetry": a particular split should only be possible to arrive at in one way. We use "the component that i is in" or "the component that j is in" as the 'names' of the two new components, which is useful in constructing a reverse move.
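
To make the restricted Gibbs scan concrete, here is a small sketch under strong simplifying assumptions: binary observations with a conjugate Beta-Bernoulli likelihood, so that component parameters are marginalized out. All names (`predictive`, `restricted_gibbs_scan`, the `launch` dictionary) are hypothetical. The scan only revisits non-anchor points, only chooses between the two new components, and accumulates the log probability of the choices it makes, since that probability is part of the proposal density.

```python
import math
import random


def predictive(x, members, data, alpha=1.0, beta=1.0):
    """Beta-Bernoulli posterior predictive of x given a cluster's members."""
    ones = sum(data[m] for m in members)
    p_one = (ones + alpha) / (len(members) + alpha + beta)
    return p_one if x == 1 else 1.0 - p_one


def restricted_gibbs_scan(data, i, j, launch, rng):
    """One restricted Gibbs scan starting from a launch state.

    launch maps each point to 0 (the component of anchor i) or 1 (the
    component of anchor j). The anchors themselves are never moved.
    Returns (labels, log_q), with log_q the log probability of the scan.
    """
    labels = dict(launch)
    log_q = 0.0
    for k in labels:
        if k in (i, j):
            continue
        # Members of each component, excluding the point being resampled.
        groups = [[m for m in labels if labels[m] == c and m != k] for c in (0, 1)]
        # Weight = component size times predictive likelihood of data[k].
        w = [len(g) * predictive(data[k], g, data) for g in groups]
        p0 = w[0] / (w[0] + w[1])
        labels[k] = 0 if rng.random() < p0 else 1
        log_q += math.log(p0 if labels[k] == 0 else 1.0 - p0)
    return labels, log_q
```

For the reverse (merge) move, the same loop would be run with the assignments forced to the existing split rather than sampled, so that log_q can still be computed.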

Random launch state. A "uniform" random launch state is permissible because it can be thought of as part of a state-independent "move selection." The paper argues that it is OK to replace the distribution of the launch state with several scans of restricted Gibbs—it claims this is also a "random launch state." (They claim that in computing the reverse move for a merge, it is necessary to do these n-1 scans -- i.e., actually sample them -- in order to generate a launch state.)

Note for PClean applicability: In a simple DPMM, the restricted Gibbs scan depends not at all on the rest of the model state. But in PClean, we do presumably depend on some of the rest of the state. But maybe it doesn't matter because that part of the state does not change at all during the running of the algorithm.
Can this be understood in terms of involutive MCMC? Probably.

The algorithm proceeds via random initialization, followed by restricted Gibbs scans. Then a final restricted Gibbs scan is done as "part of the proposal."

In a PClean context, re-marginalizing for each proposed combination will be a necessity for this to work. It would be useful to think through whether this is possible to do efficiently. Can the results of enumeration be incrementally updated?

There does appear to be a non-conjugate version, but it’s actually only for conditionally conjugate models:

Selecting "anchor" or "chaperone" reference slots

One thing we can do is choose at random from the keys that have the same submodel observations because with different observations, it will never be possible to merge. This is OK because the observations can't change via a split/merge move.
Another possibility is to develop a smart split / dumb merge kernel. (Good merges will never be accepted, but…) The smart split could be to choose objects with particularly low average ExternalLikelihoods. The dumb merge could be random.
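
The first idea above (restricting anchor choice to keys with identical submodel observations) might look roughly like this. The `observations` dictionary, mapping each key to a hashable tuple of its observed values, is an assumed representation and not PClean's actual data structure.

```python
import random
from collections import defaultdict


def groups_with_shared_observations(observations):
    """Group keys by their observation tuple; keep groups of size >= 2."""
    groups = defaultdict(list)
    for key, obs in observations.items():
        groups[obs].append(key)
    return [g for g in groups.values() if len(g) >= 2]


def choose_anchors(observations, rng):
    """Pick a pair of keys uniformly among all pairs with equal observations.

    Returns a 2-tuple of keys, or None if no two keys share observations.
    """
    groups = groups_with_shared_observations(observations)
    if not groups:
        return None
    # Weight each group by its number of unordered pairs, so that every
    # eligible pair is equally likely overall.
    weights = [len(g) * (len(g) - 1) // 2 for g in groups]
    group = rng.choices(groups, weights=weights, k=1)[0]
    return tuple(rng.sample(group, 2))
```

Since the observations do not change under a split/merge move, this selection distribution is identical before and after the move.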

Implementation questions

Can we measure how well average ExternalLikelihood detects good splits?

Can we invent a generic similarity metric, based on reference slot distribution?

“How well does the single best object (from enumeration) explain these two cells, on average?”
- Note that we want “similar but not too similar,” which might be hard to do; i.e., how do we tune it? We could fit a 3-component log-normal mixture... and choose from the middle component?

probcomp / PClean