Closed philastrophist closed 3 years ago
OK, apologies that it took a while to get back to this issue. Life...
Thanks for your clear report, I can confirm it. However, PR #12 is only a workaround, not the actual solution. So let me address the problem here.
The problem, as you have determined yourself, is not in the EM equations themselves (otherwise the solution wouldn't have converged to anything meaningful), but in the reported likelihood, which is used for convergence testing.
In a completely observed GMM, the likelihood of a sample x is

p(x | GMM) = 1/Z sum_k alpha_k p_k(x)

with Z = int dx sum_k alpha_k p_k(x). Since the p_k are normalized and the alphas sum to 1, Z = 1 and this normalization constant is dropped. But for incomplete samples, we cannot do that. The process that led to the observed samples is
p(x | GMM, Omega) = 1/Z' Omega(x) sum_k alpha_k p_k(x)

with Z' = int dx Omega(x) sum_k alpha_k p_k(x). As Omega(x) <= 1, we have Z' <= 1, so dividing by this normalization term boosts the value of logL. Z' tells us the fraction of the probability mass that is observed, and it can be computed very efficiently with Monte Carlo integration. My last commit 0e33ac6 does exactly that. With it, the drop in logL doesn't arise and #12 is not necessary.
When running some tests on simple models (2 dimensions and 3/4 components) I find that the untruncated fit (i.e. `sel_callback=None`) performs better on these models than when I give pygmmis the selection function (i.e. `sel_callback=selection_function`). I've created a gist with my script here.
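The test setup can be reproduced with a short synthetic-data sketch. All parameters and the selection cut below are hypothetical stand-ins (the actual ones are in the gist); this only shows the shape of the experiment: draw from a 2-D, 3-component mixture and split the draws by a selection function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-D, 3-component mixture (real parameters are in the gist):
alpha = np.array([0.4, 0.35, 0.25])
means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
covs  = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])

def draw(n):
    """Draw n points from the full mixture."""
    k = rng.choice(len(alpha), size=n, p=alpha)
    return np.array([rng.multivariate_normal(means[i], covs[i]) for i in k])

def selection_function(coords):
    """Toy selection: keep points with x-coordinate above a cut."""
    return coords[:, 0] > -1.0

data = draw(1000)
mask = selection_function(data)
observed = data[mask]       # the green points in the figures below
unobserved = data[~mask]    # the red points
```

The observed subset would then be fit twice, once with `sel_callback=None` and once with `sel_callback=selection_function`, to compare the recovered components against the truth.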
The likelihood seems to decrease immediately when moving away from the k-means estimate. However, if I disable convergence detection and let the runs go on forever, then we see convergence for some split-merge runs. In fact, the `log_L` jumps around quite a bit before finally converging (sometimes).

The figures below show my problem; the ellipses are 1-sigma. Observed data is in green, unobserved data in red. Red ellipses are the k-means estimate, black ellipses are the `pygmmis` estimate, and blue ellipses are the truth.

With tolerance:
![pygmmis](https://user-images.githubusercontent.com/14169057/51193436-c4806300-18e0-11e9-8050-d51dba6a9145.png)

Without tolerance:
![pygmmis-converged](https://user-images.githubusercontent.com/14169057/51193276-6489bc80-18e0-11e9-8166-e83dadec74a8.png)

Log-likelihood:
We need to detect convergence better, otherwise it'll get stuck in a local optimum. Maybe a t-test for flat lines with tolerances?
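A flat-line test along those lines could be sketched like this. The helper below is hypothetical (not part of pygmmis): it regresses the last few logL values on iteration number and declares convergence only when the slope is not significantly different from zero, so a noisy-but-flat trace converges while a trending one keeps going. The `window` and `alpha` values are arbitrary placeholders for the tolerances mentioned above.

```python
import numpy as np
from scipy import stats

def is_converged(logL_history, window=20, alpha=0.05):
    """Hypothetical flat-line convergence test.

    Fit a line to the last `window` logL values; the t-test p-value
    from the regression tests H0: slope == 0. A large p-value means
    the trace is statistically flat, i.e. converged.
    """
    if len(logL_history) < window:
        return False           # not enough history to judge
    y = np.asarray(logL_history[-window:], dtype=float)
    x = np.arange(window, dtype=float)
    result = stats.linregress(x, y)
    return result.pvalue > alpha
```

This is more robust to the jumping-around behaviour than a simple per-step tolerance, because a single large jump inside an otherwise trending window keeps the slope significant and the run alive.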