pmelchior / pygmmis

Gaussian mixture model for incomplete (missing or truncated) and noisy data
MIT License

Early termination with truncated samples #11

Closed: philastrophist closed this issue 3 years ago

philastrophist commented 5 years ago

When running some tests on simple models (2 dimensions and 3 or 4 components), I find that fitting without the selection function (i.e. sel_callback=None) performs better on these models than fitting with it (i.e. sel_callback=selection_function).

I've created a gist with my script here
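
For reference, a stripped-down sketch of the kind of comparison the gist runs. The truth parameters, the x > 0 cut, and the sample size below are made up for illustration; the GMM/draw/fit calls follow my reading of the pygmmis README and may need tweaking:

```python
import numpy as np
import pygmmis

rng = np.random.RandomState(42)

# made-up 2D truth with 3 components, standing in for the gist's setup
truth = pygmmis.GMM(K=3, D=2)
truth.amp[:] = [0.4, 0.35, 0.25]
truth.mean[:] = [[0., 0.], [3., 2.], [-2., 3.]]
truth.covar[:] = [np.eye(2) * 0.5] * 3

def selection_function(coords):
    # observed only where the first coordinate is positive (illustrative truncation)
    return coords[:, 0] > 0

samples = truth.draw(size=2000, rng=rng)
data = samples[selection_function(samples)]   # keep only the observed points

# fit ignoring the truncation
gmm_plain = pygmmis.GMM(K=3, D=2)
logL_plain, _ = pygmmis.fit(gmm_plain, data, init_method='kmeans',
                            sel_callback=None, rng=rng)

# fit with the selection function supplied
gmm_sel = pygmmis.GMM(K=3, D=2)
logL_sel, _ = pygmmis.fit(gmm_sel, data, init_method='kmeans',
                          sel_callback=selection_function, rng=rng)

print("logL without selection handling:", logL_plain)
print("logL with selection handling:   ", logL_sel)
```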

The likelihood seems to decrease immediately when moving away from the k-means estimate. However, if I disable convergence detection and let the runs continue indefinitely, some of the split-merge runs do converge. In fact, log_L jumps around quite a bit before finally converging (sometimes).

The figures below show the problem. The ellipses are 1-sigma; observed data is in green, unobserved data in red. Red ellipses are the k-means estimate, black ellipses are the pygmmis estimate, and blue ellipses are the truth.

[Figure: with tolerance (pygmmis)]

[Figure: without tolerance (pygmmis-converged)]

[Figure: log-likelihood (log_like)]

We need to detect convergence better, otherwise the fit will get stuck in a local minimum. Maybe a t-test for flat lines with tolerances?
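
Something like the following plateau check is what I have in mind (just a sketch; the window, tolerance, and significance level are arbitrary):

```python
import numpy as np
from scipy import stats

def plateau_converged(logL_history, window=20, tol=1e-4, alpha=0.05):
    """Declare convergence when the recent log-likelihood trace is flat.

    Flat means either the spread over the last `window` iterations is below
    `tol`, or a straight line fitted through them has no statistically
    significant slope (two-sided p-value above `alpha`).
    """
    if len(logL_history) < window:
        return False
    recent = np.asarray(logL_history[-window:])
    if recent.max() - recent.min() < tol:
        return True
    fit = stats.linregress(np.arange(window), recent)
    return fit.pvalue > alpha
```

The point is to stop only on a sustained plateau rather than on a single small step, so noisy split-merge runs are not cut off prematurely.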

pmelchior commented 3 years ago

OK, apologies that it took a while to get back to this issue. Life...

Thanks for your clear report; I can confirm it. However, PR #12 is only a workaround, not the actual solution, so let me address the problem here.

The problem, as you have determined yourself, is not in the EM equations themselves (otherwise the solution wouldn't have converged to something meaningful) but in the reported likelihood, which is used for convergence testing.

In a completely observed GMM, the likelihood of a sample x is

p(x | GMM) = 1 / Z sum_k alpha_k p_k(x)

with Z = int dx sum_k alpha_k p_k(x). Since the p_k are normalized and the alphas sum to 1, Z = 1 and this normalization constant is dropped. But for incomplete samples, we cannot do that. The process that leads to the observed samples is

p (x | GMM, Omega) = 1 / Z' Omega(x) sum_k alpha_k p_k(x)

with Z' = int dx Omega(x) sum_k alpha_k p_k(x). As Omega <= 1, Z' <= 1, so this normalization boosts the value of logL. Z' tells us the fraction of the probability mass that is observed. It can be computed very efficiently with Monte Carlo integration. My last commit 0e33ac6 does exactly that. With it, the drop in logL doesn't arise and #12 is not necessary.
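
For illustration, the Monte Carlo estimate of Z' amounts to something like the sketch below. It assumes the GMM's draw method and a selection function returning booleans or probabilities in [0, 1]; commit 0e33ac6 is the actual implementation:

```python
import numpy as np

def observed_fraction(gmm, selection_function, n_samples=100000, rng=np.random):
    """Monte Carlo estimate of Z' = int dx Omega(x) sum_k alpha_k p_k(x).

    Since the mixture integrates to 1, Z' is the expectation of Omega under
    the mixture: average Omega over samples drawn from the GMM itself.
    """
    samples = gmm.draw(size=n_samples, rng=rng)  # assumes pygmmis.GMM.draw
    return np.mean(selection_function(samples))

# The corrected log-likelihood per observed sample then gains a -log(Z') term,
# which is >= 0 because Z' <= 1:
# logL_corrected = logL_reported - np.log(observed_fraction(gmm, selection_function))
```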