pmelchior / pygmmis

Gaussian mixture model for incomplete (missing or truncated) and noisy data
MIT License
98 stars 22 forks

Cutoff: explanation/documentation #9

Open philastrophist opened 5 years ago

philastrophist commented 5 years ago

While testing pygmmis, I found that varying the cutoff argument drastically changes the end result of the fit, even with split-and-merge turned on (and exhaustive).

My understanding of EM is that the responsibilities r_ik are calculated for all data and all components. Why, then, does pygmmis use a cutoff to fit only to those data in the neighbourhood of each component? As far as I can understand, cutoff!=inf simply means that some data will be labelled as not belonging to any component.

Is the reason something to do with the background or is it just to avoid outliers?

Thanks

P.S. This code is very cool!

pmelchior commented 5 years ago

The cutoff argument is meant to allow for speed-ups in cases of many components. It is unlikely that you will need all components for every sample, so setting e.g. cutoff=3 doesn't even attempt to fit samples outside of the 3-sigma region of a component. This works very well for data that are spread out a lot, and it also helps break degeneracies for many strongly overlapping components.
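To make the effect of the cutoff concrete, here is a minimal NumPy sketch of an E-step in which each component only sees samples within `cutoff` sigma (Mahalanobis distance) of its mean. It is an illustration of the idea described above, not the pygmmis implementation; the function name `responsibilities`, its signature, and the toy data are assumptions made for this example.

```python
import numpy as np

def responsibilities(X, weights, means, covs, cutoff=None):
    """Illustrative E-step (not pygmmis code): responsibilities r_ik, where
    component k only considers samples within `cutoff` sigma of its mean."""
    N, D = X.shape
    K = len(weights)
    log_p = np.full((N, K), -np.inf)  # -inf marks "sample not seen by component"
    for k in range(K):
        diff = X - means[k]
        # squared Mahalanobis distance of every sample to component k
        chi2 = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(covs[k]), diff)
        _, logdet = np.linalg.slogdet(covs[k])
        log_pk = np.log(weights[k]) - 0.5 * (chi2 + logdet + D * np.log(2 * np.pi))
        if cutoff is None or np.isinf(cutoff):
            inside = np.ones(N, dtype=bool)   # standard EM: use every sample
        else:
            inside = chi2 < cutoff ** 2       # keep only the component's neighbourhood
        log_p[inside, k] = log_pk[inside]
    # normalize over components; samples outside every neighbourhood keep r = 0
    log_norm = np.logaddexp.reduce(log_p, axis=1)
    covered = np.isfinite(log_norm)
    r = np.zeros((N, K))
    r[covered] = np.exp(log_p[covered] - log_norm[covered, None])
    return r, covered

# Toy data: two well-separated clusters plus one point far from both.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(20, 1, (200, 2)),
               [[10.0, 10.0]]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [20.0, 20.0]])
covs = np.array([np.eye(2), np.eye(2)])

r_all, covered_all = responsibilities(X, weights, means, covs)            # no cutoff
r_cut, covered_cut = responsibilities(X, weights, means, covs, cutoff=3)  # 3-sigma cutoff
```

Without a cutoff every sample is shared among the components, including the isolated point. With `cutoff=3` that point lies outside both neighbourhoods, so it gets zero responsibility everywhere and does not influence the fit, which is one way a finite cutoff can change the result.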

I realize that I should document this parameter better; you're not the first person to ask.

philastrophist commented 5 years ago

Ah ok, that makes sense. cutoff=None raises errors though, so I guess for now it's easier to just set cutoff=inf for my purposes.

pmelchior commented 5 years ago

There shouldn't be errors with cutoff=None. Can you post the error and the traceback, please?

philastrophist commented 5 years ago

It's an AttributeError, raised when trying to copy a None:

ITER    SAMPLES LOG_L   STABLE
0   5000    -2.383  3
Traceback (most recent call last):
  File "/local/home/sread/Apps/anaconda/envs/pymc3-uptodate/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-f61dddb7d343>", line 4, in <module>
    runfile('/local/home/sread/Dropbox/pygmmis/models.py', wdir='/local/home/sread/Dropbox/pygmmis')
  File "/local/home/sread/Apps/jetbrains-toolbox-1.4.2492/install_location/apps/PyCharm-P/ch-0/182.4129.5/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/local/home/sread/Apps/jetbrains-toolbox-1.4.2492/install_location/apps/PyCharm-P/ch-0/182.4129.5/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/local/home/sread/Dropbox/pygmmis/models.py", line 122, in <module>
    split_n_merge=gmm.K * (gmm.K - 1) * (gmm.K - 2) / 2)
  File "/local/home/sread/Dropbox/pygmmis/pygmmis.py", line 689, in fit
    U_ = [U[k].copy() for k in xrange(gmm.K)]
  File "/local/home/sread/Dropbox/pygmmis/pygmmis.py", line 689, in <listcomp>
    U_ = [U[k].copy() for k in xrange(gmm.K)]
AttributeError: 'NoneType' object has no attribute 'copy'

pmelchior commented 5 years ago

Can you post the call to pygmmis.fit as well, please?

philastrophist commented 5 years ago

Sure. It is here:

logL, U = pygmmis.fit(gmm, data, init_method='kmeans', w=0.01, cutoff=None, tol=1e-6, rng=rng, maxiter=1)
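The traceback points at `U[k].copy()` being called on a `None` entry. The sketch below reproduces that failure without pygmmis and shows one defensive way to copy such a list. It assumes, based only on the traceback, that the per-component entries of `U` are `None` when no cutoff is set; the helper `copy_neighborhoods` is hypothetical and is not part of pygmmis.

```python
def copy_neighborhoods(U):
    """Hypothetical helper: copy each per-component entry, passing None through."""
    return [None if u is None else u.copy() for u in U]

K = 3
U = [None] * K  # assumption: no neighbourhoods are stored when cutoff=None

# Same pattern as the failing line in the traceback
try:
    U_ = [U[k].copy() for k in range(K)]
except AttributeError as err:
    message = str(err)

# Guarded copy leaves the None entries in place instead of raising
U_safe = copy_neighborhoods(U)
```

Until the cause is confirmed, passing cutoff=inf as mentioned earlier in the thread avoids the `None` code path altogether.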