jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

Gaussian Mixture yields large negative improvement on first step using weighted data with large range in weights #1096

Open jmsvennson opened 2 months ago

jmsvennson commented 2 months ago

I am trying to fit a Gaussian mixture model to a dataset using a weight for each sample. My weights span a large range, from 1 to ~1e9. The weights were calculated from a smooth function of the raw data, so the weight should not change abruptly between nearby data values. But the function is exponential, so the multiplicative range is unavoidably large.
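To give a sense of how strongly such weights concentrate on a few samples, here is a minimal sketch that computes Kish's effective sample size, (sum w)^2 / sum(w^2), for weights that grow exponentially from 1 to 1e9. The weight function, the sample count `n`, and the helper name `effective_sample_size` are assumptions chosen for illustration, not the actual data from this report.

```python
import math

def effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    total = math.fsum(weights)
    total_sq = math.fsum(w * w for w in weights)
    return total * total / total_sq

# Hypothetical weights: exponential in a feature x on [0, 1], spanning 1 to 1e9.
n = 10_000
xs = [i / (n - 1) for i in range(n)]
weights = [math.exp(math.log(1e9) * x) for x in xs]

ess = effective_sample_size(weights)
print(f"weight range: {min(weights):.3g} to {max(weights):.3g}")
print(f"effective sample size: {ess:.1f} of {n}")
```

With weights like these, only a small fraction of the samples carries most of the total weight, so any component that covers mainly low-weight samples is estimated from very little effective data.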

If I run the fit without the weights, everything works as expected and I get a good fit to the distribution of the raw data. However, when the weights are included, I get a negative improvement on the first iteration, or within the first few iterations (in which case the negative improvement is larger in magnitude than the total positive improvement up to that point), and the fit ends. The initialization using Kmeans works fine and seems to incorporate the weights effectively. I know quite a bit about what the end result should look like in this case, so I can confirm that the weighted results from Kmeans are close to the correct result.

I tried rescaling the weights to [0,1] but this did not improve matters. I have also tried turning up the inertia as high as I can to reduce the step size, but this seems to have no effect.

I am using the latest version of Pomegranate on Windows 10.

I'm not quite sure how to provide a reproducible example, as the problem seems data-specific, except that I think it's the large number of orders of magnitude spanned by the weights that causes the problem. Artificially rescaling the weights to span a smaller multiplicative range alleviates the issue, but of course completely ruins the fit I am trying to achieve.
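One possible starting point for a reproducible example is synthetic data whose weights span a comparable range. The sketch below draws 1-D points from two Gaussian clusters and assigns each point a weight that is an exponential function of its value, scaled so the weights run from 1 to about 1e9. The function name `make_weighted_sample`, the cluster parameters, and the weight function are all hypothetical and only stand in for the real data.

```python
import math
import random

def make_weighted_sample(n=5000, log10_range=9.0, seed=0):
    """Return (xs, ws): 1-D points and smooth exponential weights.

    The weights span `log10_range` orders of magnitude, from 1 upward.
    """
    rng = random.Random(seed)
    # Two overlapping Gaussian clusters (hypothetical parameters).
    xs = [
        rng.gauss(-2.0, 1.0) if rng.random() < 0.5 else rng.gauss(3.0, 1.0)
        for _ in range(n)
    ]
    lo, hi = min(xs), max(xs)
    # Exponent rate chosen so the smallest x gets weight 1 and the
    # largest x gets weight 10**log10_range.
    rate = log10_range * math.log(10.0) / (hi - lo)
    ws = [math.exp(rate * (x - lo)) for x in xs]
    return xs, ws

xs, ws = make_weighted_sample()
print(f"{len(xs)} samples, weights from {min(ws):.3g} to {max(ws):.3g}")
```

The resulting `xs` and `ws` could then be passed to the mixture model as the data and the per-sample weights. Varying `log10_range` would show at what weight range the negative improvement first appears.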

The fact that Kmeans seems to incorporate these weights just fine gives me some hope that there is a potential resolution to this issue.

jmschrei commented 1 week ago

Howdy. Sorry for the delay in getting back to you. Unfortunately, without an example it can be difficult for me to look deeper into the issue. I imagine that there is an overflow happening somewhere. What sorts of ranges did you find provided reasonable results?
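Besides overflow, a related possibility is loss of precision when weighted quantities are accumulated in single precision. Whether the fit actually runs in single precision is an assumption here and has not been confirmed in this thread. The sketch below emulates float32 accumulation with the standard library to show that, once a running sum reaches 1e9, adding a weight of 1 no longer changes it. The helper names `to_float32` and `float32_sum` and the weights are hypothetical.

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def float32_sum(values):
    """Sum values, rounding the accumulator to float32 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_float32(acc + to_float32(v))
    return acc

# Hypothetical weights: one sample at 1e9 followed by 1000 samples at weight 1.
weights = [1e9] + [1.0] * 1000

exact = sum(weights)           # double precision keeps every unit weight
single = float32_sum(weights)  # float32 values near 1e9 are spaced 64 apart

print(f"double precision sum: {exact:.0f}")
print(f"float32 sum:          {single:.0f}")
```

If something like this happens inside the fit, the low-weight samples would contribute nothing to the accumulated statistics, and the computed log-likelihood could move in the wrong direction between iterations. Casting the data and weights to double precision before fitting, if the library allows it, would be one way to test this hypothesis.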