haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Mixture models with weights #121

Closed chhh closed 7 years ago

chhh commented 7 years ago

Hi @haifengl, I was wondering if it's possible to use mixture models with weighted data. The use case is fitting histograms with mixture models.

Currently, AFAIU, only mixtures from raw data points are supported. E.g. EM GMM estimation (smile.stat.distribution.GaussianMixture(double[] data) constructor) currently supports building the GMM from data using BIC as the stopping criterion for the number of components. I was trying to modify that by adding two interfaces:

public interface DataProvider {
    /** Value of the i-th data point. */
    double get(int i);
    /** Total number of data points. */
    int size();
}

public interface WeightProvider {
    /** Weight of i-th data point. */
    double get(int i);
}

public GaussianMixture(DataProvider data, WeightProvider weights)

And I've changed all downstream calculations to reflect the weights (including the M step in GaussianDistribution), but never got it to work properly. BIC calculation always breaks.
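For reference, a minimal sketch of what a weighted EM fit for a 1D Gaussian mixture could look like. This is not Smile's implementation: the class `WeightedGmm1D`, its method names, the fixed iteration count, the variance floor, and the choice of total weight as the sample size in BIC are all assumptions made for illustration. The idea is that the E step is unchanged, while every sum over data points in the M step and in the log-likelihood is multiplied by the point's weight.

```java
import java.util.Arrays;

/** Hypothetical sketch of EM for a 1D Gaussian mixture on weighted points; not part of Smile. */
final class WeightedGmm1D {
    final double[] pi;    // mixing proportions
    final double[] mu;    // component means
    final double[] sigma; // component standard deviations

    WeightedGmm1D(double[] pi, double[] mu, double[] sigma) {
        this.pi = pi; this.mu = mu; this.sigma = sigma;
    }

    static double pdf(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
    }

    /** Runs a fixed number of EM iterations starting from the given means (no underflow protection). */
    static WeightedGmm1D fit(double[] x, double[] w, double[] mu0, double sigma0, int iters) {
        int n = x.length, k = mu0.length;
        double[] pi = new double[k], mu = mu0.clone(), sigma = new double[k];
        Arrays.fill(pi, 1.0 / k);
        Arrays.fill(sigma, sigma0);
        double totalW = Arrays.stream(w).sum();
        double[][] r = new double[n][k];
        for (int it = 0; it < iters; it++) {
            // E step: responsibilities do not depend on the weights.
            for (int i = 0; i < n; i++) {
                double s = 0;
                for (int j = 0; j < k; j++) {
                    r[i][j] = pi[j] * pdf(x[i], mu[j], sigma[j]);
                    s += r[i][j];
                }
                for (int j = 0; j < k; j++) r[i][j] /= s;
            }
            // M step: each per-point term is scaled by its weight w[i].
            for (int j = 0; j < k; j++) {
                double nj = 0, m = 0, v = 0;
                for (int i = 0; i < n; i++) {
                    nj += w[i] * r[i][j];
                    m += w[i] * r[i][j] * x[i];
                }
                m /= nj;
                for (int i = 0; i < n; i++) {
                    double d = x[i] - m;
                    v += w[i] * r[i][j] * d * d;
                }
                pi[j] = nj / totalW;
                mu[j] = m;
                sigma[j] = Math.sqrt(Math.max(v / nj, 1e-12)); // assumed floor against collapse
            }
        }
        return new WeightedGmm1D(pi, mu, sigma);
    }

    /** Weighted log-likelihood: sum over i of w[i] * log p(x[i]). */
    double logLikelihood(double[] x, double[] w) {
        double ll = 0;
        for (int i = 0; i < x.length; i++) {
            double p = 0;
            for (int j = 0; j < pi.length; j++) p += pi[j] * pdf(x[i], mu[j], sigma[j]);
            ll += w[i] * Math.log(p);
        }
        return ll;
    }

    /** BIC as logL - 0.5 * p * log(N), higher is better; N = total weight is an assumption. */
    double bic(double[] x, double[] w) {
        int p = 3 * pi.length - 1; // k means, k variances, k - 1 free proportions
        return logLikelihood(x, w) - 0.5 * p * Math.log(Arrays.stream(w).sum());
    }
}
```

With integer weights this should give the same parameters as fitting the replicated data set with unit weights, up to floating-point rounding. For non-integer weights the "sample size" in the BIC penalty is no longer well defined, which is one place where a BIC-based stopping rule could go wrong.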

haifengl commented 7 years ago

I am not sure whether weighted data make sense for a mixture model mathematically. The EM algorithm may not converge at all unless the update formulas are carefully adjusted.

chhh commented 7 years ago

Ok, thanks!
So the weights should only be allowed to be integers then (in which case formula updates are trivial)?
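To spell out why the integer case is straightforward (standard notation, not taken from Smile's source): with weight $w_i$ on point $x_i$, the weighted log-likelihood is the ordinary log-likelihood of a data set in which $x_i$ appears $w_i$ times, so the usual EM updates carry over with each per-point term scaled by $w_i$.

```latex
\ell_w(\theta) = \sum_{i} w_i \log \sum_{k} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2),
\qquad
r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}

N_k = \sum_{i} w_i r_{ik}, \qquad
\pi_k = \frac{N_k}{\sum_{i} w_i}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i} w_i r_{ik} x_i, \qquad
\sigma_k^2 = \frac{1}{N_k} \sum_{i} w_i r_{ik} (x_i - \mu_k)^2
```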

haifengl commented 7 years ago

It is not what I said. It is not about integers.

haifengl commented 7 years ago

Suppose your mixture components are Gaussian distributions. The weights may have some impact on the estimation of the means, but they have little impact on the estimation of the covariance matrix. If they do, the impact may not be good, as the covariance matrix may end up closer to singular. I don't recommend binning data and then doing pdf estimation.

chhh commented 7 years ago

My dataset is 1D and the problem is the following. I have several million 1D data points for measurements. I do Kernel Density Estimation and plot it and see some nice structure. Here are examples.

Red is KDE estimate, Blue is the sum of GMM components (GMM fitted using BIC criterion), other colors are separate components of the GMM.

Here are three distinct peaks in the KDE, but only one component for the GMM. [Image: kde-3-peaks_gmm-1-peak]

Here are two peaks in the KDE, but the GMM fits the main peak and the broad noise distribution (green). [Image: kde-2-peaks_gmm-1-peak]

The GMM is being fitted using the raw data. There might be some bias in the original data, e.g. imagine floating-point numbers being stored in a text file with only 2 digits after the decimal point, so a lot of points might fall on exactly the same value (this should not be happening, but it is what it is). KDE fixes that and produces the expected results, however fitting the GMM doesn't seem to work that well. So I wanted to fit GMMs to KDE estimates (which are like smoothed histograms). Right now I'm just distributing a fixed number of points (e.g. 1000) according to the KDE distribution and fitting those with the GMM. But I need to do this many, many times, so it gets slow if I use a lot of fake points. It could be fast, though, if the points were just weighted. That's what I meant by "integer weighting" in this case.
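A rough sketch of the "weighted grid" alternative to drawing fake points: evaluate the KDE once on a fixed grid and use the scaled density values as weights. The class `KdeGrid`, its method names, the Gaussian kernel, and the bandwidth `h` are all assumptions for illustration, and the naive double loop is O(n·m), so for millions of points a binned or otherwise accelerated KDE would be needed.

```java
/** Hypothetical helper: Gaussian KDE on a grid, turned into per-grid-point weights. */
final class KdeGrid {
    /** m evenly spaced points from lo to hi inclusive (m >= 2). */
    static double[] linspace(double lo, double hi, int m) {
        double[] g = new double[m];
        for (int i = 0; i < m; i++) g[i] = lo + i * (hi - lo) / (m - 1);
        return g;
    }

    /** Gaussian kernel density estimate of data, evaluated at each grid point, bandwidth h. */
    static double[] density(double[] data, double[] grid, double h) {
        double[] d = new double[grid.length];
        double norm = 1.0 / (data.length * h * Math.sqrt(2 * Math.PI));
        for (int g = 0; g < grid.length; g++) {
            double s = 0;
            for (double v : data) {
                double z = (grid[g] - v) / h;
                s += Math.exp(-0.5 * z * z);
            }
            d[g] = s * norm;
        }
        return d;
    }

    /** Rescales density values so that the weights sum to targetTotal. */
    static double[] toWeights(double[] density, double targetTotal) {
        double sum = 0;
        for (double v : density) sum += v;
        double[] w = new double[density.length];
        for (int i = 0; i < w.length; i++) w[i] = density[i] * targetTotal / sum;
        return w;
    }
}
```

The resulting (grid, weights) pairs are the kind of input a weighted fitting routine would take. Rounding the weights to integers would make them equivalent to replicated points, at the cost of some quantization error.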

haifengl commented 7 years ago

Regularization doesn't work for 1D data. Set the regularization parameter to 0 and you should be fine.

chhh commented 7 years ago

Thanks a lot for the suggestion, it worked!

I didn't find any simple way to set the regularization parameter to 0 other than to copy the GaussianMixture class and change the way EM is called in the GaussianMixture(double[] data) constructor there.

Secondly, if you're saying that regularization doesn't work for 1D data, then maybe it would be a good idea to change the default behavior of GaussianMixture(double[] data) to use gamma = 0.0 or some other gamma, because right now it's 0.2, which is also the maximum allowed value.

haifengl commented 7 years ago

Thanks! I changed the default value to 0.0 for 1D mixture models. It is in master now.

chhh commented 7 years ago

Thanks @haifengl