jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License
3.36k stars 588 forks source link

Can't fit GMM with GammaDistribution #505

Closed rkrohn closed 4 years ago

rkrohn commented 5 years ago

Running version 0.10.0 installed via pip. I'm trying to fit a GMM consisting of two GammaDistributions.

I can fit the GMM using NormalDistributions:

numpy.random.seed(0)
X = numpy.random.gamma(2, 1, size=(500, 1))
X[::2] += 3
model = GeneralMixtureModel.from_samples(NormalDistribution, 2, X)

And the model initializes fine.

{
    "weights" : [
        0.7523022387697746,
        0.24769776123022538
    ],
    "distributions" : [
        {
            "parameters" : [
                4.2021923276063795,
                1.6248952492022728
            ],
            "frozen" : false,
            "name" : "NormalDistribution",
            "class" : "Distribution"
        },
        {
            "parameters" : [
                1.0163679834315575,
                0.5132402957231983
            ],
            "frozen" : false,
            "name" : "NormalDistribution",
            "class" : "Distribution"
        }
    ],
    "class" : "GeneralMixtureModel"
}

But if I try to fit the same data using two GammaDistributions, the initialization fails.

model = GeneralMixtureModel.from_samples(GammaDistribution, 2, X)
  File "fit_data.py", line 19, in <module>
    model = GeneralMixtureModel.from_samples(GammaDistribution, 2, X)
  File "pomegranate/gmm.pyx", line 601, in pomegranate.gmm.GeneralMixtureModel.from_samples
  File "pomegranate/distributions/distributions.pyx", line 333, in pomegranate.distributions.distributions.Distribution.from_samples
  File "pomegranate/distributions/GammaDistribution.pyx", line 87, in pomegranate.distributions.GammaDistribution.GammaDistribution.fit
  File "pomegranate/distributions/GammaDistribution.pyx", line 117, in pomegranate.distributions.GammaDistribution.GammaDistribution.summarize
ValueError: shapes (313,1) and (313,1) not aligned: 1 (dim 1) != 313 (dim 0)

I can also fit a single GammaDistribution, but introducing a GammaDistribution to any combination for the GMM produces the same errors.

jmschrei commented 5 years ago

I think this is a duplicate of #490. Can you try pulling the latest code?

rkrohn commented 5 years ago

I'm still getting the same error after installing using python3 setup.py install --user. Is there something else I'm missing?

jmschrei commented 5 years ago

I haven't pushed the change to pip yet, so there isn't a new version. What I meant was pulling the latest code frm GitHub and installing it manually. What OS are you on? Do you already have Cython installed?

rkrohn commented 5 years ago

I cloned the repo, and installed it manually using python3 setup.py install --user as described in https://pomegranate.readthedocs.io/en/latest/install.html. Working on a shared server that runs Ubuntu 14.04.5, Cython is installed. The manual install seemed to work fine, apart from some warnings, and I get this message at the end: Installed <...>/pomegranate-0.10.0-py3.4-linux-x86_64.egg. I also just tried on a Ubuntu 16.04.5 system and am still having the same problem.

I'm sure this is just an install issue on my end, so feel free to close if you'd like.

jjberry commented 5 years ago

I'm getting the same error. I've cloned the repo and built manually. I'm on Mac OS X 10.11.6, using the anaconda distribution.

RobinSomers commented 5 years ago

Me too I am dealing with the same error. I pulled the latest code from git and installed it on a Windows 10 machine. I feel like this is a different issue than dealt with in #490 , as the chuck of code posted in the last reply works for me (after dealing with some dependency issues concerning the networkx-package). It does indeed nicely separate the two gamma functions and it is worth noting that both the reshape(-1,1) and reshape(1,-1) produce the desired results.

However what I really want to accomplish, the GMM of gamma's and not the HMM does not work. I have a one-dimenisonal array and thus will reshape this. I tried both reshape(-1,1) and reshape (1,-1) The first one does still give me the mentioned error. When I use reshape(1,-1) thus resulting in a vector with size(1,something), this does not give me an error but a lot of RunTimeWarnings (invalid value, divide by zero,...) and all of the fitted GammaDistributions have NaN parameters.

I am not sure what I am doing wrong, or the bug mentioned in #490 still persists in some way.

jmschrei commented 5 years ago

It may be helpful if you pull the most recent code, which supports custom distributions, and see if you can code a gamma distribution that does work. If you can find the bug I'd be happy to include it asap in the built-in one. Sorry for the delay!

jmschrei commented 5 years ago

This tutorial may be helpful for implementing custom distributions: https://github.com/jmschrei/pomegranate/blob/master/tutorials/C_Feature_Tutorial_5_Custom_Distributions.ipynb

RobinSomers commented 5 years ago

Hi Jacob, thanks for the reply. I pulled the latest version of the code, yet the bug still persists. For now however I will use the HMM to fit two gamma-distributions, as described in #490 . Later on, I will maybe try to implement my own custom gamma distribution or try to look for the bug, but for now the HMM fits my needs.

CTyny commented 5 years ago

Hi everyone, I also hit this seemingly nonsensical ValueError while trying to fit a GMM of gamma distributions with the latest pomegranate version available from conda-forge (0.11.0). I don't want to use the HMM workaround so I had a poke around to see if I could find the bug instead.

It appears that GammaDistribution.pyx has two summarize functions, one in pure python and one in C. The ValueError can be traced to the python version at the point when the summaries are calculated and probably arises because numpy.dot won't evaluate the dot product of a (n, 1) and (n, ) array (data and weights respectively).

I cobbled together a pure python custom univariate Gamma Distribution called GammaDistribution2 for testing, basically by hastily replacing the small amount of C from GammaDistribution.pyx and adding in any missing requirements specified in tutorial 5. Passing numpy.dot a reshaped (n, ) copy of the data (items) seems to work for me without causing any obvious further oddities e.g.

        self.summaries[0] += numpy.dot(items.reshape(items.shape[0], ), weights)
        self.summaries[1] += numpy.log(items.reshape(items.shape[0], )).dot(weights)
        self.summaries[2] += weights.sum()

Now going back to rkrohn's test...

numpy.random.seed(0)
X = numpy.random.gamma(2, 1, size=(500, 1))
X[::2] += 3

model = GeneralMixtureModel.from_samples(GammaDistribution2, 2, X, verbose=True)

x = numpy.arange(0, 15, 0.1)
plt.plot(x, model.probability(x))
plt.plot(x, model.distributions[0].probability(x))
plt.plot(x, model.distributions[1].probability(x))

n, bins, patches = plt.hist(X, bins=100, density=True,
                            range=(0, 15))

outputs the following:

Reloaded modules: GammaDistribution2
[1] Improvement: 3.7536636687063947     Time (s): 0.0
[2] Improvement: 0.5529588246918138     Time (s): 0.0
[3] Improvement: 0.11302495923837341    Time (s): 0.0
[4] Improvement: 0.02689769376490858    Time (s): 0.01562
Total Improvement: 4.4465451464014905
Total Time (s): 0.0156

image  It also seems to work (or at least not break) when used in a mixture model with other distributions and with multivariate data modelled with multiple distributions (as in tutorial 2).

The full custom distribution code if you want to test it out is:

import numpy
import scipy
import scipy.stats
import random

from pomegranate.utils import check_random_state

class GammaDistribution2():

    """
    This distribution represents a gamma distribution, parameterized in the
    alpha/beta (shape/rate) parameterization. ML estimation for a gamma
    distribution, taking into account weights on the data, is nontrivial, and I
    was unable to find a good theoretical source for how to do it, so I have
    cobbled together a solution here from less-reputable sources.
    """

    def __init__(self, alpha, beta, frozen=False):
        self.alpha = alpha
        self.beta = beta
        self.parameters = (self.alpha, self.beta)
        self.d = 1
        self.summaries = [0, 0, 0]
        self.name = "GammaDistribution2"
        self.frozen = frozen

    def __reduce__(self):
        """Serialize distribution for pickling."""
        return self.__class__, (self.alpha, self.beta, self.frozen)

    def log_probability(self, X):
        return scipy.stats.gamma.logpdf(X, self.alpha, scale=1/self.beta)

    def probability(self, X):
        return numpy.exp(self.log_probability(X))

    def sample(self, n=None, random_state=None):
        random_state = check_random_state(random_state)
        return random_state.gamma(self.parameters[0], 1.0 / self.parameters[1], n)

    def fit(self, items, weights=None, inertia=0.0, epsilon=1E-9,
            iteration_limit=1000, column_idx=0):
        """
        Set the parameters of this Distribution to maximize the likelihood of
        the given sample. Items holds some sort of sequence. If weights is
        specified, it holds a sequence of value to weight each item by.
        In the Gamma case, likelihood maximization is necesarily numerical, and
        the extension to weighted values is not trivially obvious. The algorithm
        used here includes a Newton-Raphson step for shape parameter estimation,
        and analytical calculation of the rate parameter. The extension to
        weights is constructed using vital information found way down at the
        bottom of an Experts Exchange page.
        Newton-Raphson continues until the change in the parameter is less than
        epsilon, or until iteration_limit is reached
        See:
            http://en.wikipedia.org/wiki/Gamma_distribution
            http://www.experts-exchange.com/Other/Math_Science/Q_23943764.html
        """

        self.summarize(items, weights, column_idx)
        self.from_summaries(inertia, epsilon, iteration_limit)

    def summarize(self, items, weights=None):
        """
        Take in a series of items and their weights and reduce it down to a
        summary statistic to be used in training later.
        """

        if len(items) == 0:
            # No sample, so just ignore it and keep our old parameters.
            return

        # Make it be a numpy array
        items = numpy.asarray(items)

        if weights is None:
            # Weight everything 1 if no weights specified
            weights = numpy.ones_like(items)
        else:
            # Force whatever we have to be a Numpy array
            weights = numpy.asarray(weights)

        if weights.sum() == 0:
            # Since negative weights are banned, we must have no data.
            # Don't change the parameters at all.
            return

        # Save the weighted average of the items, and the weighted average of
        # the log of the items.
        self.summaries[0] += numpy.dot(items.reshape(items.shape[0], ), weights)
        self.summaries[1] += numpy.log(items.reshape(items.shape[0], )).dot(weights)
        self.summaries[2] += weights.sum()

    def from_summaries(self, inertia=0.0, epsilon=1e-4,
                       iteration_limit=100):
        """
        Set the parameters of this Distribution to maximize the likelihood of
        the given sample given the summaries which have been stored.
        In the Gamma case, likelihood maximization is necesarily numerical, and
        the extension to weighted values is not trivially obvious. The algorithm
        used here includes a Newton-Raphson step for shape parameter estimation,
        and analytical calculation of the rate parameter. The extension to
        weights is constructed using vital information found way down at the
        bottom of an Experts Exchange page.
        Newton-Raphson continues until the change in the parameter is less than
        epsilon, or until iteration_limit is reached
        See:
        http://en.wikipedia.org/wiki/Gamma_distribution
        http://www.experts-exchange.com/Other/Math_Science/Q_23943764.html
        """

        # If the distribution is frozen, don't bother with any calculation
        if self.summaries[2] < 1e-7 or self.frozen == True:
            return

        # First, do Newton-Raphson for shape parameter.

        # Calculate the sufficient statistic s, which is the log of the average
        # minus the average log. When computing the average log, we weight
        # outside the log function. (In retrospect, this is actually pretty
        # obvious.)
        statistic = (numpy.log(self.summaries[0] / self.summaries[2]) -
                     self.summaries[1] / self.summaries[2])

        # Start our Newton-Raphson at what Wikipedia claims a 1969 paper claims
        # is a good approximation.
        # Really, start with new_shape set, and shape set to be far away from it
        shape = float("inf")

        if statistic != 0:
            # Not going to have a divide by 0 problem here, so use the good
            # estimate
            new_shape = (3 - statistic + numpy.sqrt((statistic - 3) ** 2 + 24 *
                                               statistic)) / (12 * statistic)
            if statistic == 0 or new_shape <= 0:
                # Try the current shape parameter
                new_shape = self.parameters[0]

        # Count the iterations we take
        iteration = 0

        # Now do the update loop.
        # We need the digamma (gamma derivative over gamma) and trigamma
        # (digamma derivative) functions. Luckily, scipy.special.polygamma(0, x)
        # is the digamma function (0th derivative of the digamma), and
        # scipy.special.polygamma(1, x) is the trigamma function.

        while abs(shape - new_shape) > epsilon and iteration < iteration_limit:
            shape = new_shape

            new_shape = shape - (numpy.log(shape) -
                                 scipy.special.polygamma(0, shape) -
                                 statistic) / (1.0 / shape - scipy.special.polygamma(1, shape))

            # print new_shape, scipy.special.polygamma(1, shape)

            # Don't let shape escape from valid values
            if abs(new_shape) == float("inf") or new_shape == 0:
                # Hack the shape parameter so we don't stop the loop if we land
                # near it.
                shape = new_shape

                # Re-start at some random place.
                new_shape = random.random()

            iteration += 1

        # Might as well grab the new value
        shape = new_shape

        # Now our iterative estimation of the shape parameter has converged.
        # Calculate the rate parameter
        rate = 1.0 / (1.0 / (shape * self.summaries[2]) * self.summaries[0])

        # Get the previous parameters
        prior_shape, prior_rate = self.parameters

        # Calculate the new parameters, respecting inertia, with an inertia
        # of 0 being completely replacing the parameters, and an inertia of
        # 1 being to ignore new training data.
        self.alpha = prior_shape*inertia + shape*(1.0-inertia)
        self.beta = prior_rate*inertia + rate*(1.0-inertia)
        self.summaries = [0, 0, 0]

    def clear_summaries(self):
        """Clear the summary statistics stored in the object."""
        self.summaries = [0, 0, 0]

    @classmethod
    def from_samples(cls, X, weights=None):
        d = GammaDistribution2(0, 0)
        d.summarize(X, weights)
        d.from_summaries()
        return d

    @classmethod
    def blank(cls):
        return GammaDistribution2(0, 0)
jmschrei commented 4 years ago

Should be fixed now.