Closed rkrohn closed 4 years ago
I think this is a duplicate of #490. Can you try pulling the latest code?
I'm still getting the same error after installing using python3 setup.py install --user
. Is there something else I'm missing?
I haven't pushed the change to pip yet, so there isn't a new version. What I meant was pulling the latest code frm GitHub and installing it manually. What OS are you on? Do you already have Cython installed?
I cloned the repo, and installed it manually using python3 setup.py install --user
as described in https://pomegranate.readthedocs.io/en/latest/install.html. Working on a shared server that runs Ubuntu 14.04.5, Cython is installed. The manual install seemed to work fine, apart from some warnings, and I get this message at the end: Installed <...>/pomegranate-0.10.0-py3.4-linux-x86_64.egg
. I also just tried on a Ubuntu 16.04.5 system and am still having the same problem.
I'm sure this is just an install issue on my end, so feel free to close if you'd like.
I'm getting the same error. I've cloned the repo and built manually. I'm on Mac OS X 10.11.6, using the anaconda distribution.
Me too I am dealing with the same error. I pulled the latest code from git and installed it on a Windows 10 machine. I feel like this is a different issue than dealt with in #490 , as the chuck of code posted in the last reply works for me (after dealing with some dependency issues concerning the networkx-package). It does indeed nicely separate the two gamma functions and it is worth noting that both the reshape(-1,1) and reshape(1,-1) produce the desired results.
However what I really want to accomplish, the GMM of gamma's and not the HMM does not work. I have a one-dimenisonal array and thus will reshape this. I tried both reshape(-1,1) and reshape (1,-1) The first one does still give me the mentioned error. When I use reshape(1,-1) thus resulting in a vector with size(1,something), this does not give me an error but a lot of RunTimeWarnings (invalid value, divide by zero,...) and all of the fitted GammaDistributions have NaN parameters.
I am not sure what I am doing wrong, or the bug mentioned in #490 still persists in some way.
It may be helpful if you pull the most recent code, which supports custom distributions, and see if you can code a gamma distribution that does work. If you can find the bug I'd be happy to include it asap in the built-in one. Sorry for the delay!
This tutorial may be helpful for implementing custom distributions: https://github.com/jmschrei/pomegranate/blob/master/tutorials/C_Feature_Tutorial_5_Custom_Distributions.ipynb
Hi Jacob, thanks for the reply. I pulled the latest version of the code, yet the bug still persists. For now however I will use the HMM to fit two gamma-distributions, as described in #490 . Later on, I will maybe try to implement my own custom gamma distribution or try to look for the bug, but for now the HMM fits my needs.
Hi everyone, I also hit this seemingly nonsensical ValueError while trying to fit a GMM of gamma distributions with the latest pomegranate version available from conda-forge (0.11.0). I don't want to use the HMM workaround so I had a poke around to see if I could find the bug instead.
It appears that GammaDistribution.pyx has two summarize functions, one in pure python and one in C. The ValueError can be traced to the python version at the point when the summaries are calculated and probably arises because numpy.dot won't evaluate the dot product of a (n, 1) and (n, ) array (data and weights respectively).
I cobbled together a pure python custom univariate Gamma Distribution called GammaDistribution2 for testing, basically by hastily replacing the small amount of C from GammaDistribution.pyx and adding in any missing requirements specified in tutorial 5. Passing numpy.dot a reshaped (n, ) copy of the data (items) seems to work for me without causing any obvious further oddities e.g.
self.summaries[0] += numpy.dot(items.reshape(items.shape[0], ), weights)
self.summaries[1] += numpy.log(items.reshape(items.shape[0], )).dot(weights)
self.summaries[2] += weights.sum()
Now going back to rkrohn's test...
numpy.random.seed(0)
X = numpy.random.gamma(2, 1, size=(500, 1))
X[::2] += 3
model = GeneralMixtureModel.from_samples(GammaDistribution2, 2, X, verbose=True)
x = numpy.arange(0, 15, 0.1)
plt.plot(x, model.probability(x))
plt.plot(x, model.distributions[0].probability(x))
plt.plot(x, model.distributions[1].probability(x))
n, bins, patches = plt.hist(X, bins=100, density=True,
range=(0, 15))
outputs the following:
Reloaded modules: GammaDistribution2
[1] Improvement: 3.7536636687063947 Time (s): 0.0
[2] Improvement: 0.5529588246918138 Time (s): 0.0
[3] Improvement: 0.11302495923837341 Time (s): 0.0
[4] Improvement: 0.02689769376490858 Time (s): 0.01562
Total Improvement: 4.4465451464014905
Total Time (s): 0.0156
 It also seems to work (or at least not break) when used in a mixture model with other distributions and with multivariate data modelled with multiple distributions (as in tutorial 2).
The full custom distribution code if you want to test it out is:
import numpy
import scipy
import scipy.stats
import random
from pomegranate.utils import check_random_state
class GammaDistribution2():
"""
This distribution represents a gamma distribution, parameterized in the
alpha/beta (shape/rate) parameterization. ML estimation for a gamma
distribution, taking into account weights on the data, is nontrivial, and I
was unable to find a good theoretical source for how to do it, so I have
cobbled together a solution here from less-reputable sources.
"""
def __init__(self, alpha, beta, frozen=False):
self.alpha = alpha
self.beta = beta
self.parameters = (self.alpha, self.beta)
self.d = 1
self.summaries = [0, 0, 0]
self.name = "GammaDistribution2"
self.frozen = frozen
def __reduce__(self):
"""Serialize distribution for pickling."""
return self.__class__, (self.alpha, self.beta, self.frozen)
def log_probability(self, X):
return scipy.stats.gamma.logpdf(X, self.alpha, scale=1/self.beta)
def probability(self, X):
return numpy.exp(self.log_probability(X))
def sample(self, n=None, random_state=None):
random_state = check_random_state(random_state)
return random_state.gamma(self.parameters[0], 1.0 / self.parameters[1], n)
def fit(self, items, weights=None, inertia=0.0, epsilon=1E-9,
iteration_limit=1000, column_idx=0):
"""
Set the parameters of this Distribution to maximize the likelihood of
the given sample. Items holds some sort of sequence. If weights is
specified, it holds a sequence of value to weight each item by.
In the Gamma case, likelihood maximization is necesarily numerical, and
the extension to weighted values is not trivially obvious. The algorithm
used here includes a Newton-Raphson step for shape parameter estimation,
and analytical calculation of the rate parameter. The extension to
weights is constructed using vital information found way down at the
bottom of an Experts Exchange page.
Newton-Raphson continues until the change in the parameter is less than
epsilon, or until iteration_limit is reached
See:
http://en.wikipedia.org/wiki/Gamma_distribution
http://www.experts-exchange.com/Other/Math_Science/Q_23943764.html
"""
self.summarize(items, weights, column_idx)
self.from_summaries(inertia, epsilon, iteration_limit)
def summarize(self, items, weights=None):
"""
Take in a series of items and their weights and reduce it down to a
summary statistic to be used in training later.
"""
if len(items) == 0:
# No sample, so just ignore it and keep our old parameters.
return
# Make it be a numpy array
items = numpy.asarray(items)
if weights is None:
# Weight everything 1 if no weights specified
weights = numpy.ones_like(items)
else:
# Force whatever we have to be a Numpy array
weights = numpy.asarray(weights)
if weights.sum() == 0:
# Since negative weights are banned, we must have no data.
# Don't change the parameters at all.
return
# Save the weighted average of the items, and the weighted average of
# the log of the items.
self.summaries[0] += numpy.dot(items.reshape(items.shape[0], ), weights)
self.summaries[1] += numpy.log(items.reshape(items.shape[0], )).dot(weights)
self.summaries[2] += weights.sum()
def from_summaries(self, inertia=0.0, epsilon=1e-4,
iteration_limit=100):
"""
Set the parameters of this Distribution to maximize the likelihood of
the given sample given the summaries which have been stored.
In the Gamma case, likelihood maximization is necesarily numerical, and
the extension to weighted values is not trivially obvious. The algorithm
used here includes a Newton-Raphson step for shape parameter estimation,
and analytical calculation of the rate parameter. The extension to
weights is constructed using vital information found way down at the
bottom of an Experts Exchange page.
Newton-Raphson continues until the change in the parameter is less than
epsilon, or until iteration_limit is reached
See:
http://en.wikipedia.org/wiki/Gamma_distribution
http://www.experts-exchange.com/Other/Math_Science/Q_23943764.html
"""
# If the distribution is frozen, don't bother with any calculation
if self.summaries[2] < 1e-7 or self.frozen == True:
return
# First, do Newton-Raphson for shape parameter.
# Calculate the sufficient statistic s, which is the log of the average
# minus the average log. When computing the average log, we weight
# outside the log function. (In retrospect, this is actually pretty
# obvious.)
statistic = (numpy.log(self.summaries[0] / self.summaries[2]) -
self.summaries[1] / self.summaries[2])
# Start our Newton-Raphson at what Wikipedia claims a 1969 paper claims
# is a good approximation.
# Really, start with new_shape set, and shape set to be far away from it
shape = float("inf")
if statistic != 0:
# Not going to have a divide by 0 problem here, so use the good
# estimate
new_shape = (3 - statistic + numpy.sqrt((statistic - 3) ** 2 + 24 *
statistic)) / (12 * statistic)
if statistic == 0 or new_shape <= 0:
# Try the current shape parameter
new_shape = self.parameters[0]
# Count the iterations we take
iteration = 0
# Now do the update loop.
# We need the digamma (gamma derivative over gamma) and trigamma
# (digamma derivative) functions. Luckily, scipy.special.polygamma(0, x)
# is the digamma function (0th derivative of the digamma), and
# scipy.special.polygamma(1, x) is the trigamma function.
while abs(shape - new_shape) > epsilon and iteration < iteration_limit:
shape = new_shape
new_shape = shape - (numpy.log(shape) -
scipy.special.polygamma(0, shape) -
statistic) / (1.0 / shape - scipy.special.polygamma(1, shape))
# print new_shape, scipy.special.polygamma(1, shape)
# Don't let shape escape from valid values
if abs(new_shape) == float("inf") or new_shape == 0:
# Hack the shape parameter so we don't stop the loop if we land
# near it.
shape = new_shape
# Re-start at some random place.
new_shape = random.random()
iteration += 1
# Might as well grab the new value
shape = new_shape
# Now our iterative estimation of the shape parameter has converged.
# Calculate the rate parameter
rate = 1.0 / (1.0 / (shape * self.summaries[2]) * self.summaries[0])
# Get the previous parameters
prior_shape, prior_rate = self.parameters
# Calculate the new parameters, respecting inertia, with an inertia
# of 0 being completely replacing the parameters, and an inertia of
# 1 being to ignore new training data.
self.alpha = prior_shape*inertia + shape*(1.0-inertia)
self.beta = prior_rate*inertia + rate*(1.0-inertia)
self.summaries = [0, 0, 0]
def clear_summaries(self):
"""Clear the summary statistics stored in the object."""
self.summaries = [0, 0, 0]
@classmethod
def from_samples(cls, X, weights=None):
d = GammaDistribution2(0, 0)
d.summarize(X, weights)
d.from_summaries()
return d
@classmethod
def blank(cls):
return GammaDistribution2(0, 0)
Should be fixed now.
Running version 0.10.0 installed via pip. I'm trying to fit a GMM consisting of two GammaDistributions.
I can fit the GMM using NormalDistributions:
And the model initializes fine.
But if I try to fit the same data using two GammaDistributions, the initialization fails.
I can also fit a single GammaDistribution, but introducing a GammaDistribution to any combination for the GMM produces the same errors.