jonathf / chaospy

Chaospy - Toolbox for performing uncertainty quantification.
https://chaospy.readthedocs.io/
MIT License

Mixture of Gaussian distribution #187

Open coschroeder opened 4 years ago

coschroeder commented 4 years ago

Hi, I am struggling with creating a mixture-of-Gaussians distribution. I tried several approaches, but none of them seemed correct. Here is one possibility. I think I am missing something; maybe you can help out?

from chaospy import Dist
from scipy import special
import numpy as np
import numpy

import matplotlib.pyplot as plt
import seaborn as sns

# Define MoG class
class MoG(Dist):

    def __init__(self, loc=[[-5, -5],[5,5]], scale=[[[1, .9], [.9, 1]], [[1, -.9], [-.9, 1]]], a=[0.7,0.3]):
        loc = numpy.asfarray(loc)        
        scale = numpy.asfarray(scale)
        a = numpy.asfarray(a)
        assert np.isclose(np.sum(a),1)
        assert len(loc) == len(scale)
        self._repr = {"loc": loc.tolist(), "scale": scale.tolist()}

        C = np.array([numpy.linalg.cholesky(scale[i]) for i in range(len(a))])
        Ci = np.array([ numpy.linalg.inv(C[i]) for i in range(len(a))])
        Dist.__init__(self, C=C, Ci=Ci, loc=loc, a=a)

    def _cdf(self, x, C, Ci, loc, a):
        """
        taking the weighted sum over all components
        """
        out = a[0]* special.ndtr(numpy.dot(Ci[0], (x.T-loc[0].T).T))
        for i in range(1, len(a)):
            out += a[i]* special.ndtr(numpy.dot(Ci[i], (x.T-loc[i].T).T))
        return out

    def _bnd(self, x, C, Ci, loc, a):
        """
        Boundaries are not yet correctly scaled,
        but they should work for distributions in the right range.
        """
        scale = numpy.sqrt(numpy.diag(numpy.dot(C[0],C[0].T)))
        lo,up = numpy.zeros((2,)+x.shape)
        lo.T[:] = (-20+loc[0])# (-7.5*scale+loc[0])
        up.T[:] = (20+loc[0]) #(7.5*scale+loc[0])
        return lo, up

    def __len__(self):
        # note: this is the number of components; it matches the dimension
        # only because both happen to be 2 in this example
        return len(self.prm["C"])

# create samples with numpy for comparison
def np_MoG(loc, scale, a, n=1):
    choices = np.random.choice(len(a), size=n, replace=True, p=a)
    counts = [sum(choices == i) for i in range(len(a))]
    samples = []
    for i in range(len(a)):
        samples.append(np.random.multivariate_normal(loc[i], scale[i], counts[i]))

    # stack the per-component samples; np.array on a ragged list is not needed
    samples = np.vstack(samples)
    np.random.shuffle(samples)

    return samples

And creating samples:

# create samples with cp
loc=[[-5, -5],[5,5]] 
scale=[[[1, 0.9], [0.9, 1]], [[1, 0], [0, 1]]]
a=[0.5,0.5]

R = MoG(loc=loc, scale=scale, a=a)
samples = R.sample(10000)

# reference samples with np
samples_np = np_MoG(loc,scale,a,n=10000)

# and plot everything
plt.figure(1, figsize=(12,8))
plt.subplot(2,2,1)
sns.distplot(samples[0], bins=np.arange(-10,20,0.5))
sns.distplot(samples_np[:,0], bins=np.arange(-10,20,0.5))
plt.subplot(2,2,4)
sns.distplot(samples[1], bins=np.arange(-10,20,0.5),label='cp samples')
sns.distplot(samples_np[:,1], bins=np.arange(-10,20,0.5), label='np samples')
plt.legend()
plt.subplot(2,2,2)
sns.kdeplot(samples[1], samples[0])
sns.kdeplot(samples_np[:,1], samples_np[:,0])

plt.xlim(-10,20)

[image: marginal histograms and 2-D KDE contours comparing the cp samples with the np reference samples]

jonathf commented 4 years ago

First of all: This is awesome! I'd love to add it to the library. If you are interested, that is.

As for the implementation.

If you haven't guessed it already, the _cdf name is lying a bit. It isn't the cumulative distribution function, but a forward Rosenblatt transformation. In the case of a multivariate Gaussian distribution, this can be achieved using the inverse of a Cholesky decomposition. But that is a solution specific to Gaussians, and it does not hold for other distributions; for those we have to do the full probability decomposition. It should be quite possible, though.
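
For a single multivariate Gaussian, a minimal sketch of that Cholesky shortcut (hypothetical helper, assuming only numpy and scipy) would be:

import numpy as np
from scipy import special

def gaussian_rosenblatt(x, mu, cov):
    # Whiten x with the inverse Cholesky factor; because that factor is lower
    # triangular, coordinate i depends only on x[0], ..., x[i], so mapping it
    # through the standard normal CDF gives the forward Rosenblatt output.
    L_inv = np.linalg.inv(np.linalg.cholesky(cov))
    return special.ndtr(L_inv @ (x - mu))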

To exemplify how to approach the problem, below are the theoretical calculations needed in the 3-dimensional case. I don't have the full answer for how this will look in practice, but I imagine that with a little tinkering it should be possible to get it into a form that is both computable and generalizable to multiple dimensions.

PDF:
    p(x, y, z) = w0 p0(x, y, z) + w1 p1(x, y, z) + w2 p2(x, y, z)
CDF:
    P(x, y, z) = w0 P0(x, y, z) + w1 P1(x, y, z) + w2 P2(x, y, z)

P0 ~ N(mu[0], sigma[0])
P1 ~ N(mu[1], sigma[1])
P2 ~ N(mu[2], sigma[2])

p(x, y) = int p(x, y, z) dz
        = w0 r0(x, y) + w1 r1(x, y) + w2 r2(x, y)
P(x, y) = int_x int_y p(x, y) dx dy
        = w0 R0(x, y) + w1 R1(x, y) + w2 R2(x, y)

R0 ~ N(mu[0][:2], sigma[0][:2, :2])
R1 ~ N(mu[1][:2], sigma[1][:2, :2])
R2 ~ N(mu[2][:2], sigma[2][:2, :2])

p(x) = int int p(x, y, z) dy dz
     = w0 q0(x) + w1 q1(x) + w2 q2(x)
P(x) = int_x p(x) dx
     = w0 Q0(x) + w1 Q1(x) + w2 Q2(x)

Q0 ~ N(mu[0][0], sigma[0][0, 0])
Q1 ~ N(mu[1][0], sigma[1][0, 0])
Q2 ~ N(mu[2][0], sigma[2][0, 0])

p(y | x) = p(x, y) / p(x)
P(y | x) = int_y p(x, y) / p(x) dy
         = (w0 int_y r0(x, y) dy + w1 int_y r1(x, y) dy + w2 int_y r2(x, y) dy)
          / (w0 q0(x) + w1 q1(x) + w2 q2(x))

p(z | x, y) = p(x, y, z) / p(x, y)
P(z | x, y) = int_z p(x, y, z) / p(x, y) dz
            = (w0 int_z p0(x, y, z) dz + w1 int_z p1(x, y, z) dz + w2 int_z p2(x, y, z) dz)
             / (w0 r0(x, y) + w1 r1(x, y) + w2 r2(x, y))

output = [P(x), P(y | x), P(z | x, y)]
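
To make this concrete in the two-dimensional, two-component case, a minimal sketch of [P(x), P(y | x)] could look as follows (hypothetical helper, assuming only numpy and scipy, and using the standard conditional-Gaussian identities):

import numpy as np
from scipy import stats

def mog_rosenblatt_2d(x, y, locs, covs, weights):
    locs, covs = np.asarray(locs), np.asarray(covs)
    # P(x): weighted sum of the component CDFs in the first dimension.
    Px = sum(w * stats.norm(m[0], np.sqrt(c[0, 0])).cdf(x)
             for w, m, c in zip(weights, locs, covs))
    # P(y | x): each component contributes its marginal pdf at x times the
    # CDF of its conditional N(y | x), normalized by the mixture pdf at x.
    num = den = 0.0
    for w, m, c in zip(weights, locs, covs):
        qx = stats.norm(m[0], np.sqrt(c[0, 0])).pdf(x)
        cond_loc = m[1] + c[1, 0] / c[0, 0] * (x - m[0])
        cond_std = np.sqrt(c[1, 1] - c[1, 0] ** 2 / c[0, 0])
        num += w * qx * stats.norm(cond_loc, cond_std).cdf(y)
        den += w * qx
    return Px, num / den

This handles a single (x, y) point at a time; vectorizing it over samples is a separate shape question.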

Please let me know, if something is unclear.

coschroeder commented 4 years ago

UPDATED

Thanks for your fast response and your interest.

Yes, I was aware that _cdf is not the cumulative distribution, but I didn't know exactly how to deal with it.

I tried to implement your description, but it is not working yet. Maybe you can give me some more input on the following:

  1. Do you spot an error? The results are not yet correct.
  2. I calculate int_y r0(x, y) dy using the marginals and conditional distributions. This should be faster than the scp.integrate approach.
  3. I am confused about the shapes of the output. Only one sample at a time works. What would be the correct shape of the output?

from chaospy import Dist
from scipy import special
import numpy as np
import numpy
import scipy as scp
import matplotlib.pyplot as plt
import seaborn as sns

"""
Helper functions for conditional Normal distributions
"""
def get_cond_loc(loc, scale, i, xb):
    """
    i: conditioning on the first i params (param_a[i:]|param_b[0:i]) p(a|b)
    [Bishop: page 87]
    """
    assert len(xb) == i
    Cbb = scale[0:i, 0:i]
    Cab = scale[i:, 0:i]
    locb = loc[:i]
    loca = loc[i:]
    Cbb_inv = np.linalg.inv(Cbb)
    loca_b = loca + np.dot(np.dot(Cab, Cbb_inv), (xb - locb))
    return loca_b

def get_cond_cov(scale,i):
    """
    i: conditioning on the first i params (param_a[i:]|param_b[0:i])
    [Bishop: page 87]
    """
    Caa = scale[i:, i:]
    Cbb = scale[0:i, 0:i]
    Cab = scale[i:, 0:i]
    Cbb_inv = np.linalg.inv(Cbb)

    Ca_b = Caa - np.dot(Cab, np.dot(Cbb_inv, Cab.T))

    return Ca_b
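
# A quick sanity check of the helpers against the textbook 2-D formulas:
# conditional mean loc[1] + C[1,0]/C[0,0]*(xb - loc[0]) and conditional
# variance C[1,1] - C[1,0]**2/C[0,0] [Bishop, page 87] (arbitrary test values)
_loc = np.array([0.0, 1.0])
_scale = np.array([[2.0, 0.5], [0.5, 1.0]])
_xb = np.array([0.3])
assert np.isclose(get_cond_loc(_loc, _scale, 1, _xb),
                  _loc[1] + _scale[1, 0] / _scale[0, 0] * (_xb[0] - _loc[0]))
assert np.isclose(get_cond_cov(_scale, 1),
                  _scale[1, 1] - _scale[1, 0] ** 2 / _scale[0, 0])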

"""
Define MoG class
"""
class MoG(Dist):

    def __init__(self, loc=[[-5, -5],[5,5]], scale=[[[1, .9], [.9, 1]], [[1, -.9], [-.9, 1]]], a=[0.7,0.3]):
        loc = numpy.asfarray(loc)        
        scale = numpy.asfarray(scale)
        a = numpy.asfarray(a)
        assert np.isclose(np.sum(a),1)
        assert len(loc) == len(scale)
        self._repr = {"loc": loc.tolist(), "scale": scale.tolist()}

        C = np.array([numpy.linalg.cholesky(scale[i]) for i in range(len(a))])
        Ci = np.array([ numpy.linalg.inv(C[i]) for i in range(len(a))])
        Dist.__init__(self, C=C, Ci=Ci, loc=loc, scale=scale, a=a)

    def _cdf(self, x, C, Ci, loc, scale,  a):
        N = len(loc[0]) # dimension
        P = np.zeros(N) # weighted sum cdf per conditional
        p = np.ones(N-1) # weighted sum pdf per conditional

        # iterating over all dimensions
        for i in range(0,N):
            Q = np.zeros(len(a))
            q = np.zeros(len(a))
            # iterating over components
            for comp in range(len(a)):
                # first part (pure weighted marginal cdf)
                if i==0:
                    Q[comp] = scp.stats.multivariate_normal(loc[comp][0],scale[comp][0,0]).cdf(x[0])

                # "higher" conditionals
                else:
                    '''
                    # define function to integrate int_y r(x,y) dy
                    fun = lambda z: scp.stats.multivariate_normal(loc[comp][:i+1],scale[comp][:i+1,:i+1]).pdf(np.append(x[:i],z))
                    # specify integration boundary
                    bound = -7.5*scale[comp][i,i]+loc[comp][i]
                    Q[comp] = scp.integrate.quad(fun,bound,x[i])[0]
                    '''
                    # NEW: use conditional distributions:  int_y r(x,y) dy = r(x) int_y r(y|x) dy 
                    cond_loc = get_cond_loc(loc[comp], scale[comp], i, x[:i])
                    cond_scale = get_cond_cov(scale[comp],i)
                    Q[comp] = (scp.stats.multivariate_normal(loc[comp][:i],scale[comp][:i,:i]).pdf(x[:i])
                               * scp.stats.multivariate_normal(cond_loc, cond_scale).cdf(x[i]) )

                # evaluate these higher conditionals. needed for normalization in next round
                if i<N-1:
                    q[comp] = scp.stats.multivariate_normal(loc[comp][:i+1],scale[comp][:i+1,:i+1]).pdf(x[:i+1])

            if i <N-1:
                p[i] = np.sum(a*q)

            P[i] = np.sum(a*Q)/p[i-1] # ok also for first entry, since p[-1]=1

        return np.reshape(P, np.shape(x))

    def _bnd(self, x, C, Ci, loc, scale,  a):
        """
        take minimum and maximum of components
        """
        scale = [numpy.sqrt(numpy.diag(numpy.dot(C[i],C[i].T))) for i in range(len(a))]
        lo_all = np.array([-7.5*scale[i]+loc[i] for i in range(len(a))])
        up_all = np.array([7.5*scale[i]+loc[i] for i in range(len(a))])

        lo,up = numpy.zeros((2,)+x.shape)    
        lo.T[:] =  np.min(lo_all, axis=0) #(-7.5*scale+loc[0])
        up.T[:] =  np.max(up_all, axis=0) #(7.5*scale+loc[0])
        return lo, up

    def __len__(self):
        # note: number of components, not dimension; see comment above
        return len(self.prm["C"])

# create samples with cp
loc=[[-1, -5],[1,2]] 
scale=[[[1, 0.9], [0.9, 1]], [[1, 0], [0, 1]]]
a=[0.5,0.5]

R = MoG(loc=loc, scale=scale, a=a)

# shapes not yet correct; only sample(1) works
samples = np.zeros((10, 2))
for i in range(10):
    temp = R.sample(1)
    samples[i, 0] = temp[0][0]
    samples[i, 1] = temp[1][0]

Thanks a lot!

jonathf commented 4 years ago

Great. I'll take a look at it closer to the weekend. I'll give you feedback after that.

jonathf commented 4 years ago

After getting into the nitty-gritty of (2), I realized that I made that part more complicated than it had to be.

s0(x) = int_y r0(x, y) dy

S0 ~ N(mu[0][0], sigma[0][0, 0])

In other words, just take an appropriate slice of mu and sigma.

Not sure why I made it more complicated than it needed to be.

jonathf commented 4 years ago

Regarding (3): the shape of the input and output should be the same. Also, len(x) == len(self) should hold.
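
For example, given a transform that handles a single point, the columns can be mapped like this (a sketch; transform_one is a hypothetical per-point helper, not chaospy API):

import numpy as np

def transform_columns(transform_one, x):
    # x has shape (len(dist), num_samples); apply the per-point transform to
    # each column and stack, so the output shape matches the input shape.
    out = np.stack([transform_one(x[:, k]) for k in range(x.shape[1])], axis=-1)
    assert out.shape == x.shape
    return out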

coschroeder commented 4 years ago

Regarding (2): I am not sure which one is correct. In v1 you would take the marginal over x but integrate p(y|x) up to the specified y; in your last comment it would not depend on y at all. So I am just not sure which term is needed for the Rosenblatt transformation.

jonathf commented 4 years ago

Ah right, sorry, too quick on the trigger finger. You are right. It is not a full integration, but a partial one. The gritty details are still required, unfortunately.
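
Concretely, in the notation from the earlier comment, the partial term factors as

int_-inf^y r0(x, t) dt = q0(x) * Phi0(y | x)

where Phi0(y | x) is the CDF of the conditional Gaussian with mean get_cond_loc and covariance get_cond_cov; this is the identity the updated _cdf above already uses.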

jonathf commented 4 years ago

Second look: it looks like we need to be able to evaluate a truncated multivariate Gaussian distribution. I can make this, but I will need a little time to get it in place. I'll get to it before the end of the year. I'll make the MoG distribution as well while I am at it, as the truncated distribution should make it a lot easier to create the MoG.

jonathf commented 4 years ago

Unfortunately, the task is bigger than I first realized, and the limited time I had available has run out. It should still be possible to do, just not right now. I have added the task to my long-term goals, as I do want this feature as well.

Hopefully I will get more time to work on this soon.