Closed: rpinsler closed this issue 8 years ago.
I came up with a solution, but don't know how to properly pass the number of mixing coefficients, M, as a parameter. The code looks as follows:
M = 3
model_gmm = Sequential()
[...]
model_gmm.add(Dense((M+2)*input_dim))
model_gmm.add(Lambda(gmm_activation(M)))
model_gmm.compile(loss=gmm_loss(M), optimizer='adam')
def gmm_activation(M):
    """
    GMM-like activation function.
    Assumes that the input has (M+2)*D dimensions, where D is the dimensionality
    of the target data. The first M*D features are treated as means, the next
    M features as variances and the last M features as mixture components of the GMM.
    """
    def gmm_activation_fn(x):
        D = T.shape(x)[1] / (M + 2)
        return T.concatenate([x[:, :M*D-1], T.exp(x[:, M*D:2*M*D-1]), T.nnet.sigmoid(x[:, 2*M*D:])], axis=1)
    return gmm_activation_fn
def gmm_loss(M):
    """
    GMM loss function.
    Assumes that y_pred has (M+2)*D dimensions and y_true has D dimensions.
    The first M*D features are treated as means, the next M features as
    variances and the last M features as mixture components of the GMM.
    """
    def loss(m, M, y_true, y_pred):
        D = T.shape(y_true)[1]
        return (y_pred[:, (D+1)*M+m] / y_pred[:, D*M+m]) \
            * T.exp(-T.sum(T.sqr(y_pred[:, D*m:(m+1)*D] - y_true)) / (2*y_pred[:, D*M+m]**2))

    def gmm_loss_fn(y_true, y_pred):
        seq = T.arange(M)
        result, _ = theano.scan(fn=loss,
                                outputs_info=None,
                                sequences=seq,
                                non_sequences=[M, y_true, y_pred])
        return result.sum()

    return gmm_loss_fn
Note that for both the activation and the loss function I return another function handle that takes the arguments specified by the Keras API. However, this doesn't work with the Lambda layer:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 8, in <module>
File "build/bdist.linux-x86_64/egg/keras/models.py", line 359, in compile
File "build/bdist.linux-x86_64/egg/keras/layers/containers.py", line 74, in get_output
File "build/bdist.linux-x86_64/egg/keras/layers/core.py", line 1022, in get_output
TypeError: arg 5 (closure) must be tuple
Are there better ways to do this?
I haven't tested it, so I can't say whether this solves your problem, but I'd use partial from functools (https://docs.python.org/2/library/functools.html):
from functools import partial

M = 3
gmm_used_activation = partial(gmm_activation, M)
model_gmm = Sequential()
[...]
model_gmm.add(Dense((M+2)*input_dim))
model_gmm.add(Lambda(gmm_used_activation))
model_gmm.compile(loss=gmm_loss(M), optimizer='adam')
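For what it's worth (an added illustration, not from the thread): `partial` binds `M` as the first positional argument, so this variant assumes `gmm_activation` is rewritten to take `(M, x)` directly rather than returning an inner function. A minimal, self-contained sketch of the binding with a hypothetical stand-in function:

```python
from functools import partial

# Stand-in for an activation that needs an extra parameter M (hypothetical example).
def scaled_activation(M, x):
    return x / M

act = partial(scaled_activation, 3)  # act(x) now calls scaled_activation(3, x)
print(act(12.0))                     # prints 4.0
```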
Thanks for the suggestion. Actually I tried that as well, but it leads to a similar error. I think I have to write a completely new layer which I can properly parameterize.
Thanks.
I've now implemented something that compiles and seems to work:
M = 3
model_gmm = Sequential()
[...]
model_gmm.add(Dense((input_dim+2)*M))
model_gmm.add(GMMActivation(M))
model_gmm.compile(loss=gmm_loss, optimizer='adam')
class GMMActivation(Layer):
    """
    GMM-like activation function.
    Assumes that the input has (D+2)*M dimensions, where D is the dimensionality of the
    target data. The first M*D features are treated as means, the next M features as
    standard devs and the last M features as mixture components of the GMM.
    """
    def __init__(self, M, **kwargs):
        super(GMMActivation, self).__init__(**kwargs)
        self.M = M

    def get_output(self, train=False):
        X = self.get_input(train)
        D = T.shape(X)[1] / self.M - 2
        # leave the mu values as they are since they're unconstrained
        # scale the sigmas with exp, so that all values are positive
        X = T.set_subtensor(X[:, D*self.M:(D+1)*self.M], T.exp(X[:, D*self.M:(D+1)*self.M]))
        # scale the alphas with softmax, so that all values are in [0,1] and sum up to 1
        X = T.set_subtensor(X[:, (D+1)*self.M:(D+2)*self.M], T.nnet.softmax(X[:, (D+1)*self.M:(D+2)*self.M]))
        return X

    def get_config(self):
        config = {"name": self.__class__.__name__,
                  "M": self.M}
        base_config = super(GMMActivation, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
def gmm_loss(y_true, y_pred):
    """
    GMM loss function.
    Assumes that y_pred has (D+2)*M dimensions and y_true has D dimensions. The first
    M*D features are treated as means, the next M features as standard devs and the last
    M features as mixture components of the GMM.
    """
    def loss(m, M, D, y_true, y_pred):
        mu = y_pred[:, D*m:(m+1)*D]
        sigma = y_pred[:, D*M+m]
        alpha = y_pred[:, (D+1)*M+m]
        return (alpha/sigma) * T.exp(-T.sum(T.sqr(mu - y_true), -1) / (2*sigma**2))

    D = T.shape(y_true)[1]
    M = T.shape(y_pred)[1] / (D + 2)
    seq = T.arange(M)
    result, _ = theano.scan(fn=loss, outputs_info=None,
                            sequences=seq, non_sequences=[M, D, y_true, y_pred])
    return -T.log(result.sum(0))
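As a quick aside (not part of the original comment), one way to sanity-check the loss shapes before training is to compile it directly with `theano.function` and feed small dummy arrays; the variable names below are made up, and the snippet assumes the `gmm_loss` above is in scope:

```python
import numpy as np
import theano
import theano.tensor as T

y_true_sym = T.matrix('y_true')
y_pred_sym = T.matrix('y_pred')
loss_fn = theano.function([y_true_sym, y_pred_sym], gmm_loss(y_true_sym, y_pred_sym))

D, M, n = 1, 3, 4
y_true_dummy = np.random.rand(n, D).astype(theano.config.floatX)
y_pred_dummy = np.random.rand(n, (D + 2) * M).astype(theano.config.floatX)
print(loss_fn(y_true_dummy, y_pred_dummy))  # one negative log-likelihood value per sample
```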
It seems that there's still an issue, though. This happened after about 16 iterations:
6400/10520 [=================>............] - ETA: 34s - loss: 2.0416
6528/10520 [=================>............] - ETA: 33s - loss: inf
6656/10520 [=================>............] - ETA: 32s - loss: nan
...
It was probably just a numerical problem with the optimizer.
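As an added illustration (not from the thread), the inf/nan pattern above is typical of evaluating the mixture density directly instead of in log space; the log-sum-exp trick that a later comment in this thread uses sidesteps it:

```python
import numpy as np

# Hypothetical per-component log densities log(alpha_k * N_k); very negative values
# are common when sigmas are small or the targets sit far from the means.
log_terms = np.array([-1050.0, -1052.0, -1060.0])

# Naive evaluation: exp underflows to 0 and the log becomes -inf.
print(np.log(np.sum(np.exp(log_terms))))          # -inf

# log-sum-exp trick: factor out the maximum before exponentiating.
m = log_terms.max()
print(m + np.log(np.sum(np.exp(log_terms - m))))  # about -1049.87, finite
```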
mark
I actually merged the sigma scaling and the alpha scaling into gmm_loss and used a plain Dense output layer; with that, the numeric problem is gone. I reimplemented the classic mixture density network example; the script below also includes a NumPy version of the loss for checking:
from keras.models import Sequential
from keras.layers.core import Dense
from keras import backend as back
# TODO: This is only implemented for theano, rewrite it using keras.backend (as an exercise).
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1067)
def softmax(x):
    # softmax along each row of x (one distribution per sample)
    # x = x - np.max(x, axis=1, keepdims=True)  # subtract the row max for numerical safety
    e = np.exp(x)
    en = e / np.sum(e, axis=1, keepdims=True)
    return en

'''
pi:    n,m   mixture weights for the m components of the mixture model
mu:    n,m*d for each component, d means (one per feature)
sigma: n,m*d for each component there should be a d*d covariance, but only the diagonal is considered here
returns samples from the Gaussian mixture model
'''
def draw_sample_from_mixture_guassian(n, d, out_pi, out_mu, out_sigma):
    result = np.zeros((n, d))
    m = out_pi.shape[1]
    for i in range(n):
        c = int(np.random.choice(range(m), size=1, replace=True, p=out_pi[i, :]))
        mu = out_mu[i, c * d:(c + 1) * d].ravel()
        sig = np.diag(out_sigma[i, c * d:(c + 1) * d])
        sample_c = np.random.multivariate_normal(mu, sig ** 2, 1).ravel()
        result[i, :] = sample_c
    return result
def drawContour(m, X, Y):
    n = 50
    xx = np.linspace(0, 1, n)
    yy = np.linspace(0, 1, n)
    xm, ym = np.meshgrid(xx, yy)
    logps = np.zeros((xm.size, 1))
    xm1 = xm.reshape(xm.size, 1)
    ym1 = ym.reshape(ym.size, 1)
    for i in range(xm.size):
        logps[i] = m.evaluate(xm1[i], ym1[i])
    plt.figure(figsize=(10, 10))
    plt.scatter(X, Y, color='g')
    plt.contour(xm, ym, np.reshape(logps, (n, n)), levels=np.linspace(logps.min(), logps.max(), 20))
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('3-component Gaussian Mixture Model for P(y|x)')
def drawSample(model, X, Y, M, title_str=''):
    y_pred = np.zeros_like(Y)
    param_pred = model.predict(X)
    if len(X.shape) > 1:
        D = X.shape[1]
    else:
        D = 1
    al = softmax(param_pred[:, 2 * D * M:])
    mu = param_pred[:, :D * M]
    sig = np.exp(param_pred[:, D * M: 2 * D * M])
    y_pred = draw_sample_from_mixture_guassian(y_pred.size, D, al, mu, sig)
    rmse = np.sqrt(((y_pred.ravel() - Y.ravel()) ** 2).mean())  # ravel so (N,1) and (N,) line up
    print("rmse : " + str(rmse))
    plt.figure(figsize=(10, 10))
    plt.plot(X, y_pred, 'b+')
    plt.plot(X, Y, 'g-.')
    plt.legend(['pred', 'truth'])
    plt.title(title_str)

def np_log_sum_exp(x, axis=None):
    x_max = np.max(x, axis=axis, keepdims=True)
    return np.log(np.sum(np.exp(x - x_max), axis=axis, keepdims=True, dtype=np.float32)) + x_max
'''
Avoid nan when computing log_sum_exp in the MDN loss function:
https://amjadmahayri.wordpress.com/2014/04/30/mixture-density-networks/
https://github.com/Theano/Theano/issues/1563
'''
def log_sum_exp(x, axis=None):
    x_max = T.max(x, axis=axis, keepdims=True)
    return T.log(T.sum(T.exp(x - x_max), axis=axis, keepdims=True)) + x_max
def np_loss(true, parameters):
    two_pi = 2 * np.pi
    D = true.shape[1]
    M = parameters.shape[1] / (2 * D + 1)
    mu = parameters[:, 0: D * M]
    sig = np.exp(parameters[:, D * M:2 * D * M])
    al = softmax(parameters[:, 2 * D * M:])
    n, k = mu.shape  # n samples, k = M * D mean entries
    z = np.zeros((n, M))
    for c in range(M):
        z[:, c:c + 1] = np.sum(((true - mu[:, c * D:(c + 1) * D]) / sig[:, c * D:(c + 1) * D]) ** 2,
                               axis=1, keepdims=True, dtype=np.float32) / -2.0
        normalizer = np.prod(sig[:, c * D:(c + 1) * D], axis=1, keepdims=True, dtype=np.float32) * two_pi
        z[:, c:c + 1] = z[:, c:c + 1] + np.log(al[:, c:c + 1]) - np.log(normalizer)
    lp = np_log_sum_exp(z, 1)
    # print(lp)
    loss = -np.sum(lp) / n
    return loss
def neg_log_normal_mixture_likelihood(true, parameters):
    D = T.shape(true)[1]
    M = T.shape(parameters)[1] / (2 * D + 1)
    means = parameters[:, : D * M]
    sigmas = T.exp(parameters[:, D * M: 2 * D * M])
    weights = T.nnet.softmax(parameters[:, 2 * D * M:])
    two_pi = 2 * np.pi

    def component_normal_likelihood(i, mus, sis, als, tr):
        mu = mus[:, i * D:(i + 1) * D]
        sig = sis[:, i * D:(i + 1) * D]
        al = als[:, i]
        z = T.sum(((true - mu) / sig) ** 2, axis=1) / -2.0
        normalizer = T.prod(sig, axis=1) * two_pi
        z += T.log(al) - T.log(normalizer)
        return z

    r, _ = theano.scan(
        fn=component_normal_likelihood,
        outputs_info=None,
        sequences=T.arange(M),
        non_sequences=[means, sigmas, weights, true])
    lp = log_sum_exp(r, 0)
    loss = -T.sum(lp) / T.shape(true)[0]
    return loss
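# (Added note, summarizing the slicing used by the loss above.)
# The final Dense layer emits (2*D + 1)*M raw values per sample, interpreted as:
#   parameters[:, 0     : D*M  ]  -> component means (used as-is)
#   parameters[:, D*M   : 2*D*M]  -> log standard deviations (exp applied in the loss)
#   parameters[:, 2*D*M :      ]  -> unnormalized mixture weights (softmax applied in the loss)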
# generate some 1D regression data (reproducing Bishop book data, page 273).
# Note that the P(y|x) is not a nice distribution. E.g. it has three modes for x ~= 0.5
N = 200
X = np.linspace(0, 1, N)
Y = X + 0.3 * np.sin(2 * 3.1415926 * X) + np.random.uniform(-0.05, 0.05, N)
X, Y = Y, X
# build neural network
M = 3
input_output_size = 1
hidden_size = 30
model_gmm = Sequential()
model_gmm.add(Dense(hidden_size, input_dim=input_output_size))
model_gmm.add(Dense((2 * input_output_size + 1) * M))
model_gmm.compile(loss=neg_log_normal_mixture_likelihood, optimizer='rmsprop')
get_3rd_layer_output = back.function([model_gmm.layers[0].input],
[model_gmm.layers[1].output])
out_param_value = get_3rd_layer_output([X.reshape(N,input_output_size)])[0]
print(np_loss(Y.reshape(N, input_output_size), out_param_value))
print("before fit")
drawSample(model_gmm, X, Y, M, "sample before training")
drawContour(model_gmm, X, Y)
plt.show()
print "fitting....."
history = model_gmm.fit(X, Y, batch_size=N, nb_epoch=10000)
print "after fit"
drawSample(model_gmm, X, Y, M, "sample after training")
drawContour(model_gmm, X, Y)
plt.show()
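As a small follow-up sketch (added here, not part of the original post; it assumes the script above has run and reuses its `softmax` helper), the learned mixture parameters for a single query input can be inspected like this:

```python
x_query = np.array([[0.5]], dtype=np.float32)   # shape (1, input_output_size)
params = model_gmm.predict(x_query)             # shape (1, (2*D + 1)*M), here D = 1

D = input_output_size
mu = params[:, :D * M]                          # component means
sigma = np.exp(params[:, D * M:2 * D * M])      # component standard deviations
alpha = softmax(params[:, 2 * D * M:])          # mixture weights, sum to 1

print("means:", mu.ravel())
print("sigmas:", sigma.ravel())
print("weights:", alpha.ravel())
```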
Hi yunzhou,
According to Keras documentation, the arguments to a loss function, in your case, "true" and "parameters", in
def neg_log_normal_mixture_likelihood(true, parameters):
should have the same dimensionality. I printed the shape of the two arguments in the function np_loss(), and they were different. The actual parameters in neg_log_normal_mixture_likelihood probably do not have memory allocated yet when the function is first called, so I cannot ascertain the argument dimensionality at that point.
Any clarification on this would be great. Thanks!
Gordon
Yes the "true" and "parameters" are in different shapes. It is a dirty and hacky way to implement the mixture density network layer.
The "true" is the samples in the dataset. But since we want to use density network layer to generate new data, we are optimizing the parameters of the mixture layer. The truth dimension is only n*2, since they are points on XY plane. But we using the mixture gaussian model to generate these points.
So given our training dataset, we want to find out the parameters of the mixture gaussian model are most likely to generate these points.
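For reference (an editorial restatement of what the posted loss computes, not a quote from the thread), the objective being minimized is the average negative log-likelihood of the targets under the predicted mixture, up to the constant in the Gaussian normalizer used in the code:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log \sum_{k=1}^{M} \alpha_k(x_i)\, \mathcal{N}\big(y_i \mid \mu_k(x_i), \sigma_k^2(x_i)\big)$$

where $\alpha_k$, $\mu_k$ and $\sigma_k$ are the softmax weights, means and exponentiated scales that the network outputs for input $x_i$.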
Great! Thanks for the reply. You confirmed my hypothesis.
I do have another question though. In your loss function, you are not summing Gaussians: I do not see an exp(...) function used.
Finally, what is the loss function when batchsize is greater than 1? Is it simply a sum of terms of the form -log(\sum{mixtures} a(k) * N(mu(k),sig(k)) ?
Thanks,
Gordon
Hi all,
I'm trying to build an MDN-like model with Keras. This basically adds a GMM on top of a network, which can be modeled using appropriate activation functions, i.e. for a D-dimensional output and M mixture components, we would apply: no activation (identity) to the M*D means, an exponential to the M variances (so they stay positive), and a softmax to the M mixture coefficients (so they lie in [0,1] and sum to 1).
I tried to model this using Graph, but the only thing I could come up with is to split the (intermediate) output of the network into three dense layers with M*D, M and M outputs, respectively, and then apply the activations mentioned above. However, I don't want to learn separate weights, but rather apply the three activations to different parts of the same output (since this is how it is typically done, if I understand correctly). I had a look at the Lambda layer too, but it seems that the indexing in Theano isn't straightforward either.
Any ideas?