tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

How to Implement Bayesian LSTM layers for time-series prediction #394

Open behdadahmadi opened 5 years ago

behdadahmadi commented 5 years ago

How can I implement and use Bayesian LSTM layers for time-series prediction with TensorFlow Probability? There is no layer for RNNs among the TFP layers in tfp.layers.

alexv1247 commented 5 years ago

Exactly what I am looking for as well. I hope someone comes up with an approach. You can have a look at the Edward Python package; they have an LSTM example, which is a good start.

You can have a look at this blog post: https://github.com/kyle-dorman/bayesian-neural-network-blogpost. You can implement whatever NN structure you want in this example. However, the epistemic uncertainty is calculated with MC dropout, which can take forever.
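
For reference, the MC dropout part of that approach boils down to keeping dropout active at prediction time and averaging many stochastic forward passes, which is also why it can take forever. A rough sketch (the model shape and number of passes here are arbitrary, not the blog post's code):

import tensorflow as tf

# LSTM with input and recurrent dropout; sizes are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2,
                         input_shape=(30, 4)),
    tf.keras.layers.Dense(1),
])

def mc_dropout_predict(model, x, num_samples=100):
    # training=True keeps dropout active at prediction time, so each pass
    # uses a different dropout mask; the spread estimates epistemic uncertainty.
    preds = tf.stack([model(x, training=True) for _ in range(num_samples)])
    return tf.reduce_mean(preds, axis=0), tf.math.reduce_std(preds, axis=0)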

kevinykuo commented 5 years ago

You could hook up the RNN sequence output with a (time-distributed) dense variational and then a distribution output.
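In code, that could look roughly like the following untested sketch: the input shape (30, 4), the 32 units, num_train, and the posterior/prior helpers (taken from the TFP regression tutorial) are all placeholders, not a confirmed recipe.

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfpl = tfp.layers

def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    # Trainable mean-field normal posterior over kernel and bias.
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.))
    return tf.keras.Sequential([
        tfpl.VariableLayer(2 * n, dtype=dtype),
        tfpl.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

def prior_trainable(kernel_size, bias_size=0, dtype=None):
    # Trainable normal prior with unit scale.
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfpl.VariableLayer(n, dtype=dtype),
        tfpl.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1.),
            reinterpreted_batch_ndims=1)),
    ])

num_train = 1000  # placeholder: number of training sequences
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(30, 4)),
    tf.keras.layers.TimeDistributed(
        tfpl.DenseVariational(tfpl.IndependentNormal.params_size(1),
                              posterior_mean_field, prior_trainable,
                              kl_weight=1. / num_train)),
    tfpl.IndependentNormal(1),  # one predictive Normal per time step
])
model.compile(optimizer='adam',
              loss=lambda y, rv_y: -rv_y.log_prob(y))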

junpenglao commented 5 years ago

+1 to @kevinykuo. In addition, you can try combining the RNN sequence output with tfp.sts: either use the output as a design matrix in tfp.sts.*LinearRegression, or do something like mu = rnn_output + sts_model.make_state_space_model and plug mu into a distribution (e.g., Gaussian). Would be interesting to see what works best!
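
The first suggestion could look roughly like this (sketch only; rnn_output and observed_series below are random placeholders standing in for real data):

import numpy as np
import tensorflow_probability as tfp

# Placeholders: RNN features for 100 time steps and the observed series.
rnn_output = np.random.randn(100, 8).astype(np.float32)
observed_series = np.random.randn(100).astype(np.float32)

# Use the RNN features as the design matrix of an STS regression component,
# combined with a trend component.
regression = tfp.sts.LinearRegression(design_matrix=rnn_output)
trend = tfp.sts.LocalLinearTrend(observed_time_series=observed_series)
sts_model = tfp.sts.Sum([regression, trend],
                        observed_time_series=observed_series)
# The model can then be fit with tfp.sts.fit_with_hmc or with variational
# inference via tfp.sts.build_factored_surrogate_posterior.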

JP-MRPhys commented 5 years ago

I had a look at this this weekend too. I am thinking of starting from the ground up, especially if you want to implement posterior sharpening; Sonnet has an implementation of https://arxiv.org/pdf/1704.02798.pdf: https://github.com/deepmind/sonnet/blob/master/sonnet/examples/brnn_ptb.py

alexv1247 commented 5 years ago

You could hook up the RNN sequence output with a (time-distributed) dense variational and then a distribution output.

@kevinykuo thanks for the advice. I am new to Bayesian deep learning, so I am wondering: is this the same approach Kyle Dorman used in the blog post I posted before?


alexv1247 commented 5 years ago

+1 to @kevinykuo. In addition, you can try combining the RNN sequence output with tfp.sts: either use the output as a design matrix in tfp.sts.*LinearRegression, or do something like mu = rnn_output + sts_model.make_state_space_model and plug mu into a distribution (e.g., Gaussian). Would be interesting to see what works best!

I want to build a classification model. From what I've read in the docs about the tfp.sts models, they are made for regression tasks, so it seems rather unintuitive to use them for classification.

behdadahmadi commented 5 years ago

@alexv1247 @kevinykuo @junpenglao @JP-MRPhys Thank you so much. I wanted to use this for stock price prediction, but I ran into another issue with LSTMs (and ANNs in general): the predictions lag behind at each step. Do you know how to solve that?

cserpell commented 4 years ago

Hi all, I see that nobody has added anything in a year, but now I am trying to add weight uncertainty to an LSTM and wanted to use TensorFlow Probability. I was thinking of copying the Keras LSTM code and then changing the weights to variational distributions, adding the corresponding losses. Would that direct approach work? Did anyone implement such a recurrent layer within TensorFlow Probability so far? I am not sure the @kevinykuo solution addresses the weight-uncertainty problem inside the LSTM blocks.

krzysztofrusek commented 4 years ago

Hi,

I also wanted to implement a Bayesian RNN, and here is what I have found so far:

This is a custom training loop that can make any Keras model Bayesian and fits a surrogate posterior by variational inference. Note that the gradient of the posterior samples with respect to the posterior parameters is manually chained to the gradient with respect to the model's weights, because tf.assign breaks the gradient flow.

This is just a proof of concept that I would like to make more mature. A more elegant approach would be to use tf.variable_creator_scope to replace every variable in the Keras model with tfp.experimental.nn.util.RandomVariable (a rough sketch of that interception mechanism follows the code below).

I would love to hear any comments from the TFP team about this approach. Also, I could make a pull request with a more polished TFP example if you are interested in such a contribution.


import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions


def _make_posterior(v):
    # Mean-field normal posterior, initialized at the variable's current value.
    n = len(v.shape)
    return tfd.Independent(
        tfd.Normal(loc=tf.Variable(v),
                   scale=tfp.util.TransformedVariable(
                       0.2 + tf.zeros_like(v), tfp.bijectors.Softplus())),
        reinterpreted_batch_ndims=n)


def _make_prior(posterior):
    # Zero-mean normal prior with the same event shape as the posterior.
    n = len(posterior.event_shape)
    return tfd.Independent(
        tfd.Normal(tf.zeros(posterior.event_shape), 3.),
        reinterpreted_batch_ndims=n)


def fit_vi(model, data):
    vars = model.trainable_variables
    posterior = tfd.JointDistributionSequential(
        [_make_posterior(v) for v in vars])
    prior = tfd.JointDistributionSequential(
        [_make_prior(m) for m in posterior.model])

    losses = []
    kls = []
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)

    @tf.function()
    def train(x, y):
        with tf.GradientTape(persistent=True) as tape:
            theta = posterior.sample()

            # tf.assign breaks the gradient flow, so the chain rule is applied
            # manually below via the sample and KL gradients.
            with tf.control_dependencies(
                    [v.assign(s) for v, s in zip(vars, theta)]):
                yhat = model(x)
            loss = tf.reduce_mean(tf.math.squared_difference(y, yhat))
            kl = posterior.kl_divergence(prior) / 3000.
            kls.append(kl)
            losses.append(loss)
        grad = tape.gradient(loss, vars)

        # Each posterior factor has two trainable variables (loc and scale),
        # so every model-weight gradient is duplicated to line up with
        # posterior.variables.
        grad2 = []
        for g in grad:
            grad2.append(g)
            grad2.append(g)

        sample_grad = tape.gradient(theta, posterior.variables)
        kl_grad = tape.gradient(kl, posterior.variables)

        final_grad = [g1 * g2 + g3
                      for g1, g2, g3 in zip(grad2, sample_grad, kl_grad)]

        opt.apply_gradients(zip(final_grad, posterior.variables))

    for x, y in data:
        train(x, y)
    return posterior
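
The tf.variable_creator_scope route mentioned above could start from something like the sketch below, which only shows the interception mechanism; the actual replacement with RandomVariable is left out:

import tensorflow as tf

def record_variables(collected):
    # A creator that lets Keras build its variables as usual but records
    # them, so they could later be tied to a surrogate posterior.
    def creator(next_creator, **kwargs):
        var = next_creator(**kwargs)
        collected.append(var)
        return var
    return creator

collected = []
with tf.variable_creator_scope(record_variables(collected)):
    lstm = tf.keras.layers.LSTM(8)
    lstm.build(tf.TensorShape([None, 10, 4]))

print([v.name for v in collected])  # kernel, recurrent_kernel, bias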

cserpell commented 4 years ago

Thanks for your help, I will have a look. I have already tested Monte Carlo dropout and it works, though it seems hard to keep the same dropout mask when calling the LSTM several times, as happens when generating sequential data step by step.

cserpell commented 4 years ago

I managed to modify the LSTM code from tensorflow.python.keras.layers, replacing the variable weights with posterior and prior distributions. I could not add the sampling and the loss in the call method, because it is called at every recurrence step. Instead, I added the sampling process, and the loss, in an auxiliary method called just after _maybe_reset_cell_dropout_mask, which clears the current dropout mask, on the assumption that it runs once at the beginning of the recurrence, so the same weight sample is used for every step.

Unfortunately, during training the loss fluctuates heavily, even with very small learning rates. I have been playing with ways to parameterize the scale of the normal prior and posterior distributions. I will ping back if I get it working, to share what I have learnt.

krzysztofrusek commented 4 years ago

Also, I have found a bug in my approach: it only works with scalar variables, so I probably messed up something with the gradient. Any help with this would be appreciated.

cserpell commented 4 years ago

Following the advice in #703, I multiplied the variables by small constants inside the normal posteriors, and now training runs without diverging. It does not converge to anything good yet, but it seems to work. Changes in the posterior and prior variables probably shift the output distribution a lot, due to the sequential (deep) nature of the network, so it is hard for the model to learn; the small constants may help by damping those changes.
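
Roughly, the idea is to keep the trainable parameters O(1) and apply a small constant inside the distribution, so the sampled weights stay small. A sketch of what that might look like (the 0.01 constant and the softplus parameterization are just examples, not the exact #703 recipe):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def make_scaled_posterior(shape, c=0.01):
    # Trainable parameters stay near unit magnitude; the small constant c
    # scales down the resulting loc and scale of the posterior.
    loc_var = tf.Variable(tf.random.normal(shape))
    scale_var = tf.Variable(tf.zeros(shape))
    return tfd.Independent(
        tfd.Normal(
            loc=tfp.util.DeferredTensor(loc_var, lambda v: c * v),
            scale=tfp.util.DeferredTensor(
                scale_var, lambda v: c * tf.nn.softplus(v))),
        reinterpreted_batch_ndims=len(shape))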

krzysztofrusek commented 4 years ago

Thanks to the TF community, my problem is solved (tensorflow/tensorflow#40391).

The bug was in line loss = tf.reduce_mean(tf.math.squared_difference(y, yhat))

it should be

loss = tf.reduce_mean(tf.math.squared_difference(y, yhat[:,0]))

I tested this on dense layers and it works quite well. The RNN is much harder to train, yet I managed to get it to learn something.

brianwa84 commented 4 years ago

FYI @jvdillon (we may try to add LSTM layers to tfp.experimental.nn; this is a useful discussion).