ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

What's the proper way of handling NaN for WTTE-RNN? #33

Open aprotopopov opened 7 years ago

aprotopopov commented 7 years ago

Hi there. I have a problem with NaNs occurring while training an LSTM with wtte.output_lambda and wtte.loss.

First I suspected the loss function, which can possibly produce NaN values through K.log and the division by a in the discrete case:

from keras import backend as K

def loglik_discrete(y, u, a, b, epsilon=1e-35):
    hazard0 = K.pow((y + epsilon) / a, b)
    hazard1 = K.pow((y + 1.0) / a, b)

    loglikelihoods = u * \
        K.log(K.exp(hazard1 - hazard0) - 1.0) - hazard1
    return loglikelihoods
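For illustration, a small float32 numpy sketch (hypothetical magnitudes, not values from the actual run) of two ways this expression can go non-finite: the exp(...) - 1 term rounding to exactly zero so the log becomes -inf, or the exp overflowing to inf.

import numpy as np

# Hypothetical magnitudes chosen only to show the float32 mechanics.

# 1) Near-equal hazards (e.g. huge alpha): exp(x) - 1 rounds to exactly 0,
#    so the log becomes -inf.
tiny_diff = np.float32(1e-9)
print(np.exp(tiny_diff) - np.float32(1.0))          # 0.0
print(np.log(np.exp(tiny_diff) - np.float32(1.0)))  # -inf (with a runtime warning)

# 2) Very different hazards (e.g. tiny alpha, large beta and tte):
#    exp overflows to inf, which propagates through the loss.
huge_diff = np.float32(1000.0)
print(np.exp(huge_diff))                            # inf

# Either way the loss becomes non-finite and the next gradient update
# turns the weights into NaN.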

After I changed the loss to something like binary_crossentropy, no NaNs occurred, but a loss like that makes no sense for this problem.

Then I looked at weights for a simple model like:

import numpy as np
import wtte.wtte as wtte  # the wtte.wtte module from this repo
from keras.models import Sequential
from keras.layers import LSTM, Dense, Lambda
from keras.optimizers import Adam


def create_model(y_train_users, feature_cols):
    # Initialize alpha from the mean TTE, scaled by the mean of the
    # censoring-indicator column.
    tte_mean_train = np.nanmean(y_train_users[:, :, 0])
    y_censored = y_train_users[:, :, 1]

    init_alpha = -1.0 / np.log(1.0 - 1.0 / (tte_mean_train + 1.0))
    init_alpha = init_alpha / np.nanmean(y_censored)

    model = Sequential()
    model.add(LSTM(1, input_shape=(None, len(feature_cols)), activation='tanh', return_sequences=True))
    model.add(Dense(2))
    model.add(Lambda(wtte.output_lambda, arguments={"init_alpha": init_alpha,
                                                    "max_beta_value": 2.5}))

    loss = wtte.loss(kind='discrete', reduce_loss=False).loss_function
    lr = 0.001
    model.compile(loss=loss, optimizer=Adam(lr=lr, decay=0.00001, clipnorm=0.5))

    return model
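For completeness, a minimal usage sketch (the names x_train, x_val, y_val_users and the epochs/batch_size values are placeholders, not from my actual run); TerminateOnNaN is the Keras callback that stops training on the first non-finite loss:

from keras.callbacks import TerminateOnNaN

# x_train: (n_users, n_timesteps, n_features), y_train_users: (n_users, n_timesteps, 2)
model = create_model(y_train_users, feature_cols)

model.fit(x_train, y_train_users,
          epochs=60,
          batch_size=128,
          validation_data=(x_val, y_val_users),
          callbacks=[TerminateOnNaN()])  # stop as soon as the loss goes NaN/inf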

And the weights at the last two steps before the NaNs (nothing that obviously looks like it's about to blow up):

>>> model_weights[-2]
[array([[-0.10012437, -0.19260231, -0.23978625,  0.45771736],
        [-0.37926474,  0.01478457,  0.4888621 , -0.03959836]], dtype=float32),
 array([[-0.02832842, -0.26800382,  0.60015482, -0.11135387]], dtype=float32),
 array([ 0.52170336,  1.59952521,  0.17328304,  0.59602541], dtype=float32),
 array([[ 1.50127375,  2.28139687]], dtype=float32),
 array([ 1.09258926, -1.61024928], dtype=float32)]

>>> model_weights[-1]
[array([[ nan,  nan,  nan,  nan],
        [ nan,  nan,  nan,  nan]], dtype=float32),
 array([[ nan,  nan,  nan,  nan]], dtype=float32),
 array([ nan,  nan,  nan,  nan], dtype=float32),
 array([[        nan, -2.13727713]], dtype=float32),
 array([        nan, -1.76466596], dtype=float32)]

It seems that a in output_lambda causes the NaN, but I'm not sure where, because I didn't find any obvious place for it to blow up. When I changed it to a plain activation, e.g. sigmoid (which makes no sense for the current task), no NaNs occurred.

Also I noticed that you use a Masking layer and callbacks.TerminateOnNaN in the data-pipeline-template. Does that mean NaNs are still possible, and what is the actual reason they occur?

Sorry for the long post. Hoping for your help.

aprotopopov commented 7 years ago

Some comments about the data

I'm using per-day user records over 60 days, a sample of 20000 users with about 100 features. Train is 16000 users, validation 4000 users. For training I'm hiding the last 0.1 fraction (6 days).

Without filtering

Without filtering users by number of records (there are a lot of users with only 1 record) I get NaNs pretty fast. init_alpha ~ 107.5. Lowest val_loss ~ 0.3095 with reduce_loss=False.

[image: distribution of number of records per user]

[images: weight-watcher callback plots]

With filtering

With filtering users (>= 10 records), training becomes more stable even with more LSTM neurons; init_alpha ~ 2.77. Plots are with 1 LSTM neuron. Lowest val_loss ~ 0.7786. (The filter itself is sketched below the plots.)

[image: distribution of number of records per user, after filtering]

[images: weight-watcher callback plots, after filtering]
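A hedged pandas sketch of the filter itself; the dataframe and column names ('user_id', 'date') are hypothetical stand-ins for the real per-day records:

import pandas as pd

# Toy frame standing in for the real per-day records (column names are hypothetical).
df = pd.DataFrame({'user_id': [1, 1, 2, 2, 2, 3],
                   'date': pd.date_range('2017-01-01', periods=6)})

# Keep only users with at least min_records recorded days (>= 10 in the real data).
min_records = 2
records_per_user = df.groupby('user_id')['user_id'].transform('size')
df_filtered = df[records_per_user >= min_records]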

ragulpr commented 7 years ago

General answer

NaN is always a problem, and it's hard to debug. Some starting points:

1) Assume the problem is leaky truth in your data!

2) Initialization is important. Gradients explode if you're too far off, causing NaN.

3) More censored data leads to larger gradient steps, leading to a higher probability of exploding gradients (causing NaN).

4) Learning rate is dependent on the data and can be in magnitudes you didn't expect. High learning rates (w.r.t. the data) may cause NaN.

Some comments about what I've done about this:

If everything above is checked, your machine epsilon is a likely culprit. The warning that I put in there should be triggered in that case. Essentially, try calling keras.backend.set_epsilon(1e-08) to lower the epsilon.
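Concretely, that just means running the following before building and compiling the model (K.epsilon / K.set_epsilon are the standard Keras backend calls; the 1e-08 value is the one suggested above):

from keras import backend as K

print(K.epsilon())     # Keras default is 1e-07
K.set_epsilon(1e-08)   # lower it, as suggested above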

Analysis of your problem

I assume that by "number of records" you mean the number of observed datapoints (rows in the pandas dataframe), not the number of timesteps they were under observation. E.g. a 1-record datapoint may lead to hundreds of empty timesteps in the numpy array.

As 1-record customers cause instability, I really think the problem is the data. If they log in once and nothing happens, the algorithm will likely know for sure after a few timesteps that they aren't coming back/are dead, and this coincides with them being censored, making it safe to push the distribution towards infinity. Note that the biases did not go NaN (see the non-NaN weights) and that only the final layer is NaN. This hints that the upper layer hands over a representation that the output layer knows exactly what to do with (and the gradients explode with joy).

Edit: A hacky solution against obviously dead sequences is to remove the right part of the sequence when it's clearly inferable that they are dead, e.g. after x censored timesteps following signup; see the sketch below. I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with being able to predict that the timestep is censored!) by predicting the probability of censoring and using it to weight away censored datapoints, but I haven't had time to do a writeup on it yet!
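A rough numpy sketch of the hacky variant (my illustration, not the pipeline's code): zero out the per-timestep sample weights once a sequence has been censored for more than x steps after its last observed event.

import numpy as np

def deadness_weights(u, x=14):
    """Per-timestep sample weights that drop obviously dead tails.

    u : (n_users, n_timesteps) array, 1 where the timestep is uncensored.
    x : number of trailing censored timesteps to keep before calling the user dead.
    """
    weights = np.ones(u.shape, dtype=np.float32)
    for i in range(u.shape[0]):
        observed = np.flatnonzero(u[i] == 1)
        last_event = observed.max() if observed.size else -1
        weights[i, last_event + 1 + x:] = 0.0  # everything past x censored steps
    return weights

The resulting array can be passed to model.fit as sample_weight, together with sample_weight_mode='temporal' in model.compile.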

aprotopopov commented 7 years ago

Thanks for your responses and advice. By number of records I mean the number of days on which users have sessions. And the NaNs here are probably due to heavy censoring.

But I don't understand where, mathematically, the NaN occurs. Possible reasons for NaNs that I can see for now:

What other reasons could there be for NaN?


I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with that you can predict that the timestep is censored!) by predicting prob. of censoring and use it for weighting away censored datapoints but haven't had time to do a writeup on it yet!

It's a very interesting approach. It would be very helpful to see how you're doing that.


P.S. I think the condition for the lower-epsilon warning is a bit off. Should it be K.epsilon() >= 1e-07 instead of K.epsilon() <= 1e-07?
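i.e., as I read it, the intended check would look something like this (a sketch, not the library's actual code):

import warnings
from keras import backend as K

if K.epsilon() >= 1e-07:
    warnings.warn('Default K.epsilon() may be too high for the wtte loss; '
                  'consider keras.backend.set_epsilon(1e-08).')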

P.P.S. A hack to make the discrete loss function more robust:

import numpy as np
from keras import backend as K

def loglik_discrete(y, u, a, b, epsilon=1e-35, lowest_val=1e-45):
    # Keep a away from zero while preserving its sign.
    a = K.sign(a + lowest_val) * K.maximum(K.abs(a), lowest_val)
    hazard0 = K.pow((y + epsilon) / a, b)
    hazard1 = K.pow((y + 1.0) / a, b)

    # Clip the argument of the log so it can never be zero or negative.
    log_val = K.clip(K.exp(hazard1 - hazard0) - 1.0, lowest_val,
                     np.inf)
    loglikelihoods = u * K.log(log_val) - hazard1
    return loglikelihoods
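A sketch of wiring it in as a Keras loss (the layout, with y_true stacking (tte, u) and y_pred stacking (alpha, beta) along the last axis, is an assumption based on how wtte's own loss is used here):

from keras import backend as K

def loss_function(y_true, y_pred):
    # Assumed layout: y_true[..., 0] = tte, y_true[..., 1] = u (uncensored flag),
    # y_pred[..., 0] = alpha, y_pred[..., 1] = beta.
    y, u = y_true[..., 0], y_true[..., 1]
    a, b = y_pred[..., 0], y_pred[..., 1]
    # Per-timestep loss, like reduce_loss=False above.
    return -loglik_discrete(y, u, a, b)

model.compile(loss=loss_function, optimizer='adam')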
ragulpr commented 7 years ago

Some numerical problems I've been thinking about for the discrete case:

This does not cover what can happen in the gradients, which is another layer of complexity.

I think huge or tiny betas and alphas are up to the calling functions to take care of, i.e. having the option of applying this hack in output_lambda or via penalties.

I've never looked into whether alpha=0 is a problem; I would be very curious to hear if the clamp helps. I have done a good deal of experiments clipping alpha to keep it from getting huge, and this has not been helpful. Let me know if you want more info on this.

TODO:

Numerical instability is the problem with wtte that I've spent a huge amount of time on, so I'm going to be extremely careful about changing the current working implementation without convincing tests. Due to the complexity I've been unit testing the whole thing instead of writing small edge-case tests, but such tests would be extremely helpful.

ragulpr commented 7 years ago

Also pertinent to your problem: There's a very subtle philosophical problem at the first step of sequences that may expose the truth:

If a sequence is born due to an event, the first timestep will always have TTE=0. The data-pipeline template is (supposed to) take care of this by shifting and removing the first timestep. This is easy to miss if you're using your own or a modified pipeline; see the sketch below.
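For concreteness, one way to apply that fix on padded arrays (an illustration, not the template's exact code; x holds features, y holds (tte, u)):

import numpy as np

def drop_birth_timestep(x, y):
    """Lag the features one step and drop the first timestep, so the trivially
    known TTE=0 at the moment the sequence is born never reaches the loss.

    x : (n_users, n_timesteps, n_features)
    y : (n_users, n_timesteps, 2)  columns (tte, u)
    """
    x_lagged = np.roll(x, shift=1, axis=1)     # features now describe the previous step
    return x_lagged[:, 1:, :], y[:, 1:, :]     # discard the birth timestep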