ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

What's the proper way of handling NaN for WTTE-RNN? #33

Open aprotopopov opened 7 years ago

aprotopopov commented 7 years ago

Hi there. I have a problem with NaNs occurring while training an LSTM with wtte.output_lambda and wtte.loss.

First I suspected the loss function, which can possibly produce NaN values through K.log and the division by a in the discrete case:

from keras import backend as K

def loglik_discrete(y, u, a, b, epsilon=1e-35):
    hazard0 = K.pow((y + epsilon) / a, b)
    hazard1 = K.pow((y + 1.0) / a, b)

    loglikelihoods = u * \
        K.log(K.exp(hazard1 - hazard0) - 1.0) - hazard1
    return loglikelihoods
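For illustration, a small float32 numpy sketch (hypothetical magnitudes, not values from the actual run) of two ways this expression can go non-finite: the exp(...) - 1 term rounding to exactly zero so the log becomes -inf, or the exp overflowing to inf.

import numpy as np

# Hypothetical magnitudes chosen only to show the float32 mechanics.

# 1) Near-equal hazards (e.g. huge alpha): exp(x) - 1 rounds to exactly 0,
#    so the log becomes -inf.
tiny_diff = np.float32(1e-9)
print(np.exp(tiny_diff) - np.float32(1.0))          # 0.0
print(np.log(np.exp(tiny_diff) - np.float32(1.0)))  # -inf (with a runtime warning)

# 2) Very different hazards (e.g. tiny alpha, large beta and tte):
#    exp overflows to inf, which propagates through the loss.
huge_diff = np.float32(1000.0)
print(np.exp(huge_diff))                            # inf

# Either way the loss becomes non-finite and the next gradient update
# turns the weights into NaN.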

After I changed the loss to something like binary_crossentropy, no NaNs occurred, but a loss like that makes no sense for this problem.

Then I looked at weights for a simple model like:

import numpy as np
import wtte.wtte as wtte  # the wtte.wtte module from this repo
from keras.models import Sequential
from keras.layers import LSTM, Dense, Lambda
from keras.optimizers import Adam


def create_model(y_train_users, feature_cols):
    # Initialize alpha from the mean TTE, scaled by the mean of the
    # censoring-indicator column.
    tte_mean_train = np.nanmean(y_train_users[:, :, 0])
    y_censored = y_train_users[:, :, 1]

    init_alpha = -1.0 / np.log(1.0 - 1.0 / (tte_mean_train + 1.0))
    init_alpha = init_alpha / np.nanmean(y_censored)

    model = Sequential()
    model.add(LSTM(1, input_shape=(None, len(feature_cols)), activation='tanh', return_sequences=True))
    model.add(Dense(2))
    model.add(Lambda(wtte.output_lambda, arguments={"init_alpha": init_alpha,
                                                    "max_beta_value": 2.5}))

    loss = wtte.loss(kind='discrete', reduce_loss=False).loss_function
    lr = 0.001
    model.compile(loss=loss, optimizer=Adam(lr=lr, decay=0.00001, clipnorm=0.5))

    return model
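For completeness, a minimal usage sketch (the names x_train, x_val, y_val_users and the epochs/batch_size values are placeholders, not from my actual run); TerminateOnNaN is the Keras callback that stops training on the first non-finite loss:

from keras.callbacks import TerminateOnNaN

# x_train: (n_users, n_timesteps, n_features), y_train_users: (n_users, n_timesteps, 2)
model = create_model(y_train_users, feature_cols)

model.fit(x_train, y_train_users,
          epochs=60,
          batch_size=128,
          validation_data=(x_val, y_val_users),
          callbacks=[TerminateOnNaN()])  # stop as soon as the loss goes NaN/inf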

And the weights at the last two steps before the NaNs (nothing that obviously looks like it's about to blow up):

>>> model_weights[-2]
[array([[-0.10012437, -0.19260231, -0.23978625,  0.45771736],
        [-0.37926474,  0.01478457,  0.4888621 , -0.03959836]], dtype=float32),
 array([[-0.02832842, -0.26800382,  0.60015482, -0.11135387]], dtype=float32),
 array([ 0.52170336,  1.59952521,  0.17328304,  0.59602541], dtype=float32),
 array([[ 1.50127375,  2.28139687]], dtype=float32),
 array([ 1.09258926, -1.61024928], dtype=float32)]

>>> model_weights[-1]
[array([[ nan,  nan,  nan,  nan],
        [ nan,  nan,  nan,  nan]], dtype=float32),
 array([[ nan,  nan,  nan,  nan]], dtype=float32),
 array([ nan,  nan,  nan,  nan], dtype=float32),
 array([[        nan, -2.13727713]], dtype=float32),
 array([        nan, -1.76466596], dtype=float32)]

It seems that a in output_lambda causes the NaN, but I'm not sure where, because I didn't find any obvious place for it to blow up. When I changed it to a plain activation, e.g. sigmoid (which makes no sense for the current task), no NaNs occurred.

Also I noticed that you use a Masking layer and callbacks.TerminateOnNaN in the data-pipeline-template. Does that mean NaNs are still possible, and what is the actual reason they occur?

Sorry for the long post. Hoping for your help.

aprotopopov commented 7 years ago

Some comments about the data

I'm using per-day user records over 60 days, a sample of 20000 users with about 100 features. Train is 16000 users, validation 4000 users. For training I'm hiding the last 0.1 fraction (6 days).

Without filtering

Without filtering users by number of records (there are a lot of users with only 1 record) I get NaNs pretty fast. init_alpha ~ 107.5. Lowest val_loss ~ 0.3095 with reduce_loss=False.

[image: distribution of number of records per user]

[images: weight-watcher callback plots]

With filtering

With filtering users (>= 10 records), training becomes more stable even with more LSTM neurons; init_alpha ~ 2.77. Plots are with 1 LSTM neuron. Lowest val_loss ~ 0.7786. (The filter itself is sketched below the plots.)

[image: distribution of number of records per user, after filtering]

[images: weight-watcher callback plots, after filtering]
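A hedged pandas sketch of the filter itself; the dataframe and column names ('user_id', 'date') are hypothetical stand-ins for the real per-day records:

import pandas as pd

# Toy frame standing in for the real per-day records (column names are hypothetical).
df = pd.DataFrame({'user_id': [1, 1, 2, 2, 2, 3],
                   'date': pd.date_range('2017-01-01', periods=6)})

# Keep only users with at least min_records recorded days (>= 10 in the real data).
min_records = 2
records_per_user = df.groupby('user_id')['user_id'].transform('size')
df_filtered = df[records_per_user >= min_records]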

ragulpr commented 7 years ago

General answer

NaN is always a problem, and it's hard to debug. Some starting points:

1) Assume the problem is leaky truth in your data!

2) Initialization is important. Gradients explode if you're too far off, causing NaN.

3) More censored data leads to larger gradient steps, leading to a higher probability of exploding gradients (causing NaN).

4) Learning rate is dependent on the data and can be in magnitudes you didn't expect. High learning rates (w.r.t. the data) may cause NaN.

Some comments about what I've done about this:

If everything above is checked, your machine epsilon is a likely culprit. The warning that I put in there should be triggered in that case. Essentially, try calling keras.backend.set_epsilon(1e-08) to lower the epsilon.
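Concretely, that just means running the following before building and compiling the model (K.epsilon / K.set_epsilon are the standard Keras backend calls; the 1e-08 value is the one suggested above):

from keras import backend as K

print(K.epsilon())     # Keras default is 1e-07
K.set_epsilon(1e-08)   # lower it, as suggested above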

Analysis of your problem

I assume that by "number of records" you mean the number of observed datapoints (rows in the pandas dataframe), not the number of timesteps they were under observation. E.g. a 1-record datapoint may lead to hundreds of empty timesteps in the numpy array.

As 1-record customers cause instability, I really think the problem is the data. If they log in once and nothing happens, the algorithm will likely know for sure after a few timesteps that they aren't coming back/are dead, and this coincides with them being censored, making it safe to push the distribution towards infinity. Note that the biases did not go NaN (see the non-NaN weights) and that only the final layer is NaN. This hints that the upper layer hands over a representation that the output layer knows exactly what to do with (and the gradients explode with joy).

Edit: A hacky solution against obviously dead sequences is to remove the right part of the sequence when it's clearly inferable that they are dead, e.g. after x censored timesteps following signup; see the sketch below. I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with being able to predict that the timestep is censored!) by predicting the probability of censoring and using it to weight away censored datapoints, but I haven't had time to do a writeup on it yet!
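A rough numpy sketch of the hacky variant (my illustration, not the pipeline's code): zero out the per-timestep sample weights once a sequence has been censored for more than x steps after its last observed event.

import numpy as np

def deadness_weights(u, x=14):
    """Per-timestep sample weights that drop obviously dead tails.

    u : (n_users, n_timesteps) array, 1 where the timestep is uncensored.
    x : number of trailing censored timesteps to keep before calling the user dead.
    """
    weights = np.ones(u.shape, dtype=np.float32)
    for i in range(u.shape[0]):
        observed = np.flatnonzero(u[i] == 1)
        last_event = observed.max() if observed.size else -1
        weights[i, last_event + 1 + x:] = 0.0  # everything past x censored steps
    return weights

The resulting array can be passed to model.fit as sample_weight, together with sample_weight_mode='temporal' in model.compile.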

aprotopopov commented 7 years ago

Thanks for your responses and advice. By number of records I mean the number of days on which users have sessions. And the NaNs here are probably due to heavy censoring.

But I don't understand where, mathematically, the NaN occurs. Possible reasons for NaNs that I can see for now:

What other reasons could there be for NaN?


I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with that you can predict that the timestep is censored!) by predicting prob. of censoring and use it for weighting away censored datapoints but haven't had time to do a writeup on it yet!

It's a very interesting approach. It would be very helpful to see how you're doing that.


P.S. I think the condition for the lower-epsilon warning is a bit off. Should it be K.epsilon() >= 1e-07 instead of K.epsilon() <= 1e-07?
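i.e., as I read it, the intended check would look something like this (a sketch, not the library's actual code):

import warnings
from keras import backend as K

if K.epsilon() >= 1e-07:
    warnings.warn('Default K.epsilon() may be too high for the wtte loss; '
                  'consider keras.backend.set_epsilon(1e-08).')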

P.P.S. A hack to make the discrete loss function more robust:

import numpy as np
from keras import backend as K

def loglik_discrete(y, u, a, b, epsilon=1e-35, lowest_val=1e-45):
    # Keep a away from zero while preserving its sign.
    a = K.sign(a + lowest_val) * K.maximum(K.abs(a), lowest_val)
    hazard0 = K.pow((y + epsilon) / a, b)
    hazard1 = K.pow((y + 1.0) / a, b)

    # Clip the argument of the log so it can never be zero or negative.
    log_val = K.clip(K.exp(hazard1 - hazard0) - 1.0, lowest_val,
                     np.inf)
    loglikelihoods = u * K.log(log_val) - hazard1
    return loglikelihoods
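A sketch of wiring it in as a Keras loss (the layout, with y_true stacking (tte, u) and y_pred stacking (alpha, beta) along the last axis, is an assumption based on how wtte's own loss is used here):

from keras import backend as K

def loss_function(y_true, y_pred):
    # Assumed layout: y_true[..., 0] = tte, y_true[..., 1] = u (uncensored flag),
    # y_pred[..., 0] = alpha, y_pred[..., 1] = beta.
    y, u = y_true[..., 0], y_true[..., 1]
    a, b = y_pred[..., 0], y_pred[..., 1]
    # Per-timestep loss, like reduce_loss=False above.
    return -loglik_discrete(y, u, a, b)

model.compile(loss=loss_function, optimizer='adam')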
ragulpr commented 7 years ago

Some numerical problems I've been thinking about for the discrete case:

This does not cover what can happen in the gradients, which is another layer of complexity.

I think huge or tiny betas and alphas are up to the calling functions to take care of, i.e. having the option of applying this hack in output_lambda or via penalties.

I've never looked into whether alpha=0 is a problem; I would be very curious to hear if the clamp helps. I have done a good deal of experiments clipping alpha to keep it from getting huge, and this has not been helpful. Let me know if you want more info on this.

TODO:

Numerical instability is the problem with wtte that I've spent a huge amount of time on, so I'm going to be extremely careful about changing the current working implementation without convincing tests. Due to the complexity I've been unit testing the whole thing instead of writing small edge-case tests, but such tests would be extremely helpful.

ragulpr commented 7 years ago

Also pertinent to your problem: There's a very subtle philosophical problem at the first step of sequences that may expose the truth:

If a sequence is born due to an event, the first timestep will always have TTE=0. The data-pipeline template is (supposed to) take care of this by shifting and removing the first timestep. This is easy to miss if you're using your own or a modified pipeline; see the sketch below.
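For concreteness, one way to apply that fix on padded arrays (an illustration, not the template's exact code; x holds features, y holds (tte, u)):

import numpy as np

def drop_birth_timestep(x, y):
    """Lag the features one step and drop the first timestep, so the trivially
    known TTE=0 at the moment the sequence is born never reaches the loss.

    x : (n_users, n_timesteps, n_features)
    y : (n_users, n_timesteps, 2)  columns (tte, u)
    """
    x_lagged = np.roll(x, shift=1, axis=1)     # features now describe the previous step
    return x_lagged[:, 1:, :], y[:, 1:, :]     # discard the birth timestep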