aprotopopov opened 7 years ago
I'm using user records per day for 60 days, with a sample of 20000 users and about 100 features. Train is 16000 users, validation 4000 users. For training I'm hiding the last 0.1 fraction (6 days).
Without filtering users by number of records (there are a lot of users with 1 record) I get NaNs pretty fast. init_alpha ~ 107.5. Lowest val_loss ~ 0.3095 for reduce_loss=False.
[plot: users' number of records]
[plot: weight watcher callback]
With filtering of users (>= 10 records) training becomes more stable even with more LSTM neurons, init_alpha ~ 2.77. Plots are with 1 LSTM neuron. Lowest val_loss ~ 0.7786.
[plot: users' number of records]
[plot: weight watcher callback]
NaN is always a problem, and it's hard to debug. Some starting points:

1) Assume the problem is leaky truth in your data! (Beta blowing up can also be limited with max_beta.)
2) Initialization is important. Gradients explode if you're too far off, causing NaN.
3) More censored data leads to larger gradient steps, leading to a higher probability of exploding gradients (causing NaN).
4) Learning rate is dependent on the data and can be in magnitudes you didn't expect. High learning rates (w.r.t. the data) may cause NaN. (A generic Keras-level guard for points 2)-4) is sketched below.)
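For the gradient-related points 2)-4), a generic Keras-level guard looks roughly like this. This is a sketch, not something wtte-rnn does for you, and the numbers are placeholders:

```python
from keras.optimizers import Adam

# A modest learning rate plus gradient-norm clipping makes it less likely that
# a single exploding step pushes the weights all the way to NaN.
optimizer = Adam(lr=1e-3, clipnorm=1.0)  # clipvalue=0.5 is an alternative
# model.compile(loss=loss, optimizer=optimizer)  # with whatever model/loss you built
```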
Some comments about what I've done about this:
If everything above is checked, your machine epsilon is a likely culprit. The warning that I put in there should be flagged in that case. Essentially, try calling keras.backend.set_epsilon(1e-08) to lower epsilon.
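A minimal sketch, assuming the standard Keras backend API; call it before the model and loss are built so the new value is the one that ends up in the graph:

```python
from keras import backend as K

print(K.epsilon())    # Keras default is 1e-07
K.set_epsilon(1e-08)  # lower the fuzz factor the loss picks up at build time
print(K.epsilon())    # now 1e-08
```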
I assume by "number of records" you mean the number of observed datapoints (in the pandas dataframe), not the number of timesteps that they were under observation. E.g. a 1-record datapoint may lead to hundreds of empty timesteps in the numpy array.
As 1-record customers cause instability, I really think the problem is the data. If they log in once and nothing happens, the algorithm will likely know for sure after a few timesteps that they aren't coming back / are dead, coinciding with them being censored, which makes it safe to push the distribution towards infinity. Note that the biases did not go NaN (see the non-NaN weights) and that only the final layer is NaN. This hints that the upper layer hands over a representation that the output layer knows exactly what to do with (and the gradients explode with joy).
Edit: A hacky solution against obviously dead sequences is to remove the right part of the sequence when it's clearly inferable that they are dead, e.g. after x censored timesteps since signup (a rough sketch is below). I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with being able to predict that the timestep is censored!) by predicting the probability of censoring and using it to weight away censored datapoints, but I haven't had time to do a writeup on it yet!
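One way to implement that hack on padded numpy arrays. This is only a sketch: the function name, the mask_value convention and the cutoff are assumptions, not library functionality:

```python
import numpy as np

def truncate_dead_tail(x, events, keep_after_last_event=10, mask_value=-1.0):
    """If a user shows nothing but censored (event-free) timesteps for
    `keep_after_last_event` steps after their last event, mask out the rest
    of the sequence so a Masking layer skips it.

    x      : (n_users, n_timesteps, n_features) padded feature array
    events : (n_users, n_timesteps) 1 where an event occurred, else 0
    """
    x = x.copy()
    for i in range(x.shape[0]):
        event_idx = np.where(events[i] > 0)[0]
        last_event = event_idx[-1] if len(event_idx) else -1
        cutoff = last_event + 1 + keep_after_last_event
        x[i, cutoff:, :] = mask_value  # everything past the cutoff gets masked
    return x
```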
Thanks for your responses and advice. By number of records I mean the number of days on which users have sessions. And the NaNs here are probably due to heavy censoring.
But I didn't understand where the NaN occurs mathematically. Possible reasons for NaNs which I see for now:
What other reasons could there be for NaN?
> I've done successful testing with a non-hacky approach to this (using the fact that deadness coincides with that you can predict that the timestep is censored!) by predicting prob. of censoring and use it for weighting away censored datapoints but haven't had time to do a writeup on it yet!

It's a very interesting approach. It would be very helpful to see how you're doing that.
P.S. I think the condition for lowering epsilon is a bit wrong. Should it be K.epsilon() >= 1e-07 instead of K.epsilon() <= 1e-07?
P.P.S. Hacks to change the discrete loss function:

```python
import numpy as np
from keras import backend as K

def loglik_discrete(y, u, a, b, epsilon=1e-35, lowest_val=1e-45):
    # keep alpha away from exactly zero (avoids divide-by-zero below)
    a = K.sign(a + lowest_val) * K.maximum(K.abs(a), lowest_val)
    # cumulative hazards at y and y + 1; epsilon guards the y == 0 case
    hazard0 = K.pow((y + epsilon) / a, b)
    hazard1 = K.pow((y + 1.0) / a, b)
    # clip before the log so exp(hazard1 - hazard0) - 1 can never be exactly 0
    log_val = K.clip(K.exp(hazard1 - hazard0) - 1.0, lowest_val, np.inf)
    loglikelihoods = u * K.log(log_val) - hazard1
    return loglikelihoods
```
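A quick way to sanity-check the clipping on a few hand-picked edge cases, using K.eval on constants (the values below are just illustrative, not from my data):

```python
# edge cases: y == 0, tiny beta (cancellation in exp(h1 - h0) - 1), alpha >> y with beta >> 1
y = K.constant([0.0, 3.0,  0.0])
u = K.constant([1.0, 1.0,  1.0])
a = K.constant([2.0, 2.0,  1000.0])
b = K.constant([0.8, 1e-8, 5.0])

# with the clipping in place these come out as (large) finite negatives, not -inf or NaN
print(K.eval(loglik_discrete(y, u, a, b)))
```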
Some numerical problems I've been thinking about for the discrete case:

1) alpha = 0 leading to divide by zero.
2) y == 0 leading to log(0), since K.pow may be implemented as z^b = exp(log(z) * b). The y + epsilon is supposed to take care of this and does a good job at it.
3) alpha == Inf causing (y + epsilon) / a == 0, leading to log(0). Haven't been looking into this.
4) b << 1 and b >> 1. See how 0 == K.exp(hazard1 - hazard0) - 1.0 could happen whenever beta << 1, and/or when alpha >> y and beta >> 1. (A small numeric illustration of this point follows below.)

This does not cover what can happen in the gradients, which is another layer of complexity.
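A small plain-numpy illustration of point 4, using float32 to mirror the Keras default (just the arithmetic, not backend code):

```python
import numpy as np

# In float32, exp(d) rounds to exactly 1.0 once d is below roughly 6e-8, so
# exp(hazard1 - hazard0) - 1.0 becomes exactly 0 and the log is -inf.
d = np.float32(1e-10)                        # e.g. beta << 1, or alpha >> y with beta >> 1
print(np.exp(d) - np.float32(1.0))           # 0.0
print(np.log(np.exp(d) - np.float32(1.0)))   # -inf (numpy warns about log(0))

# How the difference gets that small: alpha >> y with beta >> 1
y, a, b = np.float32(0.0), np.float32(1000.0), np.float32(5.0)
h0 = ((y + np.float32(1e-35)) / a) ** b      # ~1e-190 -> underflows to 0.0
h1 = ((y + np.float32(1.0)) / a) ** b        # ~1e-15
print(h1 - h0)                               # ~1e-15, well below 6e-8
```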
I think huge or tiny betas and alphas are up to the calling functions to take care of, i.e. having the option of applying this hack to output_lambda or to penalties.
I've never looked into whether alpha = 0 is a problem; I'd be very curious to hear if handling it is helpful. I have done a whole lot of experiments clipping alpha from being huge and this has not been helpful. Let me know if you want more info on this.
Numerical instability is the problem with wtte that I've spent a huge amount of time on, so I'm going to be extremely careful about changing the current working implementation without convincing tests. Due to the complexity I've been unit-testing the whole thing instead of small edge-case tests, but edge-case tests like these would be extremely helpful.
Yes, the condition should be K.epsilon() >= 1e-07 so that it throws the right error :)

Also pertinent to your problem: there's a very subtle philosophical problem at the first step of sequences that may expose the truth. If a sequence is born due to an event, the first timestep will always have TTE = 0. The data pipeline template is (supposed to) take care of this by shifting & removing the first timestep. Easy to miss if you're using your own or a modified pipeline (a bare sketch of the idea is below).
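The bare idea, assuming features of shape (n_seq, T, n_features) and targets (TTE, u) of shape (n_seq, T, 2). This is not the template's code, just the gist:

```python
import numpy as np

def drop_first_timestep(x, y):
    # If every sequence is born because an event happened at t = 0, that first
    # timestep always has TTE == 0 and leaks the truth, so discard it for
    # training (the pipeline template additionally shifts features/targets).
    # x: (n_seq, T, n_features), y: (n_seq, T, 2) with y[..., 0] = TTE, y[..., 1] = u
    return x[:, 1:, :], y[:, 1:, :]
```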
Hi there. I have a problem causing NaN while training an LSTM with wtte.output_lambda and wtte.loss.

First I thought about loss_function, which possibly produces NaN values from K.log and from dividing by a in the discrete case:

After I changed to something like binary_crossentropy, no NaN occurred, but a loss like that makes no sense here. Then I looked at the weights for a simple model like:
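Roughly this kind of setup; a minimal sketch with placeholder sizes and init_alpha, where the output_lambda / loss usage follows the README example (from memory), so treat it as an approximation rather than my exact code:

```python
from keras.models import Sequential
from keras.layers import Dense, LSTM, Lambda, Masking
import wtte.wtte as wtte

n_features = 100
init_alpha = 1.0  # placeholder; estimated from the training data in practice

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(None, n_features)))
model.add(LSTM(1, activation='tanh', return_sequences=True))
model.add(Dense(2))
model.add(Lambda(wtte.output_lambda,
                 arguments={'init_alpha': init_alpha, 'max_beta_value': 4.0}))

loss = wtte.loss(kind='discrete', reduce_loss=False).loss_function
# temporal sample weights go with reduce_loss=False (as I understand it)
model.compile(loss=loss, optimizer='rmsprop', sample_weight_mode='temporal')
```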
And the weights over the last two steps before the NaNs (no obvious exploits):

It seems that a in output_lambda causes the NaN, but I'm not sure where, because I didn't find any possible exploit there. When I changed it to an activation, e.g. sigmoid (which doesn't make any sense for the current task), no NaN occurred.

Also I noticed that you use a Masking layer and callbacks.TerminateOnNaN in the data-pipeline-template. Does that mean that NaN is still possible, and what is the actual reason for the NaNs?

Sorry for the long post. Hope for your help.