Loss dropping to 1e-6 - Githubissues

ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction

MIT License

767 stars 187 forks source link

Loss dropping to 1e-6 #49

Open taxus13 opened 6 years ago

taxus13 commented 6 years ago

Hi, thank you for your framework. I am trying to use it for charge event prediction at charging stations. For this, I have downsampled the data to 5min steps and pass up to one week history to the network.

This is my network (copied from one of your examples)

reduce_lr = callbacks.ReduceLROnPlateau(monitor='loss', factor  =0.5, patience=50, verbose=0, mode='auto', min_delta=0.0001, 
                                        cooldown=0, min_lr=1e-8)

nanterminator = callbacks.TerminateOnNaN()

model = Sequential()
#Dont need to specify input_shape=(n_timesteps, n_features) since keras uses dynamic rnn by default
model.add(CuDNNLSTM(10, input_shape=(None, n_features),return_sequences=True))
model.add(BatchNormalization(axis=-1, momentum=0.95, epsilon=0.01))
model.add(CuDNNLSTM(20, return_sequences=True))
model.add(BatchNormalization(axis=-1, momentum=0.95, epsilon=0.01))
model.add(GaussianDropout(0.05))
model.add(CuDNNLSTM(5, return_sequences=True))
model.add(Dense(2))
model.add(Lambda(wtte.output_lambda, 
                 arguments={"init_alpha":init_alpha, 
                            "max_beta_value":4.0,
                            "scalefactor":1/np.log(5)}))  

loss = wtte.Loss(kind='discrete').loss_function

However, if I train the network, the loss sometimes drops to 1e-6:

Train on 12384 samples, validate on 12384 samples
Epoch 1/20
12384/12384 [==============================] - 5s 403us/step - loss: 0.1205 - val_loss: 1.0000e-06
Epoch 2/20
  927/12384 [=>............................] - ETA: 1s - loss: 1.0000e-06
C:\Python36\python-3.6.2.amd64\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning:

invalid value encountered in reduce

C:\Python36\python-3.6.2.amd64\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning:

invalid value encountered in reduce

11433/12384 [==========================>...] - ETA: 0s - loss: 1.0000e-06

Once it had this loss, the loss does not change anymore. I guess there is something with the data?

Sometimes it helps to change the kind of the Loss function from discrete to continouos (btw, when should I use discrete, when continouos?). What is the general problem here? I tried to reduce the network size, but I had no success.

ragulpr commented 6 years ago

Since WTTE 1.1 the loss is clipped per default to 1e-6 so NaN is synonymous. Working from the assumption that it's a NaN-problem, there's alot of reasons for it, see for example my response and link to https://github.com/ragulpr/wtte-rnn-examples/issues/2. What cures 99% of problems, use :

It's some kind of data problem.
"pretrain step" of the bias term in the output layer, see In [12]
Clipping log logikelihood e.g loss_fun = wtte.loss(kind='discrete',reduce_loss=False,clip_prob=1e-5).loss_function
Weight away invalid values (ex those outside mask if varying length sequences). You'll need to add reduce_loss=False as above and sample_weight_mode='temporal' when compiling model and then supply sample_weight tensor during training

If we assume it's not a numerical (NaN) problem and you actually achieve zero-loss that can only be done with perfect fit. If all your samples are censored (y[...,1]=0) then predicting infinity will achieve this. If you set clip_prob=None and still achieve small loss then that might be the problem.

When to use discrete and when to use continuous loss?

Continuous data is very rare. If you have very large range of time to event ex (0,10000) or different values of TTE then it can be considered as continuous but it's arbitrary. If you have time to event with zeros then it's not continuous. (Note that prob of continuous TTE=0 is 0 so log-prob -Inf).

Rule of thumb is that real world data is discrete, but if a histogram of TTE is very smooth with no zeros you can consider it as continuous. I've always used discrete in real world applications

taxus13 commented 6 years ago

Hi,

thank you for your response!

This is a histogram of my ttes, which is roughly Weibull distributed: hist_tte Pretraining leads to the following values:

(after  pretrain)    alpha:  54.145554   beta :  0.9117085```
The loss changes slightly from 4.5338 to 4.5277.

I changed the loss function to
`loss = wtte.loss(kind='discrete',reduce_loss=False,clip_prob=None).loss_function`

But this configurations leads to the following:

```Epoch 73/500
16704/16704 [==============================] - 6s 351us/step - loss: 3.8392 - val_loss: 4.3967
Epoch 74/500
10842/16704 [==================>...........] - ETA: 1s - loss: -inf  Batch 25: Invalid loss, terminating training
C:\Python36\python-3.6.2.amd64\lib\site-packages\keras\utils\generic_utils.py:338: RuntimeWarning:
invalid value encountered in multiply
C:\Python36\python-3.6.2.amd64\lib\site-packages\keras\utils\generic_utils.py:338: RuntimeWarning:
invalid value encountered in double_scalars
C:\Python36\python-3.6.2.amd64\lib\site-packages\numpy\core\_methods.py:29: RuntimeWarning:
invalid value encountered in reduce
C:\Python36\python-3.6.2.amd64\lib\site-packages\numpy\core\_methods.py:26: RuntimeWarning:
invalid value encountered in reduce

About my task: I have a very long sequence of events (about three or more months) with a sample time of 5 or 10 minutes. The goal is to predict the events of the next 24 hours. Currently I am using subsequences of two days oder one week for training. If I use two days for training, I have 16704 sequences with 576 samples each. Is this too much? The mean uncensored is 0.89.

Is there a value for the mean uncensored for which wtte-rnn works best?
Should the sequences overlap?

ragulpr commented 6 years ago

Thanks for the detailed response. You seem to have discrete nice data with very little censoring. Here's some thoughts!:

Is there a value for the mean uncensored for which wtte-rnn works best?
- No. But censored data contains less information (roughly said) than uncensored data so it'll overfit more easy. You have much more uncensored data than I usually work with.
Should the sequences overlap?
- This is a dataset dependent thing I can't say anything about sorry. It does make sense to calculate TE before splitting each long sequences (months) into subsequences (days).

[ ] Try without the recurrent layers (start simple! Maybe just one layer!)
[ ] Figure out if TTE or censoring indicators is perfectly predictable from data. I.e if truth is leaking.
[ ] Where did you apply mask?
[ ] Are you sure NaN is not comming from Batchnormalization? Ex small batchsize necessitates more momentum ex 0.995
[ ] Are you sure NaN is not comming from feature data?

Also, LSTM may work on long sequences (20k<<) but whether it makes sense to have as long sequences (like 576 as suggested) is a data/domain specific question sorry. NLP folks probably would say its alot, but it all depends. I usually use GRU for sequences up to 2000 timesteps, not sure how wise that is

taxus13 commented 6 years ago

Hi, thank you for your help!

My only features were the event indicator (x==1 when tte==0) and the time (scaled from 0 to 1).

Today I tried the following:
- Removing BatchNormalization and GaussianDropout
- Using only the event indicator as a feature
- With recurrent layers, I still got the issue
- With only Dense Layers, I got the nan-issue when using relu as activation function. sigmoid worked for me:
```
model.add(TimeDistributed(Dense(50, activation='sigmoid')))
model.add(TimeDistributed(Dense(200, activation='sigmoid')))
model.add(Dense(2))
model.add(Lambda(wtte.output_lambda, 
         arguments={"init_alpha":init_alpha, 
                    "max_beta_value":4.0,
                    "scalefactor":1/np.log(5)})
```
  However, this network does not learn much, alpha and beta remain mostly constant.
With only event indicators and the time the truth should not be leaked
As I am extracting my training sequences from one large sequence I do not have to apply masking. This is what the first 200 samples look like:

But I am wondering, if this approach is suitable for my task. For predicting the next event, I am using the last alpha and beta pair. With this pair I predict the tte to the next event by using the mode. This event is the starting point for the next prediction. Is this approach reasonable?

ragulpr commented 6 years ago

You can try:

model.add(TimeDistributed(Dense(50, activation='Relu')))
model.add(TimeDistributed(Dense(200, activation='Relu'))) ## Preferably TanH
model.add(Dense(2))
model.add(Lambda(wtte.output_lambda, 
                 arguments={"init_alpha":init_alpha, 
                            "max_beta_value":4.0,
                            "scalefactor":1/np.log(200)}) ####### Note scale it by pre-outputlayer width

Even though Relu will be inherently less stable. I suggest using TanH (or sigmoid or similar) for this layer but I think this is only small part of the problem. The problem is in the data.

If x==1 implies y==0 then you have truth leaking. You're in a discrete world so think feature event indicator x==1 as "Today there was an event" while target y==0 as "tomorrow there'll be an event". Hence you need to disalign features and target.

Whether your definitions are suitable all depends on what you're trying to predict. Maybe you need to elaborate on what predicting the next event means. Here you defined it as the mode, or "the point where it's most likely that the event should happen". Thing is, if it's a decreasing hazard (beta<=1) that point is always 0, or "right now is the most likely". It's not very informative or useful.

However you twist or turn predicting the next event is an arbitrary question. You could predict mean, median ("the point where in 50% of cases the event is predicted to have happened") or similar, or use prediction to randomly sample the current tte. What makes sense is dependent on application :)

I'm not sure what you mean by "starting point for next prediction". Are you trying to generate new sequences of events?

ragulpr commented 6 years ago

Also, since you don't seem to care about history <<1000 steps ago I suggest you use some dilated causal CNN instead of RNN! It's also much faster to train.

This is a wavenet-ish network with a 256 timestep memory

model = Sequential()
model.add(InputLayer(input_shape=(x_train.shape[1], n_features)))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=1, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=2, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=4, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=8, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=16, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=32, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=64, ))
model.add(Conv1D(filters=10,activation='tanh', kernel_size=2, strides=1, padding='causal', dilation_rate=128, ))
## Wtte-RNN part
model.add(TimeDistributed(Dense(2)))

taxus13 commented 6 years ago

Awesome, thank you very much! May I buy you a beer or something?

I fixed my data, now it looks like this:

timestamp	event marker	tte
...	...	...
2017-01-02 13:40:00	0.0	1.0
2017-01-02 13:50:00	0.0	0.0
2017-01-02 14:00:00	1.0	9.0
...	...	...

However, I still get nan-loss with and without recurrent layers. Your suggestion with convolutional layers works, but it also fails when I add an additional Conv1D with dilation_rate=256.

I would like to predict the events in the next 24h, or generating a new sequence of events. I am aware hat I am loosing the probability value of the tte, but that is ok. I was thinking about using the mode for this task, since it still works for beta<=0, but it gives me 84 events per day instead of 5 (in a test day)