ragulpr / wtte-rnn-examples

Examples of implementations of WTTE-RNN

Invalid loss; hello-world-datapipeline #2

Open KeepFloyding opened 6 years ago

KeepFloyding commented 6 years ago

Hello,

Thanks for sharing this awesome example of using wtte-rnn. When I try to run it, I get an invalid-loss error during the training phase. Playing around with it, training seems to be very sensitive to the initial value of alpha.

```
Epoch 35/100
1000/1000 [==============================] - 1s 1ms/step - loss: 1.4000 - val_loss: 1.9154
Epoch 36/100
1000/1000 [==============================] - 1s 1ms/step - loss: 1.3995 - val_loss: 1.9376
Epoch 37/100
1000/1000 [==============================] - 2s 2ms/step - loss: 1.3992 - val_loss: 1.9075
Epoch 38/100
 800/1000 [=======================>......] - ETA: 0s - loss: 1.3988
Batch 8: Invalid loss, terminating training
 900/1000 [==========================>...] - ETA: 0s - loss: nan
```

Do you happen to know why this may be?

Many thanks, Andris

ragulpr commented 6 years ago

Hi there, that's a correct observation. Initialization is by far the most important cause of exploding gradients. If the initial alpha is far from the scale of the data, the first updates take a huge gradient step in one direction, overshooting the target and/or causing numerical instability due to large magnitudes.
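A minimal sketch of that initialization heuristic, mirroring the pattern in the example notebooks and assuming the `wtte` Keras helpers (`wtte.output_lambda`, `wtte.loss`); `y_train` (time-to-event in channel 0, censoring indicator in channel 1) and `n_features` are placeholders for your own pipeline:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, GRU, Lambda
import wtte.wtte as wtte

# Heuristic from the example notebooks: start alpha near the geometric
# estimate of the mean observed TTE, inflated by the censoring rate, so the
# first gradient steps stay small and the loss stays finite.
tte_mean_train = np.nanmean(y_train[:, :, 0])   # mean time-to-event in training data
mean_u = np.nanmean(y_train[:, :, 1])           # fraction of uncensored steps
init_alpha = -1.0 / np.log(1.0 - 1.0 / (tte_mean_train + 1.0))
init_alpha = init_alpha / mean_u

model = Sequential()
model.add(GRU(10, input_shape=(None, n_features), return_sequences=True))
model.add(Dense(2))
# output_lambda rescales the two outputs to (alpha, beta), anchored at
# init_alpha and with beta capped at max_beta_value.
model.add(Lambda(wtte.output_lambda,
                 arguments={'init_alpha': init_alpha, 'max_beta_value': 4.0}))
model.compile(loss=wtte.loss(kind='discrete').loss_function, optimizer='adam')
```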

In 99.9% of cases, NaN at later stages of training is caused by errors in the data or the chosen architecture:

1) Ground truth is leaked/overfitted, so a perfect (infinite or zero) prediction is possible
2) Censoring is predictable (leading to infinite predictions)
3) Wrong magnitude of input data
4) Unbounded activation functions in the pre-output layer (ReLU or similar) leading to instability
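As a rough illustration of guards against points 3) and 4), assuming a Keras setup like the one above; `x_train` and `n_features` are placeholders, and the Weibull output `Lambda`/loss would be attached as in the previous snippet:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, GRU

# 3) Wrong magnitude of input data: standardize per feature so activations
#    (and therefore gradients) stay in a sane range.
mean = np.nanmean(x_train, axis=(0, 1), keepdims=True)
std = np.nanstd(x_train, axis=(0, 1), keepdims=True) + 1e-8
x_train = (x_train - mean) / std

# 4) Unbounded pre-output activations: keep the layers feeding the Weibull
#    output bounded (tanh rather than relu), and cap beta via max_beta_value
#    in output_lambda as shown earlier.
model = Sequential()
model.add(GRU(10, input_shape=(None, n_features),
              return_sequences=True, activation='tanh'))
model.add(Dense(2, activation='tanh'))  # bounded before rescaling to (alpha, beta)
```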

Recommended reading:

gabrielgonzaga commented 6 years ago

Hello,

This algorithm is awesome, thank you for sharing those examples!

I also had this problem, but using data-pipeline-template.ipynb. The model actually explodes in the first epoch:

```
Train on 1141 samples, validate on 171 samples
Epoch 1/200
 600/1141 [==============>...............] - ETA: 2s - loss: nan
Batch 1: Invalid loss, terminating training
```

I am just running the Jupyter notebook exactly as it is. The only difference is the tensorflow.csv: I am using one that I generated with the provided code (which may contain a couple more months of data). I tried filtering the new data to approximate the dataframe from the original run, but it still failed...

```
python 3.6.5
pandas 0.21.0
numpy 1.12.1
keras 2.1.6
theano 1.0.2
keras epsilon: 1e-08
```

Any ideas why that is happening, since I am following approximately the same steps as the example? I.e., I am not sure whether the topics discussed under 'Recommended reading' would apply here... please correct me if I am wrong.

Thank you very much!

Gabriel


EDIT:

Just saw that this problem was already addressed on the develop branch. It is working now! Thank you!

ragulpr commented 6 years ago

@gabrielgonzaga Yes, in those months I think some high-frequency committer churned or something, but yes, it suddenly exploded, lol. It'll be addressed in https://github.com/ragulpr/wtte-rnn/issues/41; until then, just find the right initial alpha.