notAI-tech / deepsegment

A sentence segmenter that actually works!
http://bpraneeth.com/projects
GNU General Public License v3.0

Training loss: nan #32

Closed: BoneGoat closed this issue 4 years ago

BoneGoat commented 4 years ago

**Describe the bug and error messages (if any)**
Training loss is nan after training for half an epoch. Is there a problem with my params?

Also, a batch_size of 32 is as high as I can go; anything above that will OOM.

These params consume all 32GB of RAM plus some swap. This may be related to the warning in the log: "UserWarning: Converting sparse IndexedSlices to a dense Tensor with 120289200 elements. This may consume a large amount of memory."
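
For context, that warning typically comes from the gradient of an embedding lookup being converted from sparse IndexedSlices to a dense tensor during backprop. A rough back-of-the-envelope for one dense copy, assuming the elements are float32 (an assumption, not stated in the log):

# Rough size of one dense copy of the 120289200-element tensor from the warning.
# Assumes float32 (4 bytes per element); peak usage is higher, since the backward
# pass and the optimizer keep additional copies of this gradient.
elements = 120_289_200
bytes_per_float32 = 4
print(f"{elements * bytes_per_float32 / 1024**2:.0f} MB per dense copy")  # ~459 MB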

**Hardware**
Intel i7, 32GB RAM + 2080Ti 11GB

**The code snippet which gave this error**
Training code:

from deepsegment import train, generate_data

import unicodedata
import re
from tqdm import tqdm

lines = ...  # reading all lines from the 1GB text file

print('ok')

x, y = generate_data(lines[10000:], max_sents_per_example=6, n_examples=1000000)
vx, vy = generate_data(lines[:10000], max_sents_per_example=6, n_examples=100000)

train(x, y, vx, vy, epochs=15, batch_size=32, save_folder='./', glove_path='cc.sv.100.vec')
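
For completeness, a minimal sketch of the elided line-reading step; corpus.txt is a placeholder file name, and the assumption is that generate_data wants a flat list of single sentences, one per line:

# Hypothetical loader for the elided step above; 'corpus.txt' is a placeholder.
with open('corpus.txt', encoding='utf-8') as f:
    lines = [line.strip() for line in tqdm(f) if line.strip()]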

Log:

Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
W0611 11:19:13.705538 139626754037568 deepsegment.py:22] Tensorflow serving is not installed. Cannot be used with tesnorflow serving docker images.
W0611 11:19:13.705653 139626754037568 deepsegment.py:23] Run pip install tensorflow-serving-api==1.12.0 if you want to use with tf serving.
W0611 11:19:13.706187 139626754037568 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/deepsegment/train.py:9: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0611 11:19:13.706330 139626754037568 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/deepsegment/train.py:11: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2020-06-11 11:19:13.719930: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-11 11:19:13.726062: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-06-11 11:19:13.850172: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:13.851079: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4d4fe50 executing computations on platform CUDA. Devices:
2020-06-11 11:19:13.851095: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-11 11:19:13.863673: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4008000000 Hz
2020-06-11 11:19:13.864927: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a1f2d0 executing computations on platform Host. Devices:
2020-06-11 11:19:13.864978: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-06-11 11:19:13.865398: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:13.866497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:01:00.0
2020-06-11 11:19:13.869341: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-11 11:19:13.906978: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-11 11:19:13.925570: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-06-11 11:19:13.931635: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-06-11 11:19:13.972803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-06-11 11:19:13.999081: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-06-11 11:19:14.073585: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-11 11:19:14.073906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.075861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.077537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-06-11 11:19:14.078231: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-11 11:19:14.081715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 11:19:14.081795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-06-11 11:19:14.081834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-06-11 11:19:14.082829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.084705: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:19:14.086398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10309 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10126601/10126601 [02:57<00:00, 57064.71it/s]
ok

100% (1000000 of 1000000) |############################################################################################################################################################################| Elapsed Time: 0:01:23 Time:  0:01:23

100% (100000 of 100000) |##############################################################################################################################################################################| Elapsed Time: 0:00:06 Time:  0:00:06
2020-06-11 11:29:57.768082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.768525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:01:00.0
2020-06-11 11:29:57.768556: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-06-11 11:29:57.768567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-11 11:29:57.768577: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-06-11 11:29:57.768587: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-06-11 11:29:57.768596: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-06-11 11:29:57.768605: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-06-11 11:29:57.768615: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-11 11:29:57.768665: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.769086: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.769471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-06-11 11:29:57.769490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-11 11:29:57.769497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-06-11 11:29:57.769502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-06-11 11:29:57.769566: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.769987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-11 11:29:57.770382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10309 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
W0611 11:30:00.458791 139626754037568 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:3794: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py:90: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 120289200 elements. This may consume a large amount of memory.
  num_elements)
W0611 11:30:08.525202 139626754037568 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Epoch 1/15
2020-06-11 11:30:10.113686: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

    1/31250 [..............................] - ETA: 19:39:54 - loss: 3.1761
    2/31250 [..............................] - ETA: 11:03:17 - loss: 3.0586
    3/31250 [..............................] - ETA: 8:08:36 - loss: 3.3094 
    4/31250 [..............................] - ETA: 6:34:41 - loss: 3.4420
    5/31250 [..............................] - ETA: 5:38:11 - loss: 3.3396
    6/31250 [..............................] - ETA: 5:00:34 - loss: 3.1736
    7/31250 [..............................] - ETA: 4:38:08 - loss: 3.1815
    8/31250 [..............................] - ETA: 4:19:09 - loss: 3.2167
    9/31250 [..............................] - ETA: 4:06:03 - loss: 3.2543
   <Snipped>
15548/31250 [=============>................] - ETA: 1:08:47 - loss: 3.2829
15549/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2828
15550/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2828
15551/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2828
15552/31250 [=============>................] - ETA: 1:08:46 - loss: 3.2827
15553/31250 [=============>................] - ETA: 1:08:45 - loss: 3.2828
15554/31250 [=============>................] - ETA: 1:08:45 - loss: nan   
15555/31250 [=============>................] - ETA: 1:08:45 - loss: nan
15556/31250 [=============>................] - ETA: 1:08:45 - loss: nan
15557/31250 [=============>................] - ETA: 1:08:44 - loss: nan
15558/31250 [=============>................] - ETA: 1:08:44 - loss: nan
15559/31250 [=============>................] - ETA: 1:08:44 - loss: nan
<Snipped>
31248/31250 [============================>.] - ETA: 0s - loss: nan
31249/31250 [============================>.] - ETA: 0s - loss: nan
31250/31250 [==============================] - 8239s 264ms/step - loss: nan
2020-06-11 13:47:28.196777: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418

Epoch 00001: f1 improved from -inf to 0.00000, saving model to ./checkpoint
Epoch 2/15

    1/31250 [..............................] - ETA: 2:40:55 - loss: nan
    2/31250 [..............................] - ETA: 2:54:27 - loss: nan
    3/31250 [..............................] - ETA: 2:37:31 - loss: nan
    4/31250 [..............................] - ETA: 2:35:34 - loss: nan
    5/31250 [..............................] - ETA: 2:28:59 - loss: nan
    6/31250 [..............................] - ETA: 2:30:40 - loss: nan
    7/31250 [..............................] - ETA: 2:28:10 - loss: nan
    8/31250 [..............................] - ETA: 2:27:05 - loss: nan
    9/31250 [..............................] - ETA: 2:27:35 - loss: nan
   <Snipped>
31244/31250 [============================>.] - ETA: 1s - loss: nan
31245/31250 [============================>.] - ETA: 1s - loss: nan
31246/31250 [============================>.] - ETA: 1s - loss: nan
31247/31250 [============================>.] - ETA: 0s - loss: nan
31248/31250 [============================>.] - ETA: 0s - loss: nan
31249/31250 [============================>.] - ETA: 0s - loss: nan
31250/31250 [==============================] - 8214s 263ms/step - loss: nan
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418

Epoch 00002: f1 did not improve from 0.00000
Epoch 3/15

    1/31250 [..............................] - ETA: 2:16:22 - loss: nan
    2/31250 [..............................] - ETA: 2:17:49 - loss: nan
    3/31250 [..............................] - ETA: 2:28:30 - loss: nan
    4/31250 [..............................] - ETA: 2:54:15 - loss: nan
    5/31250 [..............................] - ETA: 2:46:57 - loss: nan
    6/31250 [..............................] - ETA: 2:53:14 - loss: nan
    7/31250 [..............................] - ETA: 2:44:51 - loss: nan
    8/31250 [..............................] - ETA: 2:38:22 - loss: nan
    9/31250 [..............................] - ETA: 2:34:27 - loss: nan
   <Snipped>
31246/31250 [============================>.] - ETA: 1s - loss: nan
31247/31250 [============================>.] - ETA: 0s - loss: nan
31248/31250 [============================>.] - ETA: 0s - loss: nan
31249/31250 [============================>.] - ETA: 0s - loss: nan
31250/31250 [==============================] - 8214s 263ms/step - loss: nan
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418

Epoch 00003: f1 did not improve from 0.00000
Epoch 4/15

    1/31250 [..............................] - ETA: 1:57:41 - loss: nan
    2/31250 [..............................] - ETA: 2:05:43 - loss: nan
    3/31250 [..............................] - ETA: 2:09:32 - loss: nan
    4/31250 [..............................] - ETA: 2:10:20 - loss: nan
    <Snipped>
31250/31250 [==============================] - 8214s 263ms/step - loss: nan
 - f1: 0.00
             precision    recall  f1-score   support

       sent       0.00      0.00      0.00    349418

avg / total       0.00      0.00      0.00    349418

Epoch 00004: f1 did not improve from 0.00000

Exited here.

**Specify versions of the following libraries**

  1. deepsegment / latest
  2. tensorflow-gpu / 1.14.0
  3. keras / 2.3.1
  4. docker image / tensorflow/tensorflow:1.14.0-gpu-py3-jupyter

**Expected behavior**
I was hoping the model would improve and not end up with a nan loss.

**Screenshots**
Nope.

bedapudi6788 commented 4 years ago

That is odd. Is it possible to share the 1GB text file with the sentences you are using? I will try to reproduce this on my end and check.

BoneGoat commented 4 years ago

Sorry, I can't share the text file. It's a partial dump from a CMS and may contain privacy-related info. I have cleaned the text, but there may still be various random chars. Do you think it would work better if I removed everything except [a-ö] plus commas and punctuation?

bedapudi6788 commented 4 years ago

> Do you think it would work better if I remove everything but [a-ö] + comma and punctuation?

No, the char set won't change anything.

I suggest you verify the following things:

for words, tags in zip(x, y):
    if not words or not tags: print('!!!')
    if len(words) != len(tags): print('!!!')
    if not word or not tag: print('!!!')
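
(The third check was presumably meant to run per token; a corrected sketch of just that check, assuming words and tags are parallel lists of token strings and their labels:)

for words, tags in zip(x, y):
    # per-token check: flag empty tokens or empty labels
    for word, tag in zip(words, tags):
        if not word or not tag: print('!!!')
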
BoneGoat commented 4 years ago

I tried the code (except for the 3rd if, because word is not defined) and got no exclamation marks.

I also tried running the training on a much smaller subset of the training data (32MB), and it finishes without nan loss:

31250/31250 [==============================] - 7462s 239ms/step - loss: 3.8605
2020-06-12 11:24:32.638943: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
 - f1: 90.71
             precision    recall  f1-score   support

       sent       0.88      0.94      0.91    349448

avg / total       0.88      0.94      0.91    349448

Epoch 00001: f1 improved from -inf to 0.90707, saving model to ./checkpoint

So there must be something wrong with my training data.
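
(A minimal sketch of the kind of check that might surface the problem lines in the full corpus; it assumes lines is the same list passed to generate_data, and the length threshold is arbitrary:)

# Flag lines that are empty, extremely long, or contain no letters at all;
# outliers like these in the larger corpus are a plausible source of the nan loss.
for i, line in enumerate(lines):
    if not line.strip() or len(line) > 1000 or not any(c.isalpha() for c in line):
        print(i, repr(line[:80]))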

The memory issue remains, though: training uses about 14GB of RAM even with the smaller training set.

Next I'll lower n_examples to see if it makes any difference.

BoneGoat commented 4 years ago

Now it works! It stops by itself after 5 epochs; I'm not sure if it's early stopping or if it gets killed.

 - f1: 91.44
             precision    recall  f1-score   support

       sent       0.97      0.86      0.91     35069

avg / total       0.97      0.86      0.91     35069

Epoch 00001: f1 improved from -inf to 0.91444, saving model to ./checkpoint
Epoch 2/15
3125/3125 [==============================] - 758s 242ms/step - loss: 3.2791
 - f1: 92.24
             precision    recall  f1-score   support

       sent       0.93      0.92      0.92     35069

avg / total       0.93      0.92      0.92     35069

Epoch 00002: f1 improved from 0.91444 to 0.92236, saving model to ./checkpoint
Epoch 3/15
3125/3125 [==============================] - 758s 243ms/step - loss: 3.2685
 - f1: 91.67
             precision    recall  f1-score   support

       sent       0.94      0.90      0.92     35069

avg / total       0.94      0.90      0.92     35069

Epoch 00003: f1 did not improve from 0.92236
Epoch 4/15
3125/3125 [==============================] - 758s 242ms/step - loss: 3.2626
 - f1: 91.59
             precision    recall  f1-score   support

       sent       0.93      0.90      0.92     35069

avg / total       0.93      0.90      0.92     35069

Epoch 00004: f1 did not improve from 0.92236
Epoch 5/15
3125/3125 [==============================] - 758s 243ms/step - loss: 3.2596
 - f1: 91.48
             precision    recall  f1-score   support

       sent       0.94      0.89      0.91     35069

avg / total       0.94      0.89      0.91     35069

Epoch 00005: f1 did not improve from 0.92236

It does, however, segment now:

['under natten har det varit inbrott i ett kontor vid bredåkra kyrka en person gripen misstänkt för inbrottet', 'polisen skriver på sin facebooksida att en av deras hundförare lyckades spåra upp gärningsmannen och det tillgripna godset personen som är i trettiofemårsåldern greps och sitter nu', 'anhållen ingrid elfstråhle p fyra blekinge']
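
(For reference, segmentation like the above goes through DeepSegment's segment() call; the sketch below shows the documented usage with the bundled English model, and loading the custom Swedish checkpoint instead means pointing the constructor at the files train() wrote to save_folder (the exact constructor arguments depend on the installed deepsegment version):)

from deepsegment import DeepSegment

# Documented usage with the bundled English model; to use the custom Swedish model,
# point the constructor at the checkpoint/utils/params files saved by train() above
# (the exact arguments vary between deepsegment versions).
segmenter = DeepSegment('en')
print(segmenter.segment('i am batman i live in gotham'))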

Thanks for your help!