That is odd. Is it possible to share the 1GB text file with sentences you are using? I will try to reproduce this on my end and check.
Sorry, I can't share the text file. It's a partial dump from a CMS and may contain privacy-related info. I have cleaned the text, but there may still be various random chars. Do you think it would work better if I remove everything but [a-ö] plus comma and punctuation?
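For concreteness, roughly the filter I have in mind (a sketch only; note that a literal [a-ö] range spans Unicode points U+0061..U+00F6, so it also matches stray symbols like '{' and '×' that sit between z and ö):

```python
import re

# Rough cleanup sketch: keep a-ö (on lowercased text), comma, period and whitespace.
# raw_text is a placeholder for the CMS dump contents.
cleaned = re.sub(r"[^a-ö,.\s]", "", raw_text.lower())
```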
No, the char set won't change anything.
I suggest you verify the following things about your x/y pairs:

```python
for words, tags in zip(x, y):
    if not words or not tags: print('!!!')       # empty example or empty label sequence
    if len(words) != len(tags): print('!!!')     # word/tag length mismatch
    for word, tag in zip(words, tags):           # inner loop added; the original third check
        if not word or not tag: print('!!!')     # used word/tag without defining them
```
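On top of those data checks, Keras ships a callback that aborts training the moment the loss goes NaN, which makes runs like this much cheaper to debug. A minimal sketch (the fit call is illustrative, since the actual training code isn't shown in this thread):

```python
import tensorflow as tf

# Stop immediately when the loss becomes NaN instead of finishing the epoch.
nan_guard = tf.keras.callbacks.TerminateOnNaN()

# Illustrative fit call; pass the callback into whatever fit/train loop you use.
# model.fit(x, y, batch_size=32, epochs=15, callbacks=[nan_guard])
```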
Tried the code (except for the third check, since `word` was not defined in the original snippet) and got no exclamation marks.
Also tried running the training on a much smaller subset of the training data (32MB), and then it finishes without NaN loss:
```
31250/31250 [==============================] - 7462s 239ms/step - loss: 3.8605
2020-06-12 11:24:32.638943: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
- f1: 90.71
             precision    recall  f1-score   support
       sent       0.88      0.94      0.91    349448
avg / total       0.88      0.94      0.91    349448
Epoch 00001: f1 improved from -inf to 0.90707, saving model to ./checkpoint
```
So there must be something wrong with my training data.
The memory issue remains, though: training still uses about 14GB of RAM with the smaller training set.
Next I'll lower n_examples to see if it makes any difference (rough sketch below).
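For reference, a rough sketch of what I mean by capping the number of examples, assuming one training sentence per line (the file name and the cap are made up):

```python
from itertools import islice

N_EXAMPLES = 100_000  # hypothetical cap; lower it until training fits in RAM

# Read only the first N_EXAMPLES lines instead of the whole dump.
with open("train.txt", encoding="utf-8") as f:
    examples = [line.strip() for line in islice(f, N_EXAMPLES)]
```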
Now it works! It stops by itself after 5 epochs; not sure if that's early stopping or the process getting killed.
```
- f1: 91.44
             precision    recall  f1-score   support
       sent       0.97      0.86      0.91     35069
avg / total       0.97      0.86      0.91     35069
Epoch 00001: f1 improved from -inf to 0.91444, saving model to ./checkpoint
Epoch 2/15
3125/3125 [==============================] - 758s 242ms/step - loss: 3.2791
- f1: 92.24
             precision    recall  f1-score   support
       sent       0.93      0.92      0.92     35069
avg / total       0.93      0.92      0.92     35069
Epoch 00002: f1 improved from 0.91444 to 0.92236, saving model to ./checkpoint
Epoch 3/15
3125/3125 [==============================] - 758s 243ms/step - loss: 3.2685
- f1: 91.67
             precision    recall  f1-score   support
       sent       0.94      0.90      0.92     35069
avg / total       0.94      0.90      0.92     35069
Epoch 00003: f1 did not improve from 0.92236
Epoch 4/15
3125/3125 [==============================] - 758s 242ms/step - loss: 3.2626
- f1: 91.59
             precision    recall  f1-score   support
       sent       0.93      0.90      0.92     35069
avg / total       0.93      0.90      0.92     35069
Epoch 00004: f1 did not improve from 0.92236
Epoch 5/15
3125/3125 [==============================] - 758s 243ms/step - loss: 3.2596
- f1: 91.48
             precision    recall  f1-score   support
       sent       0.94      0.89      0.91     35069
avg / total       0.94      0.89      0.91     35069
Epoch 00005: f1 did not improve from 0.92236
```
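For what it's worth, the "f1 improved ..., saving model" / "f1 did not improve" lines match the standard Keras checkpoint-plus-early-stopping pattern, which would explain a clean stop after epoch 5 when the best score came at epoch 2. A sketch using stock Keras callbacks (monitoring a custom f1 metric is my assumption about how the library wires this up):

```python
import tensorflow as tf

callbacks = [
    # Matches the "f1 improved ..., saving model to ./checkpoint" lines.
    tf.keras.callbacks.ModelCheckpoint(
        "./checkpoint", monitor="f1", mode="max", save_best_only=True),
    # With patience=3, a best score at epoch 2 followed by three
    # non-improving epochs ends the run after epoch 5, as in the log above.
    tf.keras.callbacks.EarlyStopping(monitor="f1", mode="max", patience=3),
]
```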
It does, however, segment now:

```
['under natten har det varit inbrott i ett kontor vid bredåkra kyrka en person gripen misstänkt för inbrottet', 'polisen skriver på sin facebooksida att en av deras hundförare lyckades spåra upp gärningsmannen och det tillgripna godset personen som är i trettiofemårsåldern greps och sitter nu', 'anhållen ingrid elfstråhle p fyra blekinge']
```

(Swedish news text: an overnight office break-in near Bredåkra church; police wrote on their Facebook page that a dog handler tracked down the perpetrator and the stolen goods; the suspect, around thirty-five, is now in custody. The last fragment is a reporter sign-off: Ingrid Elfstråhle, P4 Blekinge.)
Thanks for your help!
**Describe the bug and error messages (if any)**
Training loss is NaN after training for half an epoch. Is there a problem with my params?
Also, a batch_size of 32 is as high as I can go; anything above that OOMs.
These params consume all 32GB of RAM plus some swap. It may be related to this warning in the log: "UserWarning: Converting sparse IndexedSlices to a dense Tensor with 120289200 elements. This may consume a large amount of memory."
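That warning usually comes from the gradient of an embedding lookup (an IndexedSlices) being converted to a dense tensor. For scale, assuming float32 gradients, the tensor it mentions is close to half a gigabyte per copy, before any optimizer slot variables multiply it:

```python
elements = 120_289_200               # from the warning
bytes_per_float32 = 4
print(elements * bytes_per_float32 / 1024**2)  # ~458.9 MiB per dense gradient copy
```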
**Hardware**
Intel i7, 32GB RAM + 2080 Ti 11GB
**The code snippet which gave this error**
Training code:
Log:
**Specify versions of the following libraries**
**Expected behavior**
I was hoping the model would improve rather than hit NaN loss.
**Screenshots**
Nope