There is also something else weird: this commit replaces the two stacked LSTM layers with a single one. Is that intentional? If so, the comment on line 51 should be removed.
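For context, a rough sketch of the difference in question (layer size and shapes below are placeholders for illustration, not necessarily the commit's actual values):

from keras.models import Sequential
from keras.layers import Dense, LSTM

maxlen, n_chars = 40, 51  # placeholder shapes, for illustration only

# Two stacked LSTMs (roughly what the script had before the commit):
stacked = Sequential()
stacked.add(LSTM(128, return_sequences=True, input_shape=(maxlen, n_chars)))
stacked.add(LSTM(128))
stacked.add(Dense(n_chars, activation='softmax'))

# A single LSTM (roughly what the commit switches to):
single = Sequential()
single.add(LSTM(128, input_shape=(maxlen, n_chars)))
single.add(Dense(n_chars, activation='softmax'))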
I saw this same issue on today's HEAD. Before that, something weird happened, making me suspect something went wrong during the training.
I'm running lstm_text_generation.py on a 4.2M text file with TensorFlow on CPU.
Observations:
On epoch 1, the loss dropped quickly to something like 2.2 and reached 1.5589 at the end of the epoch.
The text generated after epoch 1 somewhat resembled the input corpus.
On epoch 2, the loss continued dropping until it reached something like 1.4 at around 80% of the epoch. After that, the loss started to rapidly climb, reaching 1.89 at the end of the epoch.
The text generated after epoch 2 was significantly more gibberish than the one generated after epoch 1.
In the middle of the text generated after epoch 2, I see this warning:
"./lstm_text_generation.py:66: RuntimeWarning: divide by zero encountered in log preds = np.log(preds) / temperature"
I will try running with the same corpus again to see if this happens again. It will just take some time, because each epoch takes >1000 seconds. I can provide the corpus (of some Finnish text) if it helps.
Using TensorFlow backend.
corpus length: 4031908
total chars: 51
nb sequences: 1343956
Vectorization...
Build model...
On the second run, I saw this warning already during epoch 1, but the loss had not started to grow, so those might be two unrelated issues.
Ok, also on the second run the loss started to grow during epoch 2.
I also found something that looks fishy to me in the saved models, which I dumped using h5dump. Before the training started to worsen (i.e. after epoch 1 in my original example), all the values, and especially the optimizer weights, were small. After it went haywire, I'm seeing values up to around +4.8e13 in the optimizer_weights group.
Perhaps I should open a new issue for this, if this is unrelated?
I also see this happening on my own dataset: during the first epoch the loss decreases to around 2.9, but during epoch 2 it increases to 12.xxxx
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
I just received this error also on Ubuntu 16.04 using tensorflow-gpu 1.3 with a GTX 1080 Ti. My Python version is: Python 3.5.2 (default, Sep 10 2016, 08:21:44) [GCC 5.4.0 20160609] on linux
Here's my nvidia-smi output:
Wed Oct 4 20:32:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0      On |                  N/A |
|  0%   30C    P8     9W / 250W |    227MiB / 11164MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1187    G   /usr/lib/xorg/Xorg                             163MiB |
|    0      2031    G   compiz                                          61MiB |
+-----------------------------------------------------------------------------+
Training output:
Iteration 58
Epoch 1/1
200285/200285 [==============================] - 87s - loss: 14.2811
Iteration 59
Epoch 1/1
200285/200285 [==============================] - 87s - loss: 12.3760
Using TensorFlow backend.
lstm_text_generation.py:63: RuntimeWarning: divide by zero encountered in log
  preds = np.log(preds) / temperature
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7fa7bb303d30>>
Traceback (most recent call last):
  File "/home/jim/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 701, in __del__
TypeError: 'NoneType' object is not callable
I'm using a virtual environment; here's a pip list in case it helps:
(tensorflow) jim@leox4-gtx780:~/code/python/tensorflow/keras_nietzsche$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
apturl (0.5.2) beautifulsoup4 (4.4.1) bleach (1.5.0) blinker (1.3) Brlapi (0.6.4) chardet (2.3.0) checkbox-support (0.22) command-not-found (0.3) coverage (3.7.1) cryptography (1.2.3) cycler (0.10.0) decorator (4.0.6) defer (1.0.6) feedparser (5.1.3) guacamole (0.9.2) h5py (2.7.1) html5lib (0.9999999) httplib2 (0.9.1) idna (2.0) Jinja2 (2.8) Keras (2.0.8) language-selector (0.1) louis (2.6.4) lxml (3.5.0) Mako (1.0.3) Markdown (2.6.9) MarkupSafe (0.23) matplotlib (1.5.3) nose (1.3.7) numpy (1.13.3) oauthlib (1.0.3) onboard (1.2.0) padme (1.1.1) pexpect (4.0.1) Pillow (3.1.2) pip (9.0.1) pkg-resources (0.0.0) plainbox (0.25) protobuf (3.4.0) ptyprocess (0.5) pyasn1 (0.1.9) pycups (1.9.73) pycurl (7.43.0) pygobject (3.20.0) PyJWT (1.3.0) pyparsing (2.1.10) python-apt (1.1.0b1) python-dateutil (2.5.3) python-debian (0.1.27) python-systemd (231) pytz (2016.7) pyxdg (0.25) PyYAML (3.12) reportlab (3.3.0) requests (2.9.1) scipy (0.19.1) screen-resolution-extra (0.0.0) sessioninstaller (0.0.0) setuptools (36.5.0) six (1.11.0) ssh-import-id (5.5) system-service (0.3) tensorflow-gpu (1.3.0) tensorflow-tensorboard (0.1.7) ubuntu-drivers-common (0.0.0) ufw (0.35) unattended-upgrades (0.1) unity-scope-calculator (0.1) unity-scope-chromiumbookmarks (0.1) unity-scope-colourlovers (0.1) unity-scope-devhelp (0.1) unity-scope-firefoxbookmarks (0.1) unity-scope-gdrive (0.7) unity-scope-manpages (0.1) unity-scope-openclipart (0.1) unity-scope-texdoc (0.1) unity-scope-tomboy (0.1) unity-scope-virtualbox (0.1) unity-scope-yelp (0.1) unity-scope-zotero (0.1) urllib3 (1.13.1) usb-creator (0.3.0) virtualenv (15.0.1) Werkzeug (0.12.2) wheel (0.30.0) xdiagnose (3.8.4.1) xkit (0.0.0) XlsxWriter (0.7.3)
Any ideas on how I can resolve this problem? Thanks!
Have you tried to clip the value?
preds = np.log(preds.clip(min=minval)) / temperature
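For reference, a minimal sketch of how that clip would sit inside the example's sample() helper (minval is a placeholder; something like 1e-7 would do):

import numpy as np

def sample(preds, temperature=1.0, minval=1e-7):
    # helper to sample an index from a probability array,
    # with the clip applied so log() never sees an exact zero
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds.clip(min=minval)) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)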
Thanks for your response hdmetor, and sorry for the delay; I was away from my computer for a few days. I thought you had located the problem and assumed I had simply misread the stacktrace, which pointed to the sampling part of the program (I read the last line and probably mistakenly attributed it to Keras, which is very new to me). I ran it again unmodified and reproduced the problem, then made the change to the sample method as you suggested, changing:
preds = np.log(preds) / temperature
to:
preds = np.log(preds.clip(min=0.0000000001)) / temperature
but it failed again with the following stacktrace, which is strangely shallower than the one from the previous run I posted above. Maybe I'm doing something wrong and did not in fact apply your suggested change; the stacktrace from my initial post does seem to point to the np.log call, yet I still got the stacktrace below after making the change:
Iteration 59
Epoch 1/1
200285/200285 [==============================] - 87s - loss: 1.3193
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7efc83a03d30>>
Traceback (most recent call last):
File "/home/jim/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 701, in del
TypeError: 'NoneType' object is not callable
On the night I posted my initial problem, I ran the program again on a computer with a GTX 780 Ti, and it completed all 60 iterations on the Nietzsche text file without the divide-by-zero error. I then tried running it on the Shakespeare text file from Peter Norvig's site:
http://norvig.com/ngrams/shakespeare.txt
on the GTX 780 Ti machine. It seemed to get stuck in some kind of local minimum and produced nonsensical output consisting mostly of punctuation and whitespace characters; I killed it after maybe 20 iterations with no reduction in loss.
While re-running on the Nietzsche text tonight after making your suggested change, I re-read this thread and saw that sliedes and gewoonrik also experienced some strange behaviour with input other than the Nietzsche text. My experience was somewhat different, however: the loss did not increase as they described, but rather appeared to be stuck in a local minimum, staying constant over many iterations.
I'm very new to Keras and fairly new to machine learning in general, so I'm not sure how to proceed, or what exactly is causing these problems, but I'll try to debug it further when I get a chance and see if I can narrow down the problem. In any event, thanks very much again for your response hdmetor.
I have a limited understanding of the TensorFlow internals, so I can't be of any help with the stack trace. I would make one last try with a larger min (for example min=0.0001) to see if it's a rounding problem.
@jtmckinley you should set min with less decimal precision, as suggested by @hdmetor, since it's likely you are running 32-bit floating-point arithmetic, which ironically ends up clipping your clipping parameter.
It's likely the bug is caused by this weird cast to float64 in the sample function:
preds = np.asarray(preds).astype('float64')
I'm not sure what the underflow behavior is for numpy floats, but model.predict returns an array of tf.float32.
Also, numpy is set to ignore underflow, which may lead to errors not being propagated. You can check the error mode by running:
>>> np.geterr()
{'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
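If you want those errors to surface instead of being warned about or ignored, one option (a sketch, not something the example script does) is to tighten numpy's error mode around the sampling step:

import numpy as np

# Raise on divide-by-zero and warn on underflow, but only inside this block:
with np.errstate(divide='raise', under='warn'):
    preds = np.array([0.25, 0.75, 0.0])
    np.log(preds)  # raises FloatingPointError here instead of just warning and returning -inf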
Thanks for responding, hdmetor and jjangsangy. Sorry for my delayed response, but I was waiting to get a new machine up and running at work (another GTX 1080 Ti GPU; this one is an ASUS while my other one at home is an EVGA, but the same chipset of course). I reproduced the problem on this new GPU by saving the model at iteration 59 and then writing a generator Python script that loaded and used the model. I think jjangsangy is correct: the divide-by-zero error is not being caught in numpy, it propagates up into the TensorFlow code and results in a misleading traceback, and my clip value was itself getting clipped (LOL) due to float32 precision. I changed the clip value in the log to 0.0001 as suggested by hdmetor, and that prevented the error.
Now I need to figure out what causes the strange behavior with the Shakespeare corpus. Maybe it's the formatting: all the punctuation is separated by spaces in the file from Peter Norvig's site. I'll see if I can find a different version.
Thanks again very much to both of you for your help!
I trained the model, but I encountered a problem at line 65 because some values underflow in the log. Note that once the problem occurs, I only obtain nonsensical text (more like a collection of vowels).
The problem is solved if the values are clipped, as in:
preds = np.log(preds.clip(min=minval)) / temperature
Does anybody else have this problem? I can open a PR.
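For anyone who wants to reproduce the failure in isolation, here is a minimal illustration (the probability values are made up) of what happens once a predicted probability underflows to exactly 0.0, and how the clip avoids it:

import numpy as np

# made-up prediction vector in which one probability has underflowed to 0.0
preds = np.array([0.0, 0.2, 0.8], dtype='float32').astype('float64')
temperature = 0.5

np.log(preds) / temperature                 # RuntimeWarning: divide by zero -> [-inf, ...]
np.log(preds.clip(min=1e-7)) / temperature  # clipped: every entry stays finite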