NowayIndustries commented 7 years ago

Hey everyone,

I am in dire need of a sanity check, for I have stumbled upon a problem and am not quite sure how to proceed (apart for using tensorflow from now on).

TL;DR: AKA: The actual questions

I am seeing an insane speed up in training time when using tensorflow instead of theano.

Is this normal magpie behavior? (in other words: is magpie optimized for tensorflow)
Which do you use?
Have you compared the two recently?
Any other things I should test to make sure I'm not messing up somewhere and causing this?

If you are reading this, and you use Theano, you might want to consider trying tensorflow.

If this turns out to be expected behaviour the readme should probably contain this tidbit. I was using Theano all this time because I started using magpie when Tensorflow released it's 1.0 version that broke keras (they changed around parameter positions iirc).

My apologies in advance, but if you want some more details you'll have to read the long version.

The long version

Background

I was finalizing and testing my deployment script (again), which installs some system packages (pip, python3, virtualenv), creates a virtualenv, installs the selected backend (theano/tensorflow/tensorflow-gpu) , magpie and it's dependencies and the dependencies for the project I have created. And some other autostarting magic for when the system boots.

A problem occurs, not the actual content of this question though

This time it pulled in Theano 0.9.0 instead of the older 0.8.2 (which I was using before this point). When attempting to train a magpie model via my project it would throw a generic error in the background process without stacktraces or anything.

Epoch 1/5
Fatal Python error: PyEval_SaveThread: NULL tstate

Googling the error did not reveal too much, besides that this call is aparantly part of the C-API in python. Suspecting the fault was in the way I am using multiprocessing I switched to the tensorflow backend to see if I could reproduce the error.

The actual problem/strange thing

The error disappeared with tensorflow (1.1.0), I did however notice something strange. The same corpus with the same settings and the same raw-data collection of texts (which gets split into training and validation data by my project when starting training) would run a single epoch in about 30 seconds. In Theano this takes roughly 982 seconds according to the first prediction and ended up taking about 1200 seconds (20 minutes).

Checking and double-checking

Now a small speed difference I could accept, ignore and move on with. This is however showing that tensorflow is at least 30x faster compared to Theano on the same corpus (Train on 12528 samples, validate on 3794 samples for tensorflow, Train on 12517 samples, validate on 3805 samples for theano). This is insane and in other tests I have run since I have only gotten similar results. For these tests I have switched back to the 0.8.2 version of Theano because it works for me.

In checking the output of the model the tensorflow model seems fine and relevant to the texts I put into it. Some percentages are lower (compared to the theano model), but I'll chalk that one up to the fact that every training session is different when started via my project. So it's not that the tensorflow model is only pretending to be successfully training the model and just blazing through the training set without doing anything. (Tensorflow does warn me about not using a version with the available CPU Instruction sets enabled, but that would only make it faster still...)

Hardware information & OS

All of this training is being done on the CPU inside an container provided to me by the company I am working for, with 16 cores of an E5-1660 v4 and 251G of ram (although I am pretty sure I will not be able to use all of that RAM and it's just reporting the total amount of RAM in the physical system, I can pull about 1400% cpu usage on my python process so that's nice.) Using Ubuntu 16.04.2 LTS, fully up to date.

PIP list

A online diff list for our pleasure, tensorflow virtualenv on the left, theano virtualenv on the right, as we can see only the requests package & theano differ between the two different virtualenvs. I don't think this would be the source of the incredible speedup

tensorflow virtualenv

appdirs (1.4.3)
boto (2.47.0)
bz2file (0.98)
certifi (2017.4.17)
chardet (3.0.3)
click (6.7)
clickclick (1.2.1)
connexion (1.0.129)
decorator (4.0.11)
Flask (0.12.2)
Flask-HTTPAuth (3.2.2)
gensim (0.13.4.1)
h5py (2.7.0)
idna (2.5)
ipython-genutils (0.2.0)
itsdangerous (0.24)
Jinja2 (2.9.6)
jsonschema (2.6.0)
jupyter-core (4.3.0)
Keras (1.2.2)
magpie (0.2)
MarkupSafe (1.0)
nbformat (4.3.0)
nltk (3.2.4)
numpy (1.12.1)
packaging (16.8)
pip (9.0.1)
pkg-resources (0.0.0)
plotly (2.0.8)
protobuf (3.3.0)
pyparsing (2.2.0)
python-dateutil (2.6.0)
pytz (2017.2)
PyYAML (3.12)
requests (2.17.3)
scikit-learn (0.18.1)
scipy (0.19.0)
setuptools (35.0.2)
six (1.10.0)
smart-open (1.5.3)
strict-rfc3339 (0.7)
swagger-spec-validator (2.1.0)
tensorflow (1.1.0)
Theano (0.9.0)
traitlets (4.3.2)
urllib3 (1.21.1)
Werkzeug (0.12.2)
wheel (0.30.0a0)

theano virtualenv

appdirs (1.4.3)
boto (2.47.0)
bz2file (0.98)
certifi (2017.4.17)
chardet (3.0.3)
click (6.7)
clickclick (1.2.1)
connexion (1.0.129)
decorator (4.0.11)
Flask (0.12.2)
Flask-HTTPAuth (3.2.2)
gensim (0.13.4.1)
h5py (2.7.0)
idna (2.5)
ipython-genutils (0.2.0)
itsdangerous (0.24)
Jinja2 (2.9.6)
jsonschema (2.6.0)
jupyter-core (4.3.0)
Keras (1.2.2)
magpie (0.2)
MarkupSafe (1.0)
nbformat (4.3.0)
nltk (3.2.4)
numpy (1.12.1)
packaging (16.8)
pip (9.0.1)
pkg-resources (0.0.0)
plotly (2.0.8)
protobuf (3.3.0)
pyparsing (2.2.0)
python-dateutil (2.6.0)
pytz (2017.2)
PyYAML (3.12)
requests (2.16.5)
scikit-learn (0.18.1)
scipy (0.19.0)
setuptools (35.0.2)
six (1.10.0)
smart-open (1.5.3)
strict-rfc3339 (0.7)
swagger-spec-validator (2.1.0)
tensorflow (1.1.0)
Theano (0.8.2)
traitlets (4.3.2)
urllib3 (1.21.1)
Werkzeug (0.12.2)
wheel (0.30.0a0)

Thank you for reading my novel, if you would like to support the continued production of these types of novellas and overly long questions wrapped in so much background information you forget the original question please support me on patreon. Just kidding, I don't have a patreon.

Thanks again for reading and possibly answering any/all of my questions.

jstypka commented 7 years ago

Magpie is not optimized for TensorFlow or Theano in any way. Neither is Keras as far as I know, the two backends are abstracted away and used as engines for heavy-lifting computations.

Both Theano and TensorFlow are similar when it comes to performance (one should not be much optimized than the other), but the trick is that both of them can perform very differently depending on how they're set up. In my case, Theano by default didn't recognize many cores on my machine and used only one, while TensorFlow took advantage of the whole CPU out-of-the-box. I'm pretty sure Theano is able to leverage many cores, but I didn't care to set it up at the time.

And I'm pretty sure this is a similar issue here - TensorFlow clearly uses some resources here that Theano doesn't, giving you this kind of speed up. Maybe many cores? Maybe a GPU? Maybe something with memory? It all depends on your set up - also newer versions of the libraries probably are better with detecting the optimal settings for a machine.

NowayIndustries commented 7 years ago

This "performance between the tensorflow and theano is similair" sentiment is indeed what seems to be the mainstream on the internet aswell, which is primarily why I came to ask here.

As far as Theano setup goes I'm not sure what I did exactly, as far as I can recall I installed it, changed keras.json and set the flag to use all available cores. As for the tensorflow setup I just installed it and changed the keras.json file again. I guess tensorflow has better defaults for my hardware, probably the instructionsets, as it warns me about them frequently.

I guess I'll be periodically testing both backends after updates to see which one wins this time. I will just be using tensorflow for now, seeing as it is currently performing the best (and the new theano version doesn't run at all in my project).

Theano does use all my cores (very easy to enable, just have to set theano.config.openMP to true.) and it runs on the same hardware as tensorflow. There is no GPU available to the container (yet, although with this training speed it might not need it) and there should be plenty of memory.

I was hoping for an easy answer (aren't we all), I guess I'll find out one day, but today is not that day and that is okay. Thank you for giving this wall of text a read and telling me that it's probably user error :)

inspirehep / magpie

Sanity check: Tensorflow vs Theano 30x training speed difference? #98