NUTS performance broke with version 3.2 #2753

Closed michaelosthege closed 6 years ago

michaelosthege commented 6 years ago

Description

I recently noticed that NUTS performance became terrible on my ODE models. Here is an example with the Lorenz attractor, implemented such that NUTS can be used to sample the initial conditions (x, y, z) and parameters (a, b, c) of the model.

Example code: https://gist.github.com/michaelosthege/a75b565d3f653721fa235a07eb089912
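For orientation, here is a minimal sketch of that kind of setup. This is not the linked gist: the priors, step size, the explicit Euler scheme, and the simulated observations below are illustrative assumptions only.

```python
import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

dt, n_steps = 0.01, 500

# Simulate noisy "observations" from the true Lorenz system (illustrative only).
true = np.empty((n_steps, 3))
state = np.array([1.0, 1.0, 1.0])
for i in range(n_steps):
    x, y, z = state
    state = state + dt * np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - 8.0 / 3.0 * z])
    true[i] = state
observed = true + np.random.normal(scale=1.0, size=true.shape)

def lorenz_step(xyz, a, b, c):
    # One explicit Euler step of the Lorenz equations (the gist uses a Runge-Kutta scheme).
    x, y, z = xyz[0], xyz[1], xyz[2]
    dxyz = tt.stack([a * (y - x), x * (b - z) - y, x * y - c * z])
    return xyz + dt * dxyz

with pm.Model() as model:
    a = pm.Normal("a", mu=10.0, sd=2.0)
    b = pm.Normal("b", mu=28.0, sd=2.0)
    c = pm.Normal("c", mu=8.0 / 3.0, sd=1.0)
    xyz0 = pm.Normal("xyz0", mu=1.0, sd=5.0, shape=3)

    # Roll the integrator forward symbolically so Theano can differentiate through it.
    trajectory, _ = theano.scan(fn=lorenz_step,
                                outputs_info=xyz0,
                                non_sequences=[a, b, c],
                                n_steps=n_steps)

    pm.Normal("obs", mu=trajectory, sd=1.0, observed=observed)
    trace = pm.sample(1000, init="advi")
```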

Traceback

Here I ran

> conda create -n venvtest python=3.5 pymc3
> python lorenz.py
> pip install "pymc3==3.2"
> python lorenz.py

In between, I Ctrl+C'ed the first run...

(venvtest) C:\Users\zufal\Repos\...\experiments>python lorenz.py
pymc3 version 3.1
WARNING (theano.gof.compilelock): Overriding existing lock by dead process '1760' (I am process '174048')
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
WARNING (theano.tensor.blas): We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.
Average Loss = 1.0204e+05:   0%|6       ...lots of spaces...    | 730/200000 [01:39<7:31:27,  7.36it/s]forrtl: error (200): program aborting due to control-C event
Image              PC                Routine            Line        Source
libifcoremd.dll    00007FFF43DF94C4  Unknown               Unknown  Unknown
KERNELBASE.dll     00007FFF7D638C8D  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFF80753F84  Unknown               Unknown  Unknown
ntdll.dll          00007FFF80D313B1  Unknown               Unknown  Unknown

(venvtest) C:\Users\zufal\Repos\...\experiments>pip install "pymc3==3.2"
Collecting pymc3==3.2
  Using cached pymc3-3.2-py3-none-any.whl
Requirement already satisfied: theano>=0.9.0 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: joblib>=0.9 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: tqdm>=4.8.4 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: patsy>=0.4.0 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: h5py>=2.7.0 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: six>=1.10.0 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: pandas>=0.18.0 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pymc3==3.2)
Requirement already satisfied: numpy>=1.9.1 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from theano>=0.9.0->pymc3==3.2)
Requirement already satisfied: scipy>=0.14 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from theano>=0.9.0->pymc3==3.2)
Requirement already satisfied: python-dateutil>=2 in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pandas>=0.18.0->pymc3==3.2)
Requirement already satisfied: pytz>=2011k in c:\users\zufal\miniconda3\envs\venvtest\lib\site-packages (from pandas>=0.18.0->pymc3==3.2)
Installing collected packages: pymc3
  Found existing installation: pymc3 3.1
    Uninstalling pymc3-3.1:
      Successfully uninstalled pymc3-3.1
Successfully installed pymc3-3.2

(venvtest) C:\Users\zufal\Repos\...\experiments>python lorenz.py
pymc3 version 3.2
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 10%|###################5      ...lots of spaces...    | 99/1000 [05:50<53:12,  3.54s/it]

If you scroll to the right in the output above, you can see the performance indicators at the end of the progress bars.

Versions

Both runs use the same freshly installed environment: Python 3.5 on Windows. On my Linux machines I see the same slowdown, but I did not test exactly which version introduced it there. (They also ran fine just a while ago.)

Analysis

With the Visual Studio performance profiler I can see that, within _Tree._build_subtree, a lot of time is spent in pymc3.model.ValueGradFunction.__call__/.../theano.scan_module.scan_perform.perform. I assume this is not the tt.scan that I use for ODE solving, because if I manually assign step=pymc3.Slice() it still uses tt.scan in the forward pass, yet the performance is stable at 1.64 it/s.
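For reference, the slice-sampling comparison amounts to something like this (a sketch; `model` and the number of draws are placeholders):

```python
import pymc3 as pm

with model:
    # Bypass NUTS entirely; the forward pass still goes through tt.scan,
    # yet the sampling speed stays stable (~1.64 it/s in the runs above).
    trace_slice = pm.sample(1000, step=pm.Slice())
```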

I am aware that an ODE system is quite an exotic model for pymc3, but I get the feeling that the performance drawbacks I see here are merely an amplification of performance issues that others have observed (e.g. https://github.com/pymc-devs/pymc3/issues/2723#issuecomment-345982298).

junpenglao commented 6 years ago

ODEs are difficult to initialize; you might want to set init='advi' for comparison under pymc3 3.2.
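In code that would be roughly the following (a sketch; `model` and the number of draws are placeholders):

```python
import pymc3 as pm

with model:
    # Initialize NUTS with ADVI instead of the 3.2 default jitter+adapt_diag.
    trace = pm.sample(1000, init="advi")
```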

michaelosthege commented 6 years ago

Well spotted! This indeed solves the problem. Now it's even slightly faster at 8.09 it/s! Thank you very much!

aseyboldt commented 6 years ago

Just a note on that performance difference: the number of samples per second is not a good measure of sampler performance. What we are interested in is effective samples per second, and that might (or might not) be better with the adapted mass matrix than with the one provided by ADVI.
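To compare on that basis, one can look at effective sample size per unit of wall-clock time, roughly like this (a sketch; `trace` and `sampling_time` in seconds are assumed to come from a finished run):

```python
import pymc3 as pm

n_eff = pm.effective_n(trace)  # effective sample size per variable
ess_per_second = {name: n / sampling_time for name, n in n_eff.items()}
print(ess_per_second)
```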

About your model: I didn't go through it in detail, but it looks like you implemented the gradient of the ODE by moving the Runge-Kutta integrator into Theano. I'm not sure this is the best way to do it. I've only ever done this in the context of PDEs, but it is probably better to solve the adjoint ODE instead of the original one to compute the gradient. That way you can even use any ODE solver you like. Here is a short intro.
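For context, this is the standard adjoint-sensitivity formulation in a nutshell (generic textbook form, not taken from the linked intro):

```latex
\dot{x} = f(x, \theta, t), \qquad L = g(x(T))
% adjoint state, integrated backwards from t = T to t = 0:
\dot{\lambda} = -\left(\frac{\partial f}{\partial x}\right)^{\top} \lambda,
\qquad \lambda(T) = \frac{\partial g}{\partial x(T)}
% parameter gradient:
\frac{dL}{d\theta} = \int_0^T \lambda(t)^{\top}\, \frac{\partial f}{\partial \theta}\, dt
```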

You are using uniform priors on some parameters. In many cases this makes life harder for the sampler, as the geometry of the posterior can get strange at the boundaries. Unless you need hard limits, something like a normal prior is better most of the time.
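Concretely, something along these lines (a sketch; the parameter name and values are made up):

```python
import pymc3 as pm

with pm.Model():
    # Hard-bounded uniform prior; posterior geometry can get awkward at the edges:
    # a = pm.Uniform("a", lower=0.0, upper=20.0)
    # Softer alternative without hard limits:
    a = pm.Normal("a", mu=10.0, sd=5.0)
```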

michaelosthege commented 6 years ago

Thank you for the suggestions - I will have a look at them.

I noticed the issue because the NUTS sampling in my sampler-comparison workflow suddenly took ages (it slowed down by almost 100x).

[Plot: sampler comparison]

The workflow runs different samplers on the same model, checks convergence, and in the end plots the effective sampling rate.

With this particular model (not the Lorenz attractor) I expect NUTS to be the fastest. However, DEMetropolis is remarkably fast (in terms of N_effective/h) too.
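A rough sketch of what such a comparison loop can look like (hypothetical; the step methods, draw counts, and the `model` variable are placeholders, not the actual workflow):

```python
import time
import pymc3 as pm

results = {}
with model:
    for name, step in [("NUTS", pm.NUTS()), ("Slice", pm.Slice())]:
        start = time.time()
        trace = pm.sample(2000, step=step)
        hours = (time.time() - start) / 3600.0
        n_eff = pm.effective_n(trace)
        # effective samples per hour, per variable
        results[name] = {var: n / hours for var, n in n_eff.items()}
```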