scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License

Segmentation fault in large scale lbfgs optimization #5102

Open borislavsm opened 9 years ago

borislavsm commented 9 years ago

Hello,

I receive a segmentation fault when I perform a large-scale L-BFGS-B optimisation in scipy. An optimisation with around 100000 parameters works fine, but with a significantly larger number of parameters (100000000) I get a segfault.

The gdb trace is:

#0  0x00007ffff6db8466 in __memset_sse2 () from /lib64/libc.so.6
#1  0x00007fffe7c2ca29 in cauchy_ () from /usr/lib64/python2.7/site-packages/scipy/optimize/_lbfgsb.so
#2  0x00007fffe7c313f9 in mainlb_ () from /usr/lib64/python2.7/site-packages/scipy/optimize/_lbfgsb.so
#3  0x00007fffe7c3435e in setulb_ () from /usr/lib64/python2.7/site-packages/scipy/optimize/_lbfgsb.so
#4  0x00007fffe7c1dd00 in f2py_rout__lbfgsb_setulb () from /usr/lib64/python2.7/site-packages/scipy/optimize/_lbfgsb.so

OS: CentOS, Python 2.7.5, numpy 1.9.2, scipy 0.12.1

Code that showcases the problem:

import numpy as np
import scipy
import scipy.optimize

N = 100000000

def df(thetav):
    # Dummy objective: a random score and a random gradient of size N,
    # just enough to drive the optimizer and trigger the crash.
    grad = np.random.random(N)
    score = np.random.random()
    print('score : %f ' % score)
    return (score, grad)

a = np.random.random(N)
(a, _, _) = scipy.optimize.fmin_l_bfgs_b(df, a)
argriffing commented 9 years ago

Some of the Fortran parts of scipy use the default Fortran integer type, which is 32-bit, causing integer overflow bugs on very large problems. This problem extends throughout scipy; see for example https://github.com/scipy/scipy/issues/5064.
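A quick illustration of the failure mode (a sketch, not from the original report: numpy's int32 stands in for a 32-bit Fortran INTEGER, and the concrete values are chosen to model a workspace-size computation that exceeds 2**31 - 1):

```python
import numpy as np

# Fortran's default INTEGER is 32-bit on most platforms; np.int32
# arithmetic wraps around the same way.
part1 = np.int32(2000000000)  # each summand fits in 32 bits...
part2 = np.int32(500001180)

with np.errstate(over='ignore'):
    total = part1 + part2  # ...but the true sum 2500001180 does not

print(total)  # -> -1794966116 (wrapped to a negative value)
```

A negative array length or index like this, passed on to memory routines inside the Fortran code, is exactly the kind of thing that ends in a segfault rather than a clean error.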

borislavsm commented 9 years ago

Thanks for the reply!

If the problem is integer overflow, what is the maximal size of the derivative vector that can be used?

Just to note that the optimizer usually performs several steps and only then crashes.

argriffing commented 9 years ago

Here's the Fortran code: https://github.com/scipy/scipy/blob/master/scipy/optimize/lbfgsb/lbfgsb.f.

If the problem is integer overflow, what is the maximal size of the derivative vector that can be used?

The expression 2*m*n + 11*m*m + 5*n + 8*m appears in the comments at the top of that Fortran code, where n is 'the dimension of the problem' and m is 'the maximum number of variable metric corrections used to define the limited memory matrix'. I would guess that if this value starts to overflow 32 bits then bad things could start to happen, but I'm not 100% sure that this is the source of the problem that you are seeing.
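A back-of-the-envelope check of that guess (a sketch: m = 10 is scipy's default for fmin_l_bfgs_b, and this only bounds the 'wa' workspace index from the formula above, not every integer quantity in the code):

```python
def workspace_size(n, m=10):
    # Length of the double-precision workspace 'wa' per the
    # comments in lbfgsb.f; m = 10 is scipy's default.
    return 2 * m * n + 11 * m * m + 5 * n + 8 * m

INT32_MAX = 2**31 - 1  # 2147483647

print(workspace_size(100000))     # 2501180 -- fits comfortably
print(workspace_size(100000000))  # 2500001180 -- exceeds INT32_MAX

# Largest n (with m = 10) for which this index still fits in 32 bits:
n_max = (INT32_MAX - (11 * 10 * 10 + 8 * 10)) // (2 * 10 + 5)
print(n_max)  # 85899298
```

Which is at least consistent with the report: n = 100000 works, n = 100000000 crashes, and the crossover would sit somewhere below 10^8.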

In my opinion the best way to fix this would be to find a compatibly licensed, updated version of this code elsewhere and re-introduce it into scipy from that source, if one exists. The lbfgsb.f source dates from 2011 and has probably found its way into other projects, some of which may already have patched it to support 64-bit problem sizes. But it looks like the most recent official release is still 3.0: http://users.iems.northwestern.edu/~nocedal/lbfgsb.html.

argriffing commented 9 years ago

@borislavsm You could try running https://github.com/stephenbeckr/L-BFGS-B-C. A header file has a note about integer sizes https://github.com/stephenbeckr/L-BFGS-B-C/blob/master/src/lbfgsb.h#L10.

larsmans commented 9 years ago

There's also liblbfgs, which is a much cleaner translation of the original code to C. I have an (incomplete/unmaintained) Python wrapper for it. Actually, it offers L-BFGS only, without bound constraints, so I guess it's not a proper replacement.

person142 commented 8 years ago

I know I'm a bit late to the party, but here goes. I downloaded the original L-BFGS-B code:

http://users.iems.northwestern.edu/~nocedal/lbfgsb.html

It has a test routine, driver1.f, which minimizes the generalized Rosenbrock function; the default test size is n = 25. I took @borislavsm's n = 100000000 and ran make; as expected, the code didn't compile (the compiler complained about overflow). But after adding the flags -fdefault-integer-8 (which makes the default Fortran integer 64-bit) and -mcmodel=large (an x86-specific flag that allows static data beyond 2 GB), the code compiled just fine, and the driver1.f test case with n = 100000000 converged.

This is rather tentative and needs more testing, but perhaps the issue can be solved simply by changing how the code is compiled?

Sorry if I'm missing something and am talking nonsense.