haasad / PyPardiso

Python interface to the Intel MKL Pardiso library to solve large sparse linear systems of equations
BSD 3-Clause "New" or "Revised" License

Segfaults and memory access violations on CI services #4

Closed haasad closed 8 years ago

haasad commented 8 years ago

The builds on Travis CI sometimes fail with a segfault, even after all tests have passed. This happens for Python 3.4 and 3.5, but only sporadically. It could be caused by MKL, pytest or Travis. A quick Google search shows that some interaction between pytest and MKL's memory management might cause this.

A similar thing happens on AppVeyor, where the exit code -1073741819 also seems to be connected to MKL's memory management.

In one case there are also memory access violations, where only some of the tests fail.

Possibly this could be fixed by releasing the memory of the solver with PyPardisoSolver.set_phase(-1), but there might be a more serious underlying problem in the wrapper.
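
A minimal sketch of that idea, assuming the PyPardisoSolver API referenced in this issue (only set_phase(-1) is taken from the text above, the rest is illustrative):

```python
import numpy as np
import scipy.sparse as sp
import pypardiso

# use the module-level solver instance, as in the sessions further down
ps = pypardiso.scipy_aliases.pypardiso_solver
x = ps.solve(sp.eye(10, format='csr'), np.ones(10))

# Pardiso's phase -1 means "release all internal memory". Setting it here only
# changes the phase flag; the release itself happens on the next call into the
# MKL pardiso routine, e.g. as a test-session teardown step.
ps.set_phase(-1)
```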

haasad commented 8 years ago

After more than two weeks of bug-hunting I have finally managed to fix this issue. I guess this is what can happen when people who have no experience whatsoever with manual memory management start to work with C code. I'll write up a short summary of what I did for future reference for myself (and maybe people who stumble upon this while googling segfaults and python wrappers).

The first approach was to find a case where the segfault is reproducible. I found some ways that appeared to increase the likelihood of a segfault:

Nevertheless the segfault only appeared sporadically and I couldn't find a case that always reproduced it. The error also happened way more often on the CI services than on my machine, which made it even more difficult. On OS X and Linux it mostly crashed after the tests had run, on Windows often at the beginning or in between. I suspected that there might even be two different causes for the segfault. These two types of segfaults are also what @cardosan described in #5.

I then set up gdb on linux to track down the reason for the segfaults.

Typical output of gdb looked like this: gdb pypardiso. The bottom line of all the gdb output was that the segfault happens in the garbage collector. But neither forcing more garbage collection nor deactivating the cyclic collector made the segfaults go away. I then suspected that somewhere in my Python code an array gets garbage collected while it is still in use by the solver. This idea led me down a rabbit hole of reading everything I could about malloc and free in C and about garbage collection and reference counting in CPython. That in turn led to combing through the numpy, scipy and ctypes source code, looking for the place in my code where something might get collected, for example because only a weakref to it exists or because no reference is kept to an object that is still in use.

When randomly following every hint in the gdb output didn't bring any success, @thomasst advised me to use valgrind. I had previously failed to get valgrind working with Python, but I gave it another go, and this was the outcome of running a minimal example of the wrapper under valgrind (linux full valgrind output):

==2106== LEAK SUMMARY:
==2106==    definitely lost: 162,732 bytes in 85 blocks
==2106==    indirectly lost: 0 bytes in 0 blocks
==2106==      possibly lost: 536,540 bytes in 223 blocks
==2106==    still reachable: 1,741,984 bytes in 3,622 blocks

Turns out that I have a memory leak in my program. While some of this is due to valgrind's inability to deal with Python's memory management (or my inability to properly set up valgrind with Python), I found out that the memory allocated by MKL is also leaked. I then focused more on the code directly connected to the MKL pardiso function.

It turns out that the official Intel MKL documentation played a trick on me: Here it is written that the pt input parameter for pardiso can be an array of:

INTEGER for 32-bit or 64-bit architectures
INTEGER*8 for 64-bit architectures

where INTEGER is a Fortran 32-bit int and INTEGER*8 is a Fortran 64-bit int. PyPardiso is supposed to work on 32-bit and 64-bit architectures and therefore I used 32-bit ints. I assumed that other people used 64-bit ints because they only develop for 64-bit architectures. MKL uses the pt-array to store internal memory addresses. I guess there are cases where a memory address on a 64-bit architecture becomes larger than a 32-bit int, which leads to a segfault because the pt-array can't hold this address. The chance for this probably increases the longer you use the solver and also differs between operating systems, which explains why it was impossible to reproduce and why it only happened sometimes.

In hindsight I assume this would have been obvious from the beginning if I had any experience with memory management. It makes a lot of sense that 64-bit operating systems need 64-bit memory addresses. It's just really unfortunate for me that the operating system seemed to be perfectly happy, most of the time, handing out memory addresses small enough to fit into 32 bits.

This is now fixed in v0.2.0
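
The essence of the fix is to allocate pt with 64-bit integers. A minimal sketch of the idea, not the actual v0.2.0 code (the pointer-size check is only an illustration):

```python
import ctypes
import numpy as np

# Minimal sketch, not the actual v0.2.0 code: size the pt handle array to the
# platform's pointer width, because Pardiso stores memory addresses in it.
pointer_bits = 8 * ctypes.sizeof(ctypes.c_void_p)
pt = np.zeros(64, dtype=np.int64 if pointer_bits == 64 else np.int32)
print(pt.dtype)  # int64 on a 64-bit interpreter
```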

haasad commented 8 years ago

Some more discussion with @thomasst revealed that I hadn't interpreted the reason for the segfaults correctly. Below is an example of the behaviour of the pt-array after the fix (i.e. as a 64-bit int array):

>>> import numpy as np
>>> import scipy.sparse as sp
>>> import ctypes
>>> import pypardiso
>>> ps = pypardiso.scipy_aliases.pypardiso_solver
>>> ps.pt
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> ps.solve(sp.eye(1, format='csr'), np.ones(1))
array([ 1.])
>>> ps.pt
array([         1, 4321438848,          0, 4403353216,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0,          0,          0,
                0,          0,          0, 4321871488])
>>> ps.pt.dtype
dtype('int64')

If I do the same in pypardiso v0.1.0, the pt-array looks like this:

>>> ps.pt
array([      1,       0, 8738432,       1,       0,       0, 4894592,
             1,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0], dtype=int32)

Using some black magic by @thomasst, I can access the memory of the pt-array beyond the boundaries of the 32-bit int array and load it as a 64-bit int array:

>>> def array_memory_address(a):
...     obj = ctypes.py_object(a.data)
...     address = ctypes.c_void_p()
...     length = ctypes.c_ssize_t()
...     ctypes.pythonapi.PyObject_AsReadBuffer(obj, ctypes.byref(address), ctypes.byref(length))
...     return address
...
>>> address = array_memory_address(ps.pt)
>>> np.frombuffer(ctypes.string_at(address, 64*8), np.int64)
array([           1,   4303705728,            0,   4299861888,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
         8589934593,            1,            0,            0,
        55834574848,            1,            1, 605590388877,
         8589934592,            0,            0,            0,
                  0,            0,            0,            0,
        -4294967296,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
         4294967295,            0, 614180323469,   4294967319])

This demonstrates how the Pardiso library writes to the same places in memory (i.e. pt[0], pt[1], pt[3] and pt[63]) as in the fixed version above. What is really interesting is what happens when we load the same block of memory as np.int32:

>>> np.frombuffer(ctypes.string_at(address, 64*8), np.int32)
array([      1,       0, 8738432,       1,       0,       0, 4894592,
             1,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       1,       2,       1,       0,       0,       0,
             0,       0,       0,      13,       1,       0,       1,
             0,     141,     141,       0,       2,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,      -1,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,      -1,       0,       0,       0,     141,     143,
            23,       1], dtype=int32)

Now it becomes clear that the second half is actually the iparm array. The Python memory manager apparently stores these two arrays adjacent to each other in memory, because they are both of the same size and type (this is not the case in the new version of pypardiso):

>>> ps.iparm
array([  1,   2,   1,   0,   0,   0,   0,   0,   0,  13,   1,   0,   1,
         0, 141, 141,   0,   2,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,  -1,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,  -1,   0,   0,   0, 141, 143,  23,   1], dtype=int32)
>>> array_memory_address(ps.pt)
c_void_p(4466231120)
>>> array_memory_address(ps.iparm)
c_void_p(4466231376)
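
A quick sanity check on the numbers from this session (nothing new, just the difference of the two addresses, which is exactly the size of the 32-bit pt-array):

>>> array_memory_address(ps.iparm).value - array_memory_address(ps.pt).value
256
>>> ps.pt.nbytes  # 64 int32 values: iparm starts right where pt ends
256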

It turns out that it only looked like the last element of the pseudo-int64 array was a valid memory address:

>>> np.frombuffer(ctypes.string_at(address, 64*8), np.int64)
array([           1,   4303705728,            0,   4299861888,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
         8589934593,            1,            0,            0,
        55834574848,            1,            1, 605590388878,
         8589934593,            0,            0,            0,
                  0,            0,            0,            0,
        -4294967296,            0,            0,            0,
                  0,            0,            0,            0,
                  0,            0,            0,            0,
         4294967295,            0, 614180323470,   4294967319])
>>> last_one = np.frombuffer(ctypes.string_at(address, 64*8), np.int64)[-1]
>>> last_one
4294967319
>>> np.frombuffer(ctypes.string_at(array_memory_address(last_one), 8), np.int32)
array([23,  1], dtype=int32)
haasad commented 8 years ago

After a lot of reverse engineering, I think I finally kind of understand what happened: the desired layout in memory for the pt- and the iparm-array looks like this (one - symbolizes 8 bytes, i.e. one int64 or two int32):

----------------------------------------------------------------
--------------------------------

What really happened in the old version of pypardiso is this:

--------------------------------================================

Because the pt-array was passed as a 32-bit array instead of a 64-bit one, its second half is overwritten by the iparm array. The pardiso solver uses elements 0, 1, 3 and 63 of the pt-array to store information, like this:

xx-x-----------------------------------------------------------X
xx-x----------------------------oooooooooooooooooooooooooooooooo <- second half overwritten by iparm
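
To make the aliasing concrete, here is a toy numpy illustration of the layout above (not PyPardiso code; the iparm values are taken from the session further up, and a little-endian machine is assumed):

```python
import numpy as np

# Toy illustration, not PyPardiso code: one contiguous block holding a
# 64-element int32 "pt" (as passed in v0.1.0) directly followed by "iparm".
block = np.zeros(128, dtype=np.int32)
pt32, iparm = block[:64], block[64:]

# Pardiso, however, treats the pt pointer as 64 slots of 64 bits (512 bytes),
# so its view of pt spans both arrays:
pt_as_seen_by_pardiso = block.view(np.int64)

iparm[62], iparm[63] = 23, 1        # the values from the session above
print(pt_as_seen_by_pardiso[63])    # 4294967319 on little-endian, i.e. 2**32 + 23
```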

As shown in the comment above, elements 1, 3 and 63 contain memory addresses. Reading these memory locations into numpy arrays revealed the following:

>>> ps.pt
array([         1, 4321438848,          0, 4403353216,          0,
                ...
                0,          0,          0, 4321871488])

Because pt[1] and pt[3] are in the first half of pt, they were safe from being overwritten, and pt[63] is never used by the solver, which is why the solver worked without any problem. However, one thing is a really unlucky coincidence: the 64-bit pt[63] is overwritten by the two 32-bit ints iparm[62] and iparm[63].

                                                               ↓
xx-x----------------------------oooooooooooooooooooooooooooooooX

iparm[62] is the

Size of the minimum OOC memory for numerical factorization and solution. This parameter provides the size in kilobytes of the minimum memory required by OOC Intel MKL PARDISO for internal floating point arrays. This parameter is computed in phase 1.

and iparm[63] is a reserved parameter that is set to zero. When the solver interprets these two 32-bit ints as the 64-bit pt[63], this yields a valid memory address most of the time:

>>> iparm62and63 = ps.iparm[62:]
>>> iparm62and63
array([23,  1], dtype=int32)
>>> wrong_address = np.frombuffer(ctypes.string_at(array_memory_address(iparm62and63), 8), np.int64)
>>> wrong_address
array([4294967319])
>>> int(wrong_address) - 2**32
23
>>> np.frombuffer(ctypes.string_at(ctypes.c_void_p(int(wrong_address)), 8*10), np.int64)
array([              34048,      79164837206272, 6504683582653423360,
                   5198405,                   0,       1099511627776,
                         0,                   0,                   0,
                         0])

pt[63] is now a pointer to a random memory location, which varies depending on the value of iparm[62], but somehow seems to produce a valid address most of the time. Because pt[63] isn't actively used, this seems to be ignored until the solver tries to shut down. Pardiso expects that pt[63] points to a memory location that was allocated by the solver and now needs to be freed. I don't fully understand how memory is freed on the C level, but I assume that it now depends on where the random address created by iparm[62] and iparm[63] points to. The solver will try to free things at this memory location and, depending on what is there, this will lead to a segmentation fault.
This memory location varies depending on the size of the matrix A and of b (this changes iparm[62]), and whatever is present at the wrong pt[63] address is bound to be different between operating systems. These weird interactions created a bug that was almost impossible to track down; I'm really happy that this is finally solved.

@cardosan and @cmutel: I'm not sure how easy it is to follow my ramblings, but I think it is safe to say that the results of pypardiso-v0.1.0 were correct, despite this obscure bug and the occasional segfaults.