bdice closed this issue 9 years ago
Original comment by Joshua Anderson (Bitbucket: joaander, GitHub: joaander).
OK, my minimal example is much more minimal:
void LocalDensity::computePy(trajectory::Box& box,
                             boost::python::numeric::array ref_points,
                             boost::python::numeric::array points)
{
/* actual compute code here
*/
}
I commented out any code that would actually do anything in compute() in LocalDensity.
Then I ran this test:
self.box = trajectory.Box(10);
self.pos = numpy.array(numpy.random.random(size=(10000,3)), dtype=numpy.float32)*10 - 5
self.ld = density.LocalDensity(3, 1, 1);
self.ld.compute(self.box, self.pos, self.pos);
This segfaults on petry 3 out of every 4 runs with the backtrace I posted above. If I change boost::python::numeric::array to boost::python::object, it no longer segfaults. However, if I instantiate a local variable boost::python::numeric::array ref_points = extract<boost::python::numeric::array>(ref_points_in); it segfaults again.
My conclusion is that the problem is somehow caused in garbage collecting a numpy array that has been touched by boost::python::numeric::array.
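Because the crash is stochastic, a single clean run proves little. A minimal sketch of a harness that reruns a script and tallies exit codes is below; the function names are hypothetical and not part of freud. It assumes a POSIX system, where subprocess reports death by signal N as return code -N, so a segfault shows up as -signal.SIGSEGV.

```python
import signal
import subprocess
import sys
from collections import Counter


def tally_exit_codes(python_args, runs=20):
    """Run `python <python_args>` `runs` times and count each exit code.

    Hypothetical helper: python_args is the argument list passed to the
    current interpreter, e.g. ["my_test.py"].
    """
    codes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable] + list(python_args),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes[result.returncode] += 1
    return codes


def count_segfaults(codes):
    """Number of runs killed by SIGSEGV (POSIX convention: -signum)."""
    return codes[-signal.SIGSEGV]
```

Calling tally_exit_codes(["my_test.py"], runs=20) on a reproducer script (the file name here is a placeholder) and then count_segfaults on the result gives a crash rate comparable to the "3 out of every 4 runs" figure.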
Searching by keywords on the line of code at dictobject.c:1379 led me to a python bug in 2.7.5 where the garbage collector was crashing when trying to clean up code allocated in another module. This bug has been fixed in the python 2.7 series (not that we ever triggered it to my knowledge). I couldn't find references to any such bug in python 3.4.
In any case, this appears to be a bug in either python or boost (or in the way they interact) and not in how we are using them in freud. I am going to roll the linux boxes back to python 3.3. That is the quickest way to work around the problem for now. I certainly don't have the time to try and track down subtle memory manager issues in python far enough to submit a bug.
Original comment by Richmond Newman (Bitbucket: newmanr, GitHub: newmanr).
Giant memory leak indeed. Most of the modules I use for production runs are reused, so (as far as I know) they aren't afflicted by the problem of being deallocated in the middle of the script. I'm honestly not sure what to do to fix these errors, though, since the problem seems to exist outside the scope of our code.
Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).
Hi, when I tested with GDB I was not aware of the stochastic nature of the problem. So I am pretty sure the problem would occur eventually.
One quick word of warning: I proposed this pointer array to prevent deallocation as a quick test to pinpoint the source of the error. We should definitely not use it as a workaround in any kind of production environment, as it amounts to a massive memory leak.
Original comment by Richmond Newman (Bitbucket: newmanr, GitHub: newmanr).
from freud import trajectory
from freud import sphericalharmonicorderparameters as shop
import numpy
import time
import gc
import weakref

def testCase():
    #Cubic box
    L = 10.0;
    box = trajectory.Box(L);
    #sl = shop.SolLiqNear(box, 2.0, 10.0, 6, 6, 12);
    #pointers = []
    #Ideal gas in box
    Np = 500;
    for i in range(10):
        print("Iteration {}".format(i));
        sl = shop.SolLiqNear(box,2.0,10.0,6,6,12);
        print("sl refcountafterconstruction is {}".format(len(gc.get_referrers(sl))));
        L=10+numpy.random.rand(); #10-11
        box = trajectory.Box(L);
        sl.setBox(box);
        #Ideal gas in box
        xyz = L*2*(numpy.random.rand(Np,3)-0.5);
        xyz = xyz.astype(numpy.float32);
        sl.computeSolLiqNoNorm(xyz);
        #print("Sl refcount after compute is {}".format(len(gc.get_referrers(sl))));
        pointers.append(sl);
        #del sl;
        print("Sl refcount from pointers is {}".format(len(gc.get_referrers(pointers[-1]))));
    #print("Attempting to deallocate pointers array");
    #del pointers;
    #print("Deallocation complete");

if __name__ == "__main__":
    pointers = [];
    #gc.set_debug(gc.DEBUG_SAVEALL);
    for i in range(10):
        testCase();
    gc.set_debug(gc.DEBUG_SAVEALL);
    print("End Script");
I did manage, however, to write one script that I can't seem to segfault on my iMac, included above. Herein I throw all the created objects into a giant list, which normally would segfault on deallocation at the end of the script. However, if you set some debug parameters to the garbage collector, this seems to prevent that from happening. However, regardless of when this parameter is set (beginning or end), if objects are destroyed during the script, it will still segfault. Sigh.
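One possible reading of why the debug flag changes the behavior, not confirmed anywhere in this thread: with gc.DEBUG_SAVEALL set, the collector appends every unreachable object it finds to gc.garbage instead of freeing it, so those objects are never deallocated by the cyclic collector. The sketch below shows only that documented gc behavior using a plain Python class (Node is a hypothetical stand-in, no freud or boost involved).

```python
import gc


class Node:
    """Hypothetical stand-in object that forms a reference cycle."""

    def __init__(self):
        self.me = self  # the cycle makes this collectable only by gc


def collect_with_saveall():
    """Return how many Node objects gc saved instead of freeing."""
    gc.collect()  # start from a clean state
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        Node()  # immediately unreachable, but kept alive by its cycle
        gc.collect()
        # With DEBUG_SAVEALL, unreachable objects land in gc.garbage
        return sum(1 for obj in gc.garbage if isinstance(obj, Node))
    finally:
        gc.set_debug(old_flags)
        gc.garbage.clear()
```

This only covers objects reclaimed by the cyclic collector. Objects whose reference count simply drops to zero are still freed immediately, which is consistent with the observation that destroying objects during the script still segfaults.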
Original comment by Richmond Newman (Bitbucket: newmanr, GitHub: newmanr).
My new iMac (gatesbrown, OSX yosemite, python 3.4.2) is afflicted with segfaults when using freud too. If you ensure that any used freud module is either reused, or kept in memory until the end of the script, your work will at least complete before segfaulting on exit. Not a real fix of course, just a way to live with it temporarily.
I tried creating a boost python example test program (entirely independent of freud) and get the same errors. I cannot safely deallocate a class after a call to one of its member functions that takes a boost::python::numeric::array (even if said function is empty and does nothing with it). So I guess that's the same as Josh's findings.
Simon: When you can't get it to crash in gdb, did you test a few times? I found that I couldn't get scripts to crash within GDB, but they would segfault (stochastically but with moderate probability) upon exiting.
I tried creating an example outside of freud as well, but the results remained the same. Calling a function that takes in a boost::python::numeric::array, even a function that does nothing, will trigger the problem. If the module you're using is never deallocated, then
Original comment by Joshua Anderson (Bitbucket: joaander, GitHub: joaander).
Here is the backtrace:
0x00007ffff79ec308 in dict_dealloc (mp=0x7ffff5fd0788)
at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/dictobject.c:1379
1379 /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/dictobject.c: No such file or directory.
(gdb) bt
#0 0x00007ffff79ec308 in dict_dealloc (mp=0x7ffff5fd0788)
at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/dictobject.c:1379
#1 0x00007ffff79f5f67 in module_dealloc (m=0x7ffff5fbd548)
at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/moduleobject.c:398
#2 0x00007ffff79f5967 in meth_dealloc (m=0x7ffff5fae108)
at /var/tmp/portage/dev-lang/python-3.4.2/work/Python-3.4.2/Objects/methodobject.c:150
#3 0x00007ffff73c8fb9 in __run_exit_handlers (status=0, listp=0x7ffff772f5a8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#4 0x00007ffff73c9005 in __GI_exit (status=<optimized out>) at exit.c:104
#5 0x00007ffff73b2dcc in __libc_start_main (main=0x4009a0 <main>, argc=2, argv=0x7fffffffd248, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd238) at libc-start.c:319
#6 0x0000000000400bb9 in _start ()
The minimum ingredients needed appear to be python + boost + calling a function that takes a boost::python::numeric::array input.
Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).
This is Richmond's test case.
from freud import trajectory
import numpy
from freud import locality
#Cubic box
L = 10.0;
box = trajectory.Box(L);
lc = locality.LinkCell(box,2.0);
#Ideal gas in box
Np = 100;
xyz = L*2*(numpy.random.rand(Np,3)-0.5);
xyz = xyz.astype(numpy.float32);
#Compute at least once (required for segfault)
lc.computeCellList(box,xyz);
print("End Script Test LinkCell");
Original comment by Joshua Anderson (Bitbucket: joaander, GitHub: joaander).
I've got a more reliable reproducer now. I get the issue with python 3.4.1 (tested 3.4.2 as well). But I do not get the issue with 2.7.
Who all is affected by this issue? Is it just on the vis lab machines? Should I set the default python to 2.7 on these systems?
Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).
I was porting my scripts to python3.4 on collins when I encountered this bug.
The bug occurs when I try to calculate the RDF from a previously read XMLDCDTrajectory.