glotzerlab / freud

Powerful, efficient particle trajectory analysis in scientific Python.
https://freud.readthedocs.io
BSD 3-Clause "New" or "Revised" License

RDF accumulate leaks memory and crashes #169

Closed. bdice closed this issue 6 years ago.

bdice commented 6 years ago

Original report by Åsmund Ervik (Bitbucket: asmunder, GitHub: asmunder).


I am trying to make a very accurate computation of the RDF for a big-ish system of hard spheres (3D). I am using HOOMD-blue for the HPMC simulation, which works well with ~22k spheres and 10e6 sweeps. I am trying to compute the RDF from data saved every 1000 steps, with a cutoff of 8 and a bin width of 0.001.

But when I try to compute the RDF with freud following the tutorials, either online during the simulation through a callback or on a saved trajectory with many frames, freud uses up all available memory as it iterates through the snapshots and then crashes with "MemoryError: std::bad_alloc" after 74 of the 1e4 snapshots in my trajectory. To my understanding, the memory for a snapshot should be freed once freud is finished with it.

Currently I'm working around this by running a script with an outer loop that uses subprocess.Popen to spawn a second Python script, which computes the RDF for 10 snapshots at a time, saves the result to a .npy file, and exits. The outer loop then adds up the partial RDFs and averages them at the end, as sketched below. This works fine, but it is of course an ugly hack.
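A rough sketch of that outer loop, for concreteness (the worker script worker_rdf.py and the .npy file names are hypothetical; only the 10-snapshot chunking is taken from the description above):

#!python

import subprocess
import numpy as np

n_snapshots = 10000
chunk = 10

partials = []
for start in range(0, n_snapshots, chunk):
    # The worker computes the RDF over snapshots [start, start + chunk),
    # saves it to rdf_<start>.npy, and exits, returning all of its
    # (leaked) memory to the operating system.
    proc = subprocess.Popen(
        ['python', 'worker_rdf.py', str(start), str(chunk)])
    proc.wait()
    partials.append(np.load('rdf_{}.npy'.format(start)))

# Every chunk covers the same number of snapshots, so a plain mean
# of the partial results gives the trajectory-averaged RDF.
g_r = np.mean(partials, axis=0)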

I'm not doing anything special, so you should be able to reproduce this just by taking the example linked below, switching to 3D, increasing the number of spheres, setting rmax=8.0 and dr=0.001, and running enough callbacks (around 74 on my 64 GB machine).

https://github.com/joaander/hoomd-examples/blob/master/Analysis%20-%20Quantitative%20-%20Online%20analysis%20with%20Freud.ipynb
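For reference, a minimal sketch of the failing analysis loop. It assumes the freud 1.x-era API (freud.density.RDF(rmax, dr) and RDF.accumulate(box, points)) and reads a GSD trajectory with the gsd package; any per-frame source of boxes and positions should show the same memory growth.

#!python

import freud
import gsd.hoomd

rdf = freud.density.RDF(rmax=8.0, dr=0.001)

with gsd.hoomd.open('trajectory.gsd', 'rb') as traj:
    for frame in traj:
        box = freud.box.Box.from_box(frame.configuration.box)
        # Each call builds a temporary neighbor list internally.
        # Before the fix, those lists were never freed, so memory
        # grows with every frame until std::bad_alloc is raised.
        rdf.accumulate(box, frame.particles.position)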

bdice commented 6 years ago

Original comment by Matthew Spellings (Bitbucket: mspells, GitHub: klarh).


To clarify what we learned: the problem wasn't actually the circular references themselves, it was that not enough was happening at the Python level to trigger garbage collections, which are required to clean up objects with circular references. Adding periodic calls to gc.collect() should fix the observed behavior without updating freud in this case (see the sketch below). The solution we opted for is to explicitly break the circular reference for automatically generated neighbor lists, so they get cleaned up immediately by reference counting.
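In code, that interim workaround looks something like the following (a sketch under the same assumed 1.x-era API as the loop above; the collection period of 10 frames is arbitrary):

#!python

import gc

import freud
import gsd.hoomd

rdf = freud.density.RDF(rmax=8.0, dr=0.001)

with gsd.hoomd.open('trajectory.gsd', 'rb') as traj:
    for i, frame in enumerate(traj):
        box = freud.box.Box.from_box(frame.configuration.box)
        rdf.accumulate(box, frame.particles.position)
        if i % 10 == 0:
            # Run the cyclic garbage collector explicitly so the
            # cyclically-referenced NeighborList objects are freed.
            gc.collect()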

bdice commented 6 years ago

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


@asmunder thanks again for finding this. We now have a fix on the master branch. We'll aim to make a bugfix release soon, but if you would like something immediately, feel free to clone the repo and confirm that the fix works for you.

bdice commented 6 years ago

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


Merged in issue169 (pull request #153)

Fix nlist memory issues; fixes issue #169

Approved-by: Bradley Dice <bdice@bradleydice.com>
Approved-by: Vyas Ramasubramani <vramasub@umich.edu>

bdice commented 6 years ago

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


Thanks for reporting!

Thanks for the info, Matt. I'll try to reproduce the problem; hopefully that fix resolves it.

bdice commented 6 years ago

Original comment by Matthew Spellings (Bitbucket: mspells, GitHub: klarh).


Some notes for @vyasr and @bdice:

The extra memory usage comes from creating the default neighbor list. In particular, the NeighborList object doesn't seem to be garbage collected because it points back to the CellList that created it through its base attribute. That link was intended to prevent garbage-collection problems, but it seems to have introduced its own (possibly due to some unintuitive behavior of refcounting inside Cython?). Adding something like:

#!python

if nlist is None:
    # Break the cycle so plain reference counting can free the
    # automatically generated NeighborList right away.
    nlist_.base = None

to the end of RDF.accumulate seems to fix the problem.
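A toy demonstration of the mechanism (plain Python, not freud code): objects in a reference cycle are invisible to reference counting alone and are only reclaimed when the cyclic garbage collector runs.

#!python

import gc

class Node(object):
    pass

a = Node()
b = Node()
a.ref = b
b.ref = a  # a cycle, analogous to nlist_.base pointing back at its creator
del a, b   # refcounts never reach zero, so nothing is freed here
print(gc.collect())  # > 0: the cycle is only cleaned up by the collector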