mosdef-hub / mbuild

A hierarchical, component based molecule builder
https://mbuild.mosdef.org
Other
174 stars 81 forks source link

RecursionError when using multiprocessing pool with an mBuild compound #786

Open uppittu11 opened 4 years ago

uppittu11 commented 4 years ago

Bug summary I get a pickling recursion error when trying to use multiprocessing in a script with an mBuild compound. Is this a bug in mBuild? Is there a better multiprocessing package or something to get around this?

Code to reproduce the behavior This is an example use of multiprocessing with mBuild in which I'm trying to identify the overlapping waters when placing a solute in a water box. I realize that I could probably just do this in packmol, but I'm sure we will run into more cases where multiprocessing might be useful, especially for really large systems.

https://gist.github.com/uppittu11/7d46029f7627db890f6ab9a35dec80bc

Software versions mBuild version: mBuild 0.10.5 | from the mosdef channel python version: Python 3.7.8 | packaged by conda-forge OS: MacOS Mojave 10.14.6

ahy3nz commented 4 years ago

Yeah I got the same recursion error. If you drop the number of compounds in the water box, then you don't hit the recursion limit, but I don't think that's the solution you want.

This code snippet will work with your 50-ane and 20,000 waters. I think most parallel libraries end up having to pickle objects in order to distribute, so the workaround is to pickle something simpler than an mbuild compound, like the xyz numpy array. The extra work then goes to each worker where the waterbox compound has to get recreated prior to doing any kdtree searching. Again, not an elegant solution but it circumvents the recursion error for this system

import itertools as it

def get_overlaps(alk_particle, water_box_xyz):
    print(id(alk_particle)) # This is a debug line to track processes
    overlapping_atoms = []
    temp_waterbox = mb.Compound(subcompounds=(mb.Particle(pos=xyz) for xyz in water_box_xyz))
   # for xyz in water_box_xyz:
        #temp_particle = mb.Particle(pos=xyz)
        #temp_waterbox.add(temp_particle)

    for water_atom in temp_waterbox.particles_in_range(alk_particle, 0.25):
        overlapping_atoms.append(water_atom)

    return overlapping_atoms

args = zip([p for p in alkane.particles()], 
           it.repeat(water_box.xyz)
          )

with Pool(4) as p:
    output = p.starmap(get_overlaps, args)

Alternatively, you could use freud functionality to find overlaps/neighbors, and this works pretty well

import freud
box = freud.box.Box.from_box(water_box.periodicity)
aq = freud.locality.AABBQuery(box, water_box.xyz)
distances = []
for bond in aq.query(alkane.xyz, dict(r_max=0.25)):
    # The returned bonds are tuples of the form
    # (query_point_index, point_index, distance). For instance, a bond
    # (1, 3, 0.2) would indicate that points[3] was one of the 4 nearest
    # neighbors for query_points[1], and that they are separated by a
    # distance of 0.2
    # (i.e. np.linalg.norm(query_points[1] - points[3]) == 2).
    distances.append(bond[0])

from collections import Counter
how_many_overlaps_counter = Counter(distances)
vyasr commented 4 years ago

@ahy3nz has the right idea here. The issue is that objects must be serialized prior to distribution across processes. I wouldn't consider this solution particularly inelegant, tbh it's a pretty standard way to handle this. However, as @uppittu11 mentioned being able to use multiprocessing for analysis like this is in general pretty desirable, so you can automate this process somewhat by defining the __getstate__ and __setstate__ methods for mbuild compounds. Those methods control how an object is pickled and unpickled, so if you can define some canonical (simpler) representation of a compound (say a list of numpy arrays of positions, bonds, dihedrals, etc) then defining these methods will make pickling (and therefore multiprocessing) work transparently for users.

uppittu11 commented 4 years ago

Thanks for pointing me in the right direction, @ahy3nz and @vyasr.

Although it's probably not a short-term goal, It would be great to be able to use multiprocessing in the future when users would be setting up much larger systems. Also, mbuild could possibly replace the packmol functionality with our own parallel, python-based functions.

CalCraven commented 2 years ago

@uppittu11 Is this still a wishlist item or something you would be willing to implement. Lot's of these old issues are related to packmol issues, and something might have to be done all at once to address them. TBH, it might be worth an entire paper to come up with a new package that addresses some of the awkwardness that packmol brings to the table.