add_bond is really slow, and I think it should be sped up if mbuild is to support larger systems.
Adding tens of thousands of bonds to a system at a time (which isn't an uncommon process) takes hours on a CPU.
I'll take a look at the speed of adding bonds; I personally haven't run into any speed issues so far, even for very large systems.
Recently I put a bit of effort into speeding up the adding of particles (and reworking some of the loader/conversion functions that add particles), see PR: https://github.com/mosdef-hub/mbuild/pull/1107 (basically, instead of adding compounds one at a time, add them as a list). I don't believe this revised functionality is in the current release on conda yet, but it should be in the next.
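For example, the list-based pattern looks like this (a minimal sketch):

    import mbuild as mb

    system = mb.Compound()
    # build the particles first, then hand them to add() as a single list,
    # rather than calling system.add() once per particle
    particles = [mb.Compound(name='C') for _ in range(1000)]
    system.add(particles)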
It's possible the issue you are encountering may actually be related to the addition of particles, not the bonds, as adding particles was definitely incredibly slow as system size increased (due to composing larger and smaller graphs together). Can you post an example of a system/code that is getting bogged down?
So, over the past few days I've been testing out some add-particle and add_bond cases and timing them.
This notebook should serve to illustrate what I mean: https://github.com/chemicalfiend/lammps-carrier-mobility/blob/main/mobility/BBL-Film-Neutral-Dry/gsdwriter.ipynb . If you scroll to the bottom you can see that I've timed adding particles and adding bonds, and it takes way longer to add bonds. Granted, in this case there are about 10,000 more bonds than particles, but the scaling isn't linear: adding the bonds takes hours longer, while adding the particles took hardly seconds.
I can try the list approach too and compare it; I don't use the conda release of mbuild anyway.
Thank you for your response, @chrisiacovella. I hope this issue helps.
Thanks for uploading the example of what you are doing specifically; that will be very helpful. I have a reasonable idea of what I think is causing the slowdown (which should be pretty easy to fix). I'll try to get some examples/workarounds put together soon.
The add_bond function in Compound basically just calls the add_edge function of networkx (networkx handles the underlying bond_graph of the Compound data structure). So sadly, the speed seems to be a consequence of networkx and not really anything we can optimize in mbuild. We've talked a bit about potentially switching to graph-tool (which wraps C++), but that is a significant overhaul, so it's not happening at this moment.
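For reference, here is a minimal sketch of that delegation (the class and attribute layout below are illustrative stand-ins, not mbuild's actual internals):

    import networkx as nx

    # Illustrative stand-in for how a Compound-like object delegates
    # bond bookkeeping to networkx; NOT mbuild's actual implementation.
    class TinyCompound:
        def __init__(self):
            # particles are nodes, bonds are edges
            self.bond_graph = nx.Graph()

        def add_bond(self, particle_pair):
            # each add_bond is ultimately one networkx add_edge call
            self.bond_graph.add_edge(particle_pair[0], particle_pair[1])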
However, I think your code can be substantially sped up by a small amount of restructuring. Networkx seems to get bogged down as the size of the underlying graph grows, not just in terms of the number of elements, but also the memory usage associated with it; e.g., a graph of integers will be much faster than a graph that contains all the compound information, even if they contain the same number of elements.
Basically, whether it is particles or bonds, it is slow to add them one at a time to the top-level Compound (i.e., what you've called system in your code). It is much faster to add particles/bonds to smaller groups, e.g., at the level of a molecule, and then add the smaller group (i.e., molecule) to the system. So in the code you posted, the time for adding compounds is pretty fast, since you build up the system by adding particles to molecules and then molecules to the overall system, which is the efficient way to do it:
for i in range(num_molecules):
    m = i + 1
    mol = mb.Compound()
    for atom in atoms:
        if atom.mol == m:
            a = mb.Particle(pos=[atom.x, atom.y, atom.z], element=atom.atomtype,
                            name=atom.atomtype, charge=atom.charge)
            mol.add(a)
    system.add(mol)
The slow part is where you loop over the bonds and add them one at a time to the system:
for i in range(len(bondi)):
    # print(f"Adding bond number {i}")
    system.add_bond((system[bondi[i] - 1], system[bondj[i] - 1]))
If you were to combine this bond step into the first loop (i.e., add the bonds to each molecule before the molecule is added to the system), things should be much, much faster. A sketch of that restructuring is below.
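Here is one hedged way it could look. This assumes `atoms` is ordered by a 1-based global index, that each atom carries that index as `atom.index` (an assumed attribute name, adjust to your data), and that every bond connects two atoms in the same molecule:

    import mbuild as mb

    # atoms, num_molecules, bondi, and bondj are as in your code above.
    # First, group the bonds by the molecule that owns them.
    bonds_by_mol = {}
    for bi, bj in zip(bondi, bondj):
        m = atoms[bi - 1].mol  # molecule that owns this bond
        bonds_by_mol.setdefault(m, []).append((bi, bj))

    system = mb.Compound()
    for i in range(num_molecules):
        m = i + 1
        mol = mb.Compound()
        local = {}  # global atom index -> particle within this molecule
        for atom in atoms:
            if atom.mol == m:
                a = mb.Particle(pos=[atom.x, atom.y, atom.z], element=atom.atomtype,
                                name=atom.atomtype, charge=atom.charge)
                mol.add(a)
                local[atom.index] = a
        # add bonds while the molecule's bond graph is still small,
        # *before* the molecule is composed into the system
        for bi, bj in bonds_by_mol.get(m, []):
            mol.add_bond((local[bi], local[bj]))
        system.add(mol)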
To illustrate the time difference, I threw together a naive example that just adds some random particles, where every 2 consecutive particles are bonded. This implementation should be slow.
import time
import numpy as np
import mbuild as mb

for n in [100, 1000, 5000, 10000, 20000]:
    # initialize some random particle positions
    particles = np.zeros(shape=(n, 3))
    for i in range(0, n):
        particles[i] = np.random.rand(3)

    # bond every 2 consecutive particles together
    bonds = []
    for i in range(0, n):
        if i % 2 == 0:
            bonds.append([i, i + 1])

    # init system
    start = time.time()
    system = mb.Compound()
    compound_list = []
    for i in range(0, particles.shape[0]):
        compound_list.append(mb.Compound(name='C', pos=particles[i]))
    system.add(compound_list)

    # add bonds one at a time to the top-level Compound (the slow part)
    for bond in bonds:
        system.add_bond((system[bond[0]], system[bond[1]]))
    end = time.time()
    print("# particles ", system.n_particles, " n_bonds: ", system.n_bonds, " time: ", round(end - start, 3))
The timing as a function of the number of particles/bonds for this unoptimized code:
# particles 100 n_bonds: 50 time: 0.013
# particles 1000 n_bonds: 500 time: 0.396
# particles 5000 n_bonds: 2500 time: 9.005
# particles 10000 n_bonds: 5000 time: 37.45
# particles 20000 n_bonds: 10000 time: 225.86
If I instead initialize molecules, add particles and bonds to the molecules, and then add the molecules to the system, the time difference is pretty dramatic. In the code below I just create a temporary graph with networkx to get the connected components (i.e., the groups of particles bonded into molecules). Since this graph only contains integers, the operation doesn't get bogged down like the graph underlying the Compound, so even throwing ~500K bonds at it, the performance is still good. There are of course other ways to do this that may or may not be faster than the graph (including using a faster graph library).
In the code below, the time for 500K particles is shorter than the time for 10K particles in the naive approach.
# particles 100 n_bonds: 45 time: 0.236
# particles 1000 n_bonds: 495 time: 0.038
# particles 5000 n_bonds: 2495 time: 0.174
# particles 10000 n_bonds: 4995 time: 0.35
# particles 20000 n_bonds: 9995 time: 15.497
# particles 50000 n_bonds: 24995 time: 2.46
# particles 100000 n_bonds: 49995 time: 5.384
# particles 500000 n_bonds: 249995 time: 25.847
# particles 1000000 n_bonds: 499995 time: 87.045
import time
import networkx as nx
import numpy as np
import mbuild as mb

for n in [100, 1000, 5000, 10000, 20000, 50000, 100000, 500000, 1000000]:
    system = mb.Compound()

    # initialize some random particle positions
    particles = np.zeros(shape=(n, 3))
    for i in range(0, n):
        particles[i] = np.random.rand(3)

    # create a list of bonds:
    # we'll just bond 2 consecutive particles together,
    # leaving the last 10 out, just to demo how we
    # might handle unbonded particles
    bonds = []
    for i in range(0, n - 10):
        if i % 2 == 0:
            bonds.append([i, i + 1])

    # start the timing!
    start = time.time()

    # create a bond graph that only contains integer edges
    # so that we can identify which particles are connected
    graph = nx.Graph()
    for bond in bonds:
        graph.add_edge(bond[0], bond[1])

    # track whether each particle has been added to the system
    status = [False] * n

    # It is much faster to pass a list of compounds to the add
    # function, rather than adding one at a time.
    # Adding one at a time gets very slow as the bond_graph of
    # the compound grows, related to the compose function in networkx.
    compound_list = []

    # loop over the connected components, i.e., molecules
    for c in nx.connected_components(graph):
        # this compound will represent our molecule
        mol = mb.Compound()
        temp = sorted(c)
        # add particles to the molecule, remembering which particle
        # corresponds to each integer node in the temporary graph
        node_to_particle = {}
        for item in temp:
            p = mb.Compound(name='C', pos=particles[item])
            mol.add(p)
            node_to_particle[item] = p
            status[item] = True
        # add bonds to the molecule while its bond graph is still small
        for edge in graph.subgraph(c).edges:
            mol.add_bond((node_to_particle[edge[0]], node_to_particle[edge[1]]))
        # add the molecule to the compound list
        compound_list.append(mol)

    # take care of any unbonded particles
    for i, added in enumerate(status):
        if not added:
            compound_list.append(mb.Compound(name='N', pos=particles[i]))
            status[i] = True

    # add the individual compounds to the system compound
    system.add(compound_list)
    end = time.time()
    print("# particles ", system.n_particles, " n_bonds: ", system.n_bonds, " time: ", round(end - start, 3))
Thank you so much, this helps me a lot! I'll close the issue since, going by what you said, it's not an issue with mbuild but with its dependencies. Appreciate it!
Great. Let us know if you have any other issues. I've been doing some tests and found an even easier way to speed this up.
Basically, the slow part is not really networkx, but, I think, indexing into the system.children OrderedSet. That just seems to get much slower as the system size grows.
So you could keep your original code and basically just convert the children OrderedSet to a list, and things should go pretty fast with much less effort.
There are reasons the code uses an ordered set internally, but I think we've added a lot of additional checks that might negate the need for a set as compared to just a simple list. Something for us to discuss in a future dev meeting. But for now, the solution below should be easy and fast.
import time
import numpy as np
import mbuild as mb

for n in [100, 1000, 5000, 10000, 20000, 50000, 100000]:
    particles = np.zeros(shape=(n, 3))
    for i in range(0, n):
        particles[i] = np.random.rand(3)
    bonds = []
    for i in range(0, n):
        if i % 2 == 0:
            bonds.append([i, i + 1])

    # init system
    start = time.time()
    system = mb.Compound()
    compound_list = []
    for i in range(0, particles.shape[0]):
        compound_list.append(mb.Compound(name='C', pos=particles[i]))
    system.add(compound_list)

    # convert the children OrderedSet into a list for more efficient indexing
    particle_list = list(system.children)
    for bond in bonds:
        system.add_bond((particle_list[bond[0]], particle_list[bond[1]]))
    end = time.time()
    print("# particles ", system.n_particles, " n_bonds: ", system.n_bonds, " time: ", round(end - start, 3))
# particles 100 n_bonds: 50 time: 0.006
# particles 1000 n_bonds: 500 time: 0.075
# particles 5000 n_bonds: 2500 time: 0.102
# particles 10000 n_bonds: 5000 time: 0.173
# particles 20000 n_bonds: 10000 time: 0.365
# particles 50000 n_bonds: 25000 time: 1.04
# particles 100000 n_bonds: 50000 time: 2.582
In terms of modifying mbuild, I think I can remove the part where a compound is added to the bond graph as a node, and instead have the graph constructed solely from when bonds are added (since that is the idea of it). This would mean the bond graph no longer contains particles that don't have bonds, but that is fine because we already keep track of those via compound.particles(). This would come up in the connected-components part, and it should be easy to have that function append the particles without bonds to the list of connected components; a sketch of the idea is below.
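A hedged sketch of that idea (the helper name and signature are illustrative, not mbuild's API):

    import networkx as nx

    # Hypothetical helper: molecules come from the connected components of
    # the bond graph, and particles that never appear in the graph (i.e.,
    # have no bonds) are appended as their own single-particle groups.
    def connected_groups(all_particles, bond_graph):
        groups = [list(c) for c in nx.connected_components(bond_graph)]
        bonded = set(bond_graph.nodes)
        groups.extend([p] for p in all_particles if p not in bonded)
        return groups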
Wow, the speedups seem dramatic; I'll try it for sure. Thank you again. I'll keep you posted if I have any other concerns.
FYI, a PR was just merged that swaps Compound.children from an OrderedSet to a list, which speeds up the entire workflow (both adding particles and adding bonds). This is not part of a release yet (although it is merged into the master branch on GitHub), so if you are installing from conda, you should probably still use the approach above (i.e., converting children to a list and referencing that list when adding bonds).
Thanks for raising this issue though. I personally did not have any workflows where I added bonds after initializing all the particle coordinates, so I never ran into this. This definitely helped us find a serious inefficiency.
https://github.com/mosdef-hub/mbuild/pull/1121
A few technical notes to document here, so the decision making in the PR is not lost:
mBuild was previously using its own code to define an OrderedSet. This seems to be based on Hettinger's recipe, which unfortunately has O(n) indexing, hence the performance issues as system size grows. Swapping this out for the OrderedSet from the ordered-set project vastly improved the performance of indexing into the children data structure, specifically making it O(1), since that code uses a standard list under the hood to provide efficient indexing (although that has drawbacks for other operations; see the GitHub repo for the project).
Ultimately, though, we do not necessarily need an OrderedSet to store the children. I'm sure children was initially defined as a set to avoid having the same compound added twice, but this is largely unnecessary because the code itself checks whether a compound has a parent before it is added (and will throw an error if a compound already has a parent, rather than adding it to the children, so it will never add the same compound/particle twice). As such, @daico007 and I agreed that a simple list would be the most efficient, since we really weren't using any of the additional functionality provided by an OrderedSet and didn't need the extra overhead. If we find we do need set functionality in some future code, it would be trivial to copy the list of children into an OrderedSet, at which point we can decide between the implementation in MoSDeF and the implementation in the ordered-set project, depending on which operations need to be most efficient (i.e., indexing vs. deletion). A simplified sketch contrasting the two indexing strategies follows.
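Neither snippet below is the exact implementation in either project; they are just minimal illustrations of why the backing data structure matters:

    # O(n) indexing: an ordered set kept as a linked structure (the style
    # of the recipe mentioned above) must walk from the start of the
    # sequence to reach position i
    def index_linked(ordered_items, i):
        for pos, item in enumerate(ordered_items):
            if pos == i:
                return item

    # O(1) indexing: backing the set with a plain list (the strategy used
    # by the ordered-set package) allows direct indexing, at the cost of
    # making deletion O(n) since the list must be compacted afterwards
    class ListBackedOrderedSet:
        def __init__(self):
            self.items = []      # preserves insertion order; O(1) indexing
            self.index_of = {}   # item -> position; O(1) membership tests

        def add(self, item):
            if item not in self.index_of:
                self.index_of[item] = len(self.items)
                self.items.append(item)

        def __getitem__(self, i):
            return self.items[i]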
All the tested speedups worked great! Can anything be done to also speed up this part:
system.save("system.gsd")
For large systems, after adding all the bonds and particles, this seems to take a lot of time too. Is there a way to save individual compounds one at a time or something? Should I start a new issue for this? I think this may be an issue with the gsd package rather than mbuild itself, though. @chrisiacovella
Hi @chemicalfiend, this probably warrants an issue by itself; it seems like this is an issue with the gsd writer and not the compound structure itself. I can do some profiling and let you know what I find.
Alright, I'll start a new issue.