Open mattwthompson opened 1 month ago
I ran a test script on my desktop to check the timings today. Here's the script:
import time
from yammbs import MoleculeStore
# full file from hpc3 after running an industry benchmark
store = MoleculeStore("/tmp/test.sqlite")
# doesn't have to exist, just has to match the force_field column in
# mm_conformers table
forcefield = "forcefields/new-tm-2.2.offxml"
for label, fn in [
("dde", MoleculeStore.get_dde),
("rmsd", MoleculeStore.get_rmsd),
("tfd", MoleculeStore.get_tfd),
("icrmsd", MoleculeStore.get_internal_coordinate_rmsd),
]:
start = time.time()
fn(store, forcefield, skip_check=True)
print(f"{label}: {time.time() - start:.2f} s")
And the output:
dde: 11.27 s
rmsd: 444.41 s
tfd: 680.38 s
icrmsd: 5244.86 s
Clearly the internal-coordinate RMSDs are the slowest by a big margin.
I'd be interested in working on speeding this up, but I'm also happy for you to handle it if you've already started/are interested yourself! The sqlite file I used for this was 318 MB, but I'm sure I can find a way to get it to you if you want to test locally too.
I repeated the benchmark above on a much smaller dataset (60 molecules vs 9873 molecules) to get a more useful reference for repeated benchmarks and got the following results:
dde: 0.07 s
rmsd: 1.33 s
tfd: 1.77 s
icrmsd: 12.74 s
Then I used this profiling script to look for hot spots in the internal coordinate RMSD code:
import cProfile
from pstats import SortKey
from yammbs import MoleculeStore
store = MoleculeStore("openff-2.2.0.sqlite")
forcefield = "openff-2.2.0.offxml"
with cProfile.Profile() as pr:
store.get_internal_coordinate_rmsd(forcefield, skip_check=True)
pr.print_stats(SortKey.CUMULATIVE)
Most of the time is obviously spent in the get_internal_coordinate_rmsd
function in _store.py
and then in the function of the same name in analysis.py
. Then, it looks like all of the time is spent in geometric functions, which I don't think we can/want to optimize, so I think the best choice is just to parallelize with multiprocessing. I'll give that a try!
Here's the top of the profile output too. I can upload the whole thing if it's of interest.
38643115 function calls (38021807 primitive calls) in 19.809 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.010 0.010 19.843 19.843 _store.py:701(get_internal_coordinate_rmsd)
367 0.018 0.000 19.403 0.053 analysis.py:121(get_internal_coordinate_rmsds)
367 0.012 0.000 15.020 0.041 internal.py:2006(__init__)
367 1.059 0.003 14.777 0.040 internal.py:2025(makePrimitives)
178176 1.805 0.000 4.979 0.000 numeric.py:1468(cross)
367 0.001 0.000 4.169 0.011 analysis.py:159(<dictcomp>)
1468 0.056 0.000 4.168 0.003 analysis.py:160(<listcomp>)
77392 0.529 0.000 3.557 0.000 internal.py:870(normal_vector)
45498 0.373 0.000 3.288 0.000 internal.py:1296(value)
534528 0.929 0.000 2.790 0.000 numeric.py:1393(moveaxis)
204264 1.304 0.000 2.637 0.000 internal.py:846(value)
1101 0.100 0.000 2.237 0.002 molecule.py:2177(build_topology)
737883 0.474 0.000 2.217 0.000 fromnumeric.py:2177(sum)
54654 0.429 0.000 1.942 0.000 internal.py:2565(add)
741246 0.572 0.000 1.695 0.000 fromnumeric.py:71(_wrapreduction)
1069056 0.951 0.000 1.457 0.000 numeric.py:1330(normalize_axis_tuple)
367 0.013 0.000 1.325 0.004 molecule.py:1860(atom_select)
9958 0.199 0.000 1.154 0.000 mst.py:141(kruskal_mst_edges)
191332 0.935 0.000 0.973 0.000 graph.py:893(add_edge)
924335 0.949 0.000 0.949 0.000 internal.py:1279(__eq__)
Do we know if geomeTRIC itself is optimized for this sort of behavior? I don't know where it's been used at scale, so I don't want to run past "can this be optimized for performance?" if people haven't used it before in the way we are
That's a good point, I'm not sure. I saw how much time it was spending doing numpy.cross products and assumed it was mostly doing necessary math, but from a quick peek at the code it could certainly be doing stuff that we don't need. I'll take a closer look at this before slapping multiprocessing on it.
I looked a bit closer, and I don't really think this is the intended usage. geometric itself appears to keep around the PrimitiveInternalCoordinates
objects for longer rather than creating a bunch of them on the fly, and there's also a lot of possibly duplicated work with the molecule's topology since we could otherwise use the OpenFF Molecule.to_topology
or to_networkx
methods. (Instead we're currently converting OpenFF Mol -> geometric Mol -> geometric topology -> networkx Graph
).
I copied over the parts of the PrimitiveInternalCoordinates
class that we need and tried deleting some of the parts we don't need, but it didn't lead to much, if any, measurable speedup. I'm sure there was more I could do here, but I wasn't quite sure how well-covered some of my changes were by the existing yammbs tests.
The most interesting part of geometric for our use case is the value
method on the Distance
, Angle
, Dihedral
, and OutOfPlane
classes. Constructing instances of these classes is done by PrimitiveInternalCoordinates.makePrimitives
, but the constructors for these classes just take atom indices that correspond to the geometry later passed to the value
method. If we could turn our own topologies into sequences of these classes, we could skip the initialization of a full PrimitiveInternalCoordinates
. We could also possibly cache some of this work since different conformers of the same molecule should have the same connectivity and same internal coordinates, I think.
Otherwise, I have a rough draft with multiprocessing that takes the profiling run above from 19.8 seconds down to 4.5 seconds with 8 CPUs. It possibly needs one more modification to use a generator instead of a list to avoid allocating too much memory, but it's working and passing the current yammbs analysis tests, at least.
@ntBre reports that
.get_tfd
,.get_icrmsd
, and possibly methods can take ~20 minutes in some runs. These operations are almost surely serial and I haven't ever thought about performance here. This makes a difference if paying for compute, especially on runners with many cores.