openforcefield / yammbs

Internal tool for benchmarking force fields

Parallelize summary methods #56

mattwthompson opened this issue 1 month ago

mattwthompson commented 1 month ago

@ntBre reports that .get_tfd, .get_icrmsd, and possibly other methods can take ~20 minutes in some runs. These operations are almost surely serial, and I have never thought about performance here. This matters when paying for compute, especially on runners with many cores.

ntBre commented 1 month ago

I ran a test script on my desktop to check the timings today. Here's the script:

import time

from yammbs import MoleculeStore

# full file from hpc3 after running an industry benchmark
store = MoleculeStore("/tmp/test.sqlite")

# doesn't have to exist, just has to match the force_field column in
# mm_conformers table
forcefield = "forcefields/new-tm-2.2.offxml"

for label, fn in [
    ("dde", MoleculeStore.get_dde),
    ("rmsd", MoleculeStore.get_rmsd),
    ("tfd", MoleculeStore.get_tfd),
    ("icrmsd", MoleculeStore.get_internal_coordinate_rmsd),
]:
    start = time.time()
    fn(store, forcefield, skip_check=True)
    print(f"{label}: {time.time() - start:.2f} s")

And the output:

dde: 11.27 s
rmsd: 444.41 s
tfd: 680.38 s
icrmsd: 5244.86 s

Clearly the internal-coordinate RMSDs are the slowest by a big margin.

I'd be interested in working on speeding this up, but I'm also happy for you to handle it if you've already started/are interested yourself! The sqlite file I used for this was 318 MB, but I'm sure I can find a way to get it to you if you want to test locally too.

ntBre commented 3 weeks ago

I repeated the benchmark above on a much smaller dataset (60 molecules vs 9873 molecules) to get a more useful reference for repeated benchmarks and got the following results:

dde: 0.07 s
rmsd: 1.33 s
tfd: 1.77 s
icrmsd: 12.74 s

Then I used this profiling script to look for hot spots in the internal coordinate RMSD code:

import cProfile
from pstats import SortKey

from yammbs import MoleculeStore

store = MoleculeStore("openff-2.2.0.sqlite")
forcefield = "openff-2.2.0.offxml"

with cProfile.Profile() as pr:
    store.get_internal_coordinate_rmsd(forcefield, skip_check=True)
    pr.print_stats(SortKey.CUMULATIVE)

Most of the time is spent in the get_internal_coordinate_rmsd function in _store.py and then in get_internal_coordinate_rmsds in analysis.py. From there, it looks like essentially all of the time is spent in geometric functions, which I don't think we can (or want to) optimize, so I think the best choice is just to parallelize with multiprocessing. I'll give that a try!

Here's the top of the profile output too. I can upload the whole thing if it's of interest.

         38643115 function calls (38021807 primitive calls) in 19.809 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010   19.843   19.843 _store.py:701(get_internal_coordinate_rmsd)
      367    0.018    0.000   19.403    0.053 analysis.py:121(get_internal_coordinate_rmsds)
      367    0.012    0.000   15.020    0.041 internal.py:2006(__init__)
      367    1.059    0.003   14.777    0.040 internal.py:2025(makePrimitives)
   178176    1.805    0.000    4.979    0.000 numeric.py:1468(cross)
      367    0.001    0.000    4.169    0.011 analysis.py:159(<dictcomp>)
     1468    0.056    0.000    4.168    0.003 analysis.py:160(<listcomp>)
    77392    0.529    0.000    3.557    0.000 internal.py:870(normal_vector)
    45498    0.373    0.000    3.288    0.000 internal.py:1296(value)
   534528    0.929    0.000    2.790    0.000 numeric.py:1393(moveaxis)
   204264    1.304    0.000    2.637    0.000 internal.py:846(value)
     1101    0.100    0.000    2.237    0.002 molecule.py:2177(build_topology)
   737883    0.474    0.000    2.217    0.000 fromnumeric.py:2177(sum)
    54654    0.429    0.000    1.942    0.000 internal.py:2565(add)
   741246    0.572    0.000    1.695    0.000 fromnumeric.py:71(_wrapreduction)
  1069056    0.951    0.000    1.457    0.000 numeric.py:1330(normalize_axis_tuple)
      367    0.013    0.000    1.325    0.004 molecule.py:1860(atom_select)
     9958    0.199    0.000    1.154    0.000 mst.py:141(kruskal_mst_edges)
   191332    0.935    0.000    0.973    0.000 graph.py:893(add_edge)
   924335    0.949    0.000    0.949    0.000 internal.py:1279(__eq__)

mattwthompson commented 3 weeks ago

Do we know if geomeTRIC itself is optimized for this sort of usage? I don't know where it's been used at scale, so I don't want to run past "can this be optimized for performance?" if people haven't used it the way we are.

ntBre commented 3 weeks ago

That's a good point; I'm not sure. I saw how much time it was spending on numpy.cross products and assumed it was mostly doing necessary math, but from a quick peek at the code it could certainly be doing work we don't need. I'll take a closer look at this before slapping multiprocessing on it.

ntBre commented 3 weeks ago

I looked a bit closer, and I don't really think this is the intended usage. geometric itself appears to keep PrimitiveInternalCoordinates objects around for reuse rather than creating a bunch of them on the fly, and there's also a lot of possibly duplicated work with the molecule's topology, since we could use the OpenFF Molecule.to_topology or to_networkx methods directly. (Instead, we're currently converting OpenFF Mol -> geometric Mol -> geometric topology -> networkx Graph.)
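
For comparison, the direct route is basically a one-liner (a quick sketch, assuming Molecule.to_networkx returns an atom-indexed graph, which is my reading of the toolkit method mentioned above):

from openff.toolkit import Molecule

molecule = Molecule.from_smiles("CCO")

# one call, instead of OpenFF Mol -> geometric Mol -> geometric topology -> Graph
graph = molecule.to_networkx()

print(graph.number_of_nodes(), graph.number_of_edges())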

I copied over the parts of the PrimitiveInternalCoordinates class that we need and tried deleting some of the parts we don't need, but it didn't lead to much, if any, measurable speedup. I'm sure there was more I could do here, but I wasn't quite sure how well-covered some of my changes were by the existing yammbs tests.

The most interesting part of geometric for our use case is the value method on the Distance, Angle, Dihedral, and OutOfPlane classes. Instances of these classes are constructed by PrimitiveInternalCoordinates.makePrimitives, but their constructors just take atom indices that correspond to the geometry later passed to the value method. If we could turn our own topologies into sequences of these objects, we could skip initializing a full PrimitiveInternalCoordinates. We could also cache some of this work, since different conformers of the same molecule share the same connectivity and therefore the same internal coordinates, I think.
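
A rough sketch of that idea (assuming Distance and Angle are importable from geometric.internal, their constructors take atom indices, and value() takes an (n_atoms, 3) array, per the above; the primitives_for helper and the SMILES-keyed cache are hypothetical):

import numpy
from geometric.internal import Angle, Distance
from openff.toolkit import Molecule

_PRIMITIVE_CACHE: dict = {}

def primitives_for(molecule: Molecule) -> list:
    # Hypothetical helper: conformers of the same molecule share
    # connectivity, so key the cache on the mapped SMILES
    key = molecule.to_smiles(mapped=True)
    if key not in _PRIMITIVE_CACHE:
        primitives = [
            Distance(bond.atom1_index, bond.atom2_index)
            for bond in molecule.bonds
        ]
        primitives += [
            Angle(a.molecule_atom_index, b.molecule_atom_index, c.molecule_atom_index)
            for (a, b, c) in molecule.angles
        ]
        # Dihedral and OutOfPlane terms would be enumerated the same way,
        # mirroring what makePrimitives does; omitted for brevity
        _PRIMITIVE_CACHE[key] = primitives
    return _PRIMITIVE_CACHE[key]

def internal_coordinate_values(molecule: Molecule, xyz: numpy.ndarray) -> numpy.ndarray:
    # xyz is one conformer's (n_atoms, 3) coordinates, in the same atom
    # order used to build the primitives
    return numpy.array([p.value(xyz) for p in primitives_for(molecule)])

The RMSDs would then be computed between the value arrays of matching QM and MM conformers.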

Otherwise, I have a rough draft with multiprocessing that takes the profiling run above from 19.8 seconds down to 4.5 seconds with 8 CPUs. It possibly needs one more modification to use a generator instead of a list to avoid allocating too much memory, but it's working and passing the current yammbs analysis tests, at least.
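
For reference, the draft is shaped roughly like this (a sketch, not the actual branch; _one_molecule, parallel_ic_rmsds, and the job tuples are hypothetical names standing in for the real per-molecule payload):

from multiprocessing import Pool

def _one_molecule(args):
    # Placeholder worker: unpack one molecule's payload and run the
    # existing serial per-molecule analysis on it (stubbed out here)
    molecule_id, qm_coordinates, mm_coordinates = args
    rmsds = {}  # the real draft calls the existing serial code here
    return molecule_id, rmsds

def parallel_ic_rmsds(jobs, n_processes=8):
    # `jobs` can be a generator of (molecule_id, qm, mm) tuples, so the
    # whole dataset never has to be materialized as a list up front;
    # imap pulls from it lazily and yields results as they finish
    with Pool(processes=n_processes) as pool:
        yield from pool.imap(_one_molecule, jobs)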