openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
309 stars 90 forks source link

Replace NetworkX calls in `Molecule.from_pdb` with graph-tool #1143

Open mattwthompson opened 2 years ago

mattwthompson commented 2 years ago

graph-tool seems to be a more performant replacement for networkx. It is available on conda-forge, although it may or may not make our dependency footprint larger, i.e.

Depends: cairomm, cairomm-1.0 >=1.12.2, gdk-pixbuf, graph-tool-base >=2.43,<2.44.0a0, gtk3, libglib >=2.68.3,<3.0a0, librsvg, matplotlib-base, pycairo, pygobject, python >=3.8,<3.9.0a0, python_abi 3.8. _cp38

mattwthompson commented 2 years ago

There is another project retworkx that has some good benchmarks on distance/pair calculations and subgraph isomorphisms. It looks less mature that the graph-tool but they have a JOSS paper in review.

igraph is also mentioned and appears to perform well.

jchodera commented 2 years ago

@mattwthompson : It would be helpful to provide some timing and profiling data justifying the importance of this change. Do you have (1) a benchmark case, (2) timing data, and (3) profiling data clearly demonstrating the need and pinning blame on a specific NetworkX operation that would clearly be improved by an alternative implementation?

It's not clear to me, for example, that such a need actually exists, what its priority is, or whether it is an algorithmic (rather than a performance issue, where gains due to optimizations are usually much smaller).

mattwthompson commented 2 years ago

There's no compelling need in the existing toolkit functionality, but some part of the topology refactor would benefit from this. There's some amount of NetworkX graph isomorphism used in Molecule.from_pdb. Unfortunately that PR crashes my browser and even git diff chokes locally for me, so it's hard for me to dig into the source code now, but I ran a quick snakeviz run on this file (with crystal waters trimmed out), which has only 2423 atoms. The overall runtime is slow (150s) because caching is not fully implemented yet, so the direct impact of NetworkX's slow algorithms is blurred. This can be revisited once cached and other speedups are implemented.

test.prof.txt

mattwthompson commented 1 month ago

For some medium-sized systems, ~70% of Interchange.to_gromacs is spent creating NetworkX graphs. This risks being an upper bound on how fast writing out GROMACS files can be for systems of more than a few thousand atoms. The toolkit already tries to do things like short-circuit before graph-matching when the molecules very obviously differ. Though perhaps there are other ideas to explore (more efficient way to create these graphs, other ways graph-matching could altogether be bypassed, etc.).

https://github.com/openforcefield/openff-interchange/issues/1014#issuecomment-2263459426