speeding up add function for large compounds

chrisiacovella commented 1 year ago

PR Summary:

This refers to issue #1104 . This PR aims to improve the performance of the Compound.add function and loading routines that rely upon it.

The basic gist, as outlined in the issue above, is that when constructing a compound using the add function, the performance can degrade as the compound grows in size, due to the repeated merging (i.e., composing) of bond_graphs, specifically, merging a small with a large bond graph over and over again. This PR changes the underlying logic such that if a list of Compounds is passed to the add function, it will use the compose_all function to merge these bond_graphs together, before adding to the root Compound (and merging bond_graphs with the root compound). The compose_all function effectively scales with the number of compounds being merged.

This provides substantial speed improvements, as outlined in the issue.

Other additions, Compound.add now accepts a list for the label argument if compounds are provided in a list.

This is still a WIP, as tests need to be added for adding labels via a list, as well as adding in the updated load functions to stash compounds into lists (mdtraj conversion is basically complete and provides substantial speed up).

PR Checklist

[x] Includes appropriate unit test(s)
[x] Appropriate docstring(s) are added/updated
[x] Code is (approximately) PEP8 compliant
[x] Issue(s) raised/addressed?

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 97.56% and project coverage change: +0.10 :tada:

Comparison is base (1211260) 89.39% compared to head (76bcdc8) 89.50%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1107 +/- ## ========================================== + Coverage 89.39% 89.50% +0.10% ========================================== Files 61 61 Lines 6235 6305 +70 ========================================== + Hits 5574 5643 +69 - Misses 661 662 +1 ``` | [Impacted Files](https://codecov.io/gh/mosdef-hub/mbuild/pull/1107?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=mosdef-hub) | Coverage Δ | | |---|---|---| | [mbuild/conversion.py](https://codecov.io/gh/mosdef-hub/mbuild/pull/1107?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=mosdef-hub#diff-bWJ1aWxkL2NvbnZlcnNpb24ucHk=) | `89.55% <93.54%> (+0.22%)` | :arrow_up: | | [mbuild/compound.py](https://codecov.io/gh/mosdef-hub/mbuild/pull/1107?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=mosdef-hub#diff-bWJ1aWxkL2NvbXBvdW5kLnB5) | `97.07% <100.00%> (+0.12%)` | :arrow_up: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

chrisiacovella commented 1 year ago

timings for loading mol2 files of different size:	N waters	new time (s)
1000	0.347	4.92
5000	1.46	266.1
10000	3.02	too slow

chrisiacovella commented 1 year ago

Speed improvements for converting parmed are effectively the same as for mdtraj. Note, converting from gmso will need modification still, but that calls a function in gmso, so it will need to be a separate gmso PR.

chrisiacovella commented 1 year ago

I added in a condense function to Compound. This is similar to flatten, but it adds an intermediate level in the hierarchy based on connectivity. This refers to issue #1108 .

To reiterate the issue, take a compound that is like this:

Compound, 28 particles, 18 bonds, 4 children
├── [C x 1], 1 particles, 0 bonds, 0 children
├── [Compound x 1], 3 particles, 2 bonds, 1 children
│   └── [tip3p x 1], 3 particles, 2 bonds, 3 children
│       ├── [H2 x 1], 1 particles, 1 bonds, 0 children
│       ├── [H3 x 1], 1 particles, 1 bonds, 0 children
│       └── [O1 x 1], 1 particles, 2 bonds, 0 children
└── [Compound x 2], 12 particles, 8 bonds, 4 children
    └── [Compound x 4], 3 particles, 2 bonds, 1 children
        └── [tip3p x 1], 3 particles, 2 bonds, 3 children
            ├── [H2 x 1], 1 particles, 1 bonds, 0 children
            ├── [H3 x 1], 1 particles, 1 bonds, 0 children
            └── [O1 x 1], 1 particles, 2 bonds, 0 children

And make it this:

Compound, 28 particles, 18 bonds, 10 children
├── [C x 1], 1 particles, 0 bonds, 0 children
└── [tip3p x 9], 3 particles, 2 bonds, 3 children
    ├── [H2 x 1], 1 particles, 1 bonds, 0 children
    ├── [H3 x 1], 1 particles, 1 bonds, 0 children
    └── [O1 x 1], 1 particles, 2 bonds, 0 children

I still need to finish adding in tests for this; that will come in the next push.

chrisiacovella commented 1 year ago

I addressed all the comments, including making one list_flatten helper function (doesn't add any real overhead).

mosdef-hub / mbuild