Element handling - Githubissues

mattwthompson commented 2 years ago

The current refactor switches out element handling from OpenMM to mendeleev, checks off a lot of our boxes - it has all of the features we need (and more), it's actively developed and well-documented, and it's available on conda-forge. But it's also a little heavy at import time and with respect to simple conversions, i.e. atomic number -> element symbol. See https://github.com/openforcefield/openff-toolkit/pull/1177 for some timings after optimization. AFAICT

The toolkit only needs to know the atomic number of an atomic, sometimes also needs the element symbol and/or mass, but nothing more
The toolkit does not currently support isotopes and it's unclear if it will (see a discussion https://github.com/openforcefield/openff-toolkit/issues/974)
Anything that a downstream user/library (would want to look up about an element can be done using the atomic number as a starting point.

There's an easy tendency to want to pull in big libraries to do relatively simple things, which can be a poor match between maintainability/packaging/import times and features added. My counter-proposals, either of which I'd be happy to contribute

Use a simpler, smaller element package, like ele. The author has recently left for industry, but the MoSDeF ecosystem uses it and I expect we'd be able to maintain it via contributions - or simply fork it and go from there.
Don't use an element package at all - just have a couple of dicts hanging around in some file, probably {atomic_number: symbol} and {atomic_number: mass} mappings. These would map between simple types (int, str, and int/quantity) and the toolkit would not explicitly have any objects representing elements. It would also allow us to keep the same values reported by OpenMM without importing it, whereas another unit package likely introduces small changes.

My only strong preference is to get away from using openmm.app.element - it's a module that has worked great for our purposes but I would love our typical conda-based users to not need to install the CUDA toolkit to parse a SMILES string. Switching to mendeleev satisfies this request, as does the other ideas above.

Keep in mind we're already breaking the API of the element and mass properties, so if we wanted to make a big change, now is the time to consider it.

mattwthompson commented 2 years ago

A subtle detail to consider is that switching element packages would have a minor impact on atomic masses. Mendeleev cites (click through to ref 58) a 2017 report from IUPAC, but seems to truncate a few significant digits. OpenMM doesn't cite a specific source but seems to be using data from around the early 2000s. Not that it changes much or is so important, and I'm not interested in digging more into the revisions to find out.

This is certainly be a behavior change(DeepDiff accurately catches this as a change before the sixth significant digit). Whether this qualifies as a regression or matters at all is maybe a matter of opinion. I don't care what we go with, but documenting the data source of the masses that end up in a users' objects might be worthwhile. If we went with our own light implementation we'd be able to control whether we want to match OpenMM's values or some year of IUPAC data.

import openmm.app
import mendeleev

print("Atomic number   mendleev         OpenMM")
for atomic_number in range(1, 20):
    print(
        atomic_number,
        "\t",
        f"{mendeleev.element(atomic_number).mass:15g}",
        "\t",
        openmm.app.element.Element.getByAtomicNumber(atomic_number).mass,
    )

Atomic number   mendleev         OpenMM
1              1.008     1.007947 Da
2             4.0026     4.003 Da
3               6.94     6.9412 Da
4            9.01218     9.0121823 Da
5              10.81     10.8117 Da
6             12.011     12.01078 Da
7             14.007     14.00672 Da
8             15.999     15.99943 Da
9            18.9984     18.99840325 Da
10           20.1797     20.17976 Da
11           22.9898     22.989769282 Da
12            24.305     24.30506 Da
13           26.9815     26.98153868 Da
14            28.085     28.08553 Da
15           30.9738     30.9737622 Da
16             32.06     32.0655 Da
17             35.45     35.4532 Da
18            39.948     39.9481 Da
19           39.0983     39.09831 Da

j-wags commented 2 years ago

I'm in support of having our own dicts of atomic_number to mass and symbol. This will keep everything in our regression testing suite from blowing up if an external package changes some masses (even if the change is correct!). We could put it in openff-units to keep our constants together.

For the values, it would be great if we could find an authoritative source that "coincidentally" agrees with OpenMM's current masses. This may seem silly, but the entire concept of doing atomistic simulations with "average by earth crust composition" masses is of questionable justification too, so I don't think there's a strong authority on this point anyway.

I think specific-isotope masses would be even better, but that would be a big deviation from the rest of the field, so I don't think it's a strictly infrastructure question, and we can discuss that in #974 separately.

mattwthompson commented 2 years ago

Closed #1182, will land in v0.11.0 and the associated API breaks will be noted in the release notes and/or migration guide.

openforcefield / openff-toolkit

Element handling #1178