graeter-group / kimmdy

Reactive MD pipeline for GROMACS using Kinetic Monte Carlo / Molecular Dynamics (KIMMDY)
https://graeter-group.github.io/kimmdy/
GNU General Public License v3.0
8 stars 1 forks source link

topology keys datatype #455

Open jmbuhr opened 3 months ago

jmbuhr commented 3 months ago

So.... I have the urge for a "small" refactor that nonetheless touches most parts of kimmdys code and need your (@KRiedmiller @ehhartmann ) input if this is a good idea.

Our RecipeSteps use atom ixs (=0-based, integers) internally (but accept ids (1-based, str) or ixs on creation) and also print out using ixs. This make it consistent with e.g. MDAnalysis. But our topology keys for atoms, bonds etc. are 1-based strings, which aligns with gromacs atom ids and made sense at the time. But as we found out later, those atom ids from the file can lie (fixed-width format...) and the actual atom id just depends on the line number. Using integers as keys will reduce the memory footprint of the topology considerably, making it a lot easier to work with a) very large systems like our big collagen systems or b) systems where the solvent needs to be reactive (i.e. also many atoms in the topology). To keep it consistent these keys should also be 0-based. And when you see a recipe like kimmdy.3_apply_recipe INFO: Recipe: 283⚡285 22574⚡22575 22574⚡22576 22574➡283 22575➡285 22576➡285 in your log you can then use those numbers directly as ixs on the topology.

This is of course a bigger change than simply making the RecipeSteps print out ids instead of ixs, but it makes it overall more consistent.

KRiedmiller commented 3 months ago

I'm in favor of making it consistent everywhere, using ints, and zero basen indexing. :+1: Is it strictly necessary atm? Did you come across a case where this became an issue?

jmbuhr commented 3 months ago

Not strictly, but if we do I would like to address it sooner rather than later, because it significantly reduces mental overhead and misunderstandings when writing and testing reaction plugins as well.

It's an issue with the very large collagen systems.

ehhartmann commented 3 months ago

With 0 based ints, one would have to convert every time something is written to the topology, right? Could you explain a bit more how exactly you would refactor the atom identifiers?

jmbuhr commented 3 months ago

no, the ints would be in the keys, so the content like e.g. atom.nr is still the gromacs atom number and thus a string. Refactoring will be fairly straightforward, just change the dictionary types and follow the screaming typechecker :)

jmbuhr commented 3 months ago

The tradeoff will be at points where the location of an item is inferred from the content of another item.

e.g. Atom(nr='1', type='HC', resnr='828', residue='ACE', atom='HH31', cgnr='1', charge='0.11 23', mass='1.008', typeB=None, chargeB=None, massB=None, bound_to_nrs=['2'], is_radical=False) will be at ix 0 in top.atoms so in order to make it easier to find e.g. atoms of a bond it would need the ix versions of certain properties, that are used internally but not written to topology files. Like how a bond has the strings (for gromacs) ai and aj, which refer to one-based atom ids, would be at key (int(ai) - 1, int(aj) - 1) in top.bonds. So the potential for off-by-one errors is not gone after this change, rather moved to different spots. They should be easier to spot by type checking though.

I'll have to just try this to see what parts are actually affected.

jmbuhr commented 3 months ago

On nth thought, there is not solution to this that is entirely consistent (which is deeply unsatisfying) with all libraries involved and internally. Take the following example of iterating over bonds and e.g. checking properties of atoms involved in the bond:

        for b in top.bonds.values():
            ai = top.atoms[b.ai]
            aj = top.atoms[b.aj]

In a bond we have ai and aj as the atom number (=groamcs id, 1-based), because that is also used for writing the line out to a top file (like @ehhartmann said). This means they have to either

jmbuhr commented 3 months ago

So maybe we just stick with 1-based-str for now and change the way RecipeSteps print out to also use those for consistency.

KRiedmiller commented 2 months ago

I like the second variant, setting the boarder between uncertainty and known consistent types as file vs in-memory. The impact on read/write speed should be minimal I guess.

This way numbers behave like numbers should, and there is no ambiguity within our code.

However, this is probably not needed for release 2.0, as it should not change how the user interfaces with KIMMDY.

jmbuhr commented 2 months ago

Agreed, I like the clear interface.