Open jmbuhr opened 3 months ago
I'm in favor of making it consistent everywhere, using ints, and zero basen indexing. :+1: Is it strictly necessary atm? Did you come across a case where this became an issue?
Not strictly, but if we do I would like to address it sooner rather than later, because it significantly reduces mental overhead and misunderstandings when writing and testing reaction plugins as well.
It's an issue with the very large collagen systems.
With 0 based ints, one would have to convert every time something is written to the topology, right? Could you explain a bit more how exactly you would refactor the atom identifiers?
no, the ints would be in the keys, so the content like e.g. atom.nr
is still the gromacs atom number and thus a string. Refactoring will be fairly straightforward, just change the dictionary types and follow the screaming typechecker :)
The tradeoff will be at points where the location of an item is inferred from the content of another item.
e.g. Atom(nr='1', type='HC', resnr='828', residue='ACE', atom='HH31', cgnr='1', charge='0.11 23', mass='1.008', typeB=None, chargeB=None, massB=None, bound_to_nrs=['2'], is_radical=False)
will be at ix 0
in top.atoms
so in order to make it easier to find e.g. atoms of a bond it would need the ix
versions of certain properties, that are used internally but not written to topology files. Like how a bond has the strings (for gromacs) ai
and aj
, which refer to one-based atom ids, would be at key (int(ai) - 1, int(aj) - 1)
in top.bonds
. So the potential for off-by-one errors is not gone after this change, rather moved to different spots. They should be easier to spot by type checking though.
I'll have to just try this to see what parts are actually affected.
On nth thought, there is not solution to this that is entirely consistent (which is deeply unsatisfying) with all libraries involved and internally. Take the following example of iterating over bonds and e.g. checking properties of atoms involved in the bond:
for b in top.bonds.values():
ai = top.atoms[b.ai]
aj = top.atoms[b.aj]
In a bond we have ai
and aj
as the atom number (=groamcs id, 1-based), because that is also used for writing the line out to a top file (like @ehhartmann said). This means they have to either
So maybe we just stick with 1-based-str for now and change the way RecipeStep
s print out to also use those for consistency.
I like the second variant, setting the boarder between uncertainty and known consistent types as file vs in-memory. The impact on read/write speed should be minimal I guess.
This way numbers behave like numbers should, and there is no ambiguity within our code.
However, this is probably not needed for release 2.0, as it should not change how the user interfaces with KIMMDY.
Agreed, I like the clear interface.
So.... I have the urge for a "small" refactor that nonetheless touches most parts of kimmdys code and need your (@KRiedmiller @ehhartmann ) input if this is a good idea.
Our RecipeSteps use atom
ixs
(=0-based, integers) internally (but acceptids
(1-based, str) or ixs on creation) and also print out using ixs. This make it consistent with e.g. MDAnalysis. But our topology keys for atoms, bonds etc. are 1-based strings, which aligns with gromacs atom ids and made sense at the time. But as we found out later, those atom ids from the file can lie (fixed-width format...) and the actual atom id just depends on the line number. Using integers as keys will reduce the memory footprint of the topology considerably, making it a lot easier to work with a) very large systems like our big collagen systems or b) systems where the solvent needs to be reactive (i.e. also many atoms in the topology). To keep it consistent these keys should also be 0-based. And when you see a recipe likekimmdy.3_apply_recipe INFO: Recipe: 283⚡285 22574⚡22575 22574⚡22576 22574➡283 22575➡285 22576➡285
in your log you can then use those numbers directly as ixs on the topology.This is of course a bigger change than simply making the RecipeSteps print out ids instead of ixs, but it makes it overall more consistent.