Closed vyasr closed 3 years ago
@vyasr Thank you for review of GMSO
repository. I will be able to answer the discussion around performance and hashing:
Sites
and Connections
not being hashable, is because they are mutable and hence this will cause a problem of ghost elements, unintended index changes etc. We deal with with hashable
Potential
and AtomType
(which are also mubtable) objects by tracking changes on them once they within the realms of a Topology. The Connection
and Site
object represent a physical entity and hence two sites or two connections will never be the same. I agree that the properties
could use some form of optimization, because forming objects on the fly is not good.If you're unfamiliar with the concept, the idea comes from relational databases like SQL where you reduce redundancy by storing unique entities in separate tables and have everything else point to that (I'm being very rough and skipping a lot of details, but this is the core idea that would be relevant here.
This is done for AtomTypes
, BondTypes
, AngleTypes
, DihedralTypes
, ImproperTypes
. And Sites
and Connections
were meant to be redundant.
One thing that could improve performance is to define a proper hierarchy such that updates to every lower member in the hierarchy wouldn't need an orchestration of updates from Topology
.
@umesh-timalsina thanks for the response! Here are my thoughts.
Sites
and Connections
to be mutable makes sense for some use cases, but in that case I'm not sure what the point of using IndexedSet
s to store them in the Topology
is. If the goal is just to avoid users from putting the exact same Site
object into the Topology
but the user could create two identical but distinct Site
objects (i.e. same name, position, etc), then I would argue that this is a very weak protection and not worth using such a data type. In that case, I would just use a simple list to store these instead. Is that unviable for a different reason?Sites
and Connections
being redundant in the way that you describe? This is an area where my lack of experience makes it hard to know what the right suggestion is, but ultimately I'm thinking that (assuming the user needs an API to access a Topology
's sites through an API of Site
instances) there's a tradeoff here. One option is to do what is currently being done, which is literally storing a set of Site
objects in an internal collection and present those to the user whenever they query the relevant property. This is the easiest method to implement. The alternative is to parse provided sites into some internal data structures (perhaps what I proposed, perhaps something else), but provide convenience APIs for the user to be able to interact with that data structure as though it was a list of Site
s. This method is harder, but would allow you to use internal data structures that are much more amenable to fast batch processing when it's needed. This method would also be facilitated by ducktyping inputs to add_site
to allow either gmso.Site
objects or other inputs, for instance tuples like (position, element)
or so, but that could be a nice feature in any case. There are details to work out such as how you would create the underlying gmso.Site
object, but I don't think that kind of thing is an insurmountable barrier.@vyasr
The use of IndexedSet
is primarily done for O(1) access for __contains__
, which I believe is possible with a collection like set/dictionary and I agree that it will not solve the redundancy problem. But, in terms of redundancy, there is no meaningful insight I could provide because I am in the same boat as you when it comes to experience in this domain.
Once, we decide the hierarchy of objects(#385) or a free-form implementation (#381), I believe performance hits we get can be resolved in a meaningful way by the use of proper data-structures/ and or design patterns.
I see, if that's the primary reason then that makes sense. I don't know enough to know whether x in l
is a common enough need to justify that, but it seems reasonable.
I agree that those decisions need to be made before considering performance, optimization should always be done at the end. However, I think those design decisions should be at least partially based on the common usage patterns. Realistically, if you allow as much flexibility as gmso currently plans to, there will be some ways to use the package that are faster than others. Abstract concerns about proper design can be helpful guides, but IMO choosing the design that is most amenable to the most common usage patterns should be weighted more heavily. For instance, relating to #381 I definitely see the argument but I'd like to play devil's advocate and question the premise: is this degree of flexibility really necessary in all cases, or can we place some constraints (at least in certain contexts such as Site
s that are members of Topology
objects) in order to simplify and optimize the implementation for common use cases where that flexibility is typically not needed? I'm not the right person to answer that question, but now seems like the right time to ask it :)
To answer some of the questions in the top comment: Topologies:
Sites:
I hope this answers your questions @vyasr.
@uppittu11 thanks for the answers, let me try and follow up.
Topologies
IndexedSet
is to support constant time __contains__
? I can imagine that's important for when you do topology.add_connection
to make sure that sites exist, otherwise that could be very slow. I don't actually see anywhere that takes advantage of the indexing property, though, so is there a reason you can't just use a normal set
? It's not clear to me exactly what characteristics each of the data stores in a topology need to have, so I'm trying to get a grasp on that to figure out the right data structure. Let me try and make a list, let me know if I'm missing anything:
Sites
Site
would have fields like mass
and charge
. Specifically regarding position, what I'm getting at is the following: Is the position of a Site
meaningful outside of a topology? I understanding wanting to be able to work with some object like a Site
outside the context of a topology because it allows you to create something with a certain AtomType
(for instance) and then assign it to multiple topologies, but I'm not convinced that it's location in physical space is meaningful without the coordinate system as defined by the topology.atom_type
).Addressing parts of points 2 and 3 regarding sites:
The reason for having a charge
attribute for Site
is that charges sometimes get tweaked after the atom typing step. An example of this is when we want to model a charged surface by adjusting the partial charges of the surface atoms. I think the current idea is that Site.charge
would override the charge in the Atomtype
, which is discussed in #225.
As for mass, I'm not really aware of any cases where an atom's mass differs from the mass defined by the atom type. If this holds true, maybe we don't need to store mass as an attribute of Site
.
Other than that, I also agree that positions shouldn't change in Topology.
Got it, I understand wanting to tweak charges. However, it's never quite that ad hoc, is it? If you want to assign different charges to surface atoms or ligands chelating some metal or whatever, you're usually dealing with essentially another class of particles. In other words, you're defining a different AtomType that applies to some Subtopology of your system (notwithstanding our discussion yesterday about whether Subtopology should exist or not). mbuild and foyer are designed to make it easier to create and run simulations across various simulation engines, but they're also designed to facilitate TRUE simulations, which in this case means making it easier to disseminate force fields. That being the case, it seems to me that the feature you really want is a way to perform post hoc modifications of a force field.
In principle (I don't know if SMARTS are actually flexible enough for this to be true), any change you want to make to specific charges should be possible to encode in a force field file itself. Then you could actually disseminate the specific force field file used for a given simulation and someone could apply it to a different molecule that they want to model the same way. If that's true, then maybe the real feature you need is to simplify the process of generating customized force fields, since I assume that the real barrier to entry here is actually writing XML files with the appropriate SMART strings for people who aren't as familiar with the nomenclature.
Maybe @rsdefever could explain this better, but I think the main reason we wanted to make it easier and more efficient to calculate and assign partial charges to a system. For most forcefields (like GAFF), nonstandard residues are assigned partial charges from ab initio/semi-empirical calculations, which means that even though a group of atoms are of the same atom type, they may all have slightly different charges. So, we made charge a Site property so that the redundant atom type information (like vdW eqn+parameters) is not repeated for every atom.
There may be a better way to be able to customize the charges for every atom in a molecule, so we're open to suggestions on that.
Regarding mass as a site property, I guess the only use case is CG systems where the atoms -> bead mapping isn't always the same. For hexane, for example
CH3-CH2-CH2-CH2-CH2-CH3 -> "C2"-"C2"-"C2"
would have different masses for the middle C2 vs the two terminal C2s, even if the atom type is the same.
I don't have much to add here. I think @uppittu11 and @rmatsum836 have covered things pretty well. AFAIK charges are pretty frequently modified. Its not necessarily totally "ad hoc", -- e.g., GAFF does not provide any partial charges, but rather points to a procedure that should be followed for their calculation. Even in other FFs, modifying partial charges is sometimes used as a first pass for modifying/improving force field parameters for specific molecules. We can debate the merits of whether this approach is in line with TRUE simulations, but its a common use case. If we want GMSO to become a standard (thereby nudging everyone towards more reproducible science), we need to support common workflows. IMO we should support applying some forcefield for bonded params and non-charge non-bonded params, and then running a partial charge calculation and storing those charges in a GMSO object. Even consider the use case of literal ad-hoc modifications to charges. As long as that procedure can be well encapsulated in a script, I don't think its per-se in conflict with TRUE simulations.
From a design standpoint, I think mass
and charge
attributes are reasonable for a Site
. They are properties of a point-particle. IMO they are in some ways less of an interaction parameter than other non-bonded params; there isn't much debate about the proper functional form for two point charges (or point masses, in case of gravity) interacting with each other [under classical mechanics]. Since we don't normally have gravity, the role of mass is actually in the EOMs anyway. Either way I think we argue that mass
and charge
are Site
properties. There is probably a counter to that argument, but it seems reasonably defensible.
From a practical standpoint, @rmatsum836 I'm not aware of any cases where the mass would be decoupled from the atom type.
Got it. I think I'm reasonably convinced both that Sites should know their masses and charges, and that modification of these (particularly charges) after the fact is both acceptable usage and something that GMSO should support. In that case I'm wondering if maybe I should go in the opposite direction on one of my original suggestions, and maybe wonder if the solution is less normalization instead of more. Once a force field is applied, perhaps the mass and charge of the Site should be the sole source of truth, so you never look at the atom type. If these fields are going to exist anyway, practically speaking it might make more sense to always populate them once the force field is applied and then never refer to the atom type for those again.
Also, still curious about whether it really make sense for Sites to know their positions. Does a Site's absolute position have any real meaning outside of a Topology?
I don't think a Site's position has any meaning outside of a topology. Do you think that the positions for each Site should be stored in the Topology object instead? The only downside is that this would add a little bit more bookkeeping in the Topology to make sure that Sites and their positions are mapped correctly. It's definitely doable and I'm open to that idea if it will be useful.
One practical use case is if you can select Sites that belong to a specific region of your simulation box (like "z-coordinate < 10 nm" or something). That would probably be more easily done if positions were stored in the Topology as a np.Array.
I think at this stage the discussion is more about making conceptually the "right" design decisions as much as possible, since during early development some choices are inevitably made more out of a desire for convenience and rapid development. Given our discussion above on why mass and charge should be properties of a Site, I think in that same vein it makes sense to remove position and put it in topology, since I would define a topology as "a collection of sites at specific positions with some connectivity". A site should know enough to define itself, and a connection should also know enough to define itself, but a lot of the information relating them (such as where they are and how they're connected) are really properties of the topology, not the individual classes. As far as usefulness, I think it would dramatically simplify and speed up any operations that require acting on all the positions. It doesn't sound like that's a performance issue right now, but the positional filtering example in @uppittu11's last post would definitely be a practical benefit.
I guess that naturally leads to questions about how connections work. Connections seem like the trickiest aspect to me, because conceptually you could argue that they have meaning outside a topology, and clearly gmso devs seem to think there is value in being able to work with connections through the kind of API that's more than just a pair of indices mapping to another list of Site instances (correct me if anyone disagrees with that). My understanding is that you would want users to be able to do something like modify the specific site that's a member of a connection through an api like topology.bonds[i].connection_members[0] = Site(...)
, is that right? However, there's significant overhead associated with maintaining one list of Connection objects for each type of Connection, as opposed to something like a 2d numpy array of unsigned ints that index into a topology's list of sites. If it's important to both enable the kind of API that I wrote, but also operate on connections efficiently for more internal bookkeeping operations, then you want to create some intermediate representation that transparently supports both numpy-like indexing/slicing while also providing access to individual sites. Does that sound like something that gmso needs, or am I missing the desired behavior here?
Also, so it doesn't get lost, any thoughts on my discussion of IndexedSet above?
I'm no longer officially a stakeholder in this project, but I hope some of my mad ramblings can be of value [~:
In most simulation codes, there is no data structure that concretely and intentionally represents a molecule. Everything has a high-level structure and low-level atoms, but things get murky in between. Biophysics simulations have infrastructure to keep track of different parts of bipolymers (a single peptide/base pair is a residue, residues make up chains, chains make up segments .... or something like that). For as long as we've been using ParmEd as our backend, we've hacked residues to represent molecules (loosely defined as a collection of bonded atoms). But what if you wanted to keep track of a group of atoms, like grouping the solvent phase together, or group together all atoms in a nanoparticle? Residues can't "contain" other residues, we this sort of this has been impossible. Building something up from scratch, it makes no sense to fence ourselves in like this, so you could think of having a distinct Molecule
or MoleculeCollection
class. But what if you wanted to collect all atoms in a single nanoparticle in a group, but also be able to keep track of all of your 100 nanoparticles? This is the sort of thing that mBuild's hierarchy (everything is a Compound
, Compound
s can contain other Compound
s recursivly) is well-suited for. So, even though it's not really implemented yet, the goal of SubTopology
was to be flexible in a similar way, so that it can arbitrarily represent these sorts of collections.
The reason that I don't envision looping Topology
s on top of each other is that, in my opinion, there will inevitably be some information that must be stored at the highest level, and I think it's important to carry that distinction through, i.e. forbid some metadata from being stored in a SubTopology
. The easy examples can be argued either way, i.e. a Topology
is the place to store box information, but it's not crazy to imagine box information being valuable to a SubTopology
(i.e. a droplet of water on a gold crystal, the entire system is going to need to know its box but maybe it would be useful to keep track of the "box" that represents the gold). It gets hairier when thinking about storing other things, i.e. can two SubTopology
s have different force fields? Maybe, but what happens with things like combining rules and non-bonded cutoffs? These sorts of conflicts are only going to pop up more down the line. Perhaps this could be implemented as a "None
" special case of Topology
, but it seems like that could involve turning off a lot of stuff.
IndexedSet
s as themselves instead of tuples.If so, I'm wondering if it makes more sense to subclass site for specific types (e.g. Atom) and put the properties there
I arrived at this conclusion from a different path (which probably means it's a good idea!) but I do think that refactoring our current use of Site
as an atom would be beneficial. This would involve a refactor (potentially significant) to create an Atom
class that subclasses out of Site
and does things more specific to what an atom represents. This should also involve some other classes that represent UA/CG beads, virtual sites, etc., even if they are unimplemented at first. Polygons like the ones the Glotzer lab spends their days with probably shouldn't be implemented as Site
s since, as I understand them, they represent distinct 3-D shapes with volume and not discrete points in space (whether you are doing atomistic or coarse-grained simulations, the atoms/beads don't really have "volume," inherently, they are effectively just points no matter the potentials applied to them). I have not thought as much about these last points, but: I am inclined to suggest that the Site
class should never be used directly and should contain a pretty broad set of information, even if a lot of it is un-used in some of its child classes.
Yes, Site
s should know their positions, for reasons stated above.
I pretty much agree with @uppittu11 above; I view the modifications we want to support are going to be a handful of one-off modifications for which performance shouldn't be an issue, i.e. translating everything by [0, 0, 10] * u.nm
, although even simple things like that should arguably be handled by mBuild
(I wonder if it's worth thinking about functions here that internally call mBuild
to do rotations/translations/etc., or if it's better to just allow assignment of position values). However, there may be some value in an array view for convenience, for example if you just wanted to do a histogram of the Z positions of every atom in your system (although arguably that should be sent off to post-processing entirely ....)
This sounds like a useful feature, especially if things could also be used as a launching off point for serialization (i.e. if you can get something into a dict/JSON/pandas, you're either pretty close to safe serialization or you already have it).
I do think that we want to avoid duplication but I'm burnt out from many hashing-related heaches last year, so I will defer to those working on the implementation.
@mattwthompson thanks for the response! That provides a lot of extra useful information.
Thanks for the explanation of subtopologies, that gives me a better idea of the motivation. It is important to be able to deal with some intermediate level of structure. Here I'm not completely sure where you envision the division between mbuild and gmso, but my understanding is that you want gmso to support building up topologies by constructing pieces of them as subtopologies and then adding them together via Topology.add_subtopology
. However, the way it's currently set up the subtopology has no knowledge of its connectivity, because a subtopology is just a collection of sites. Conversely, I would expect to be able to construct e.g. different residues of a protein with appropriate connectivity independently of one another and then add them to a topology and have those bonds get combined in some way. With no thought given to implementation questions, is this the intuitive behavior that you would want? My understanding of the current implementation is that you would add a subtopology but then have to add desired connections to the topology after the fact.
I think applying different force fields to different parts of a topology is actually a use case that was discussed as something gmso should support during one of the recent dev calls, but I could be wrong. Box is actually a weird case; tbh I'm not sure it really fits in a topology. I would almost want to create a class (System
, Frame
, Universe
, Snapshot
... pick your favorite term from the various existing packages that do this) that contains a topology and a box (and possibly other info if needed). In the example you gave of the gold crystal I imagine (correct me if I'm wrong) that what really might be of interest is the bounding box, which could be computed as a simple np.min
and np.max
of the site positions. But to me a box isn't a property of a topology or a subtopology, and in fact I could imagine a system composed of a box and multiple topologies if, for instance, topologies were required to be connected graphs, which is not a requirement currently imposed by gmso.
All of this is to say, to me it seems like the goal of representing intermediate levels of organization like molecules is actually better served by nesting topologies than by creating a separate subtopology class, especially since in cases like protein quaternary structures it might really make sense to treat the individual chains independently.
As far as Sites, I think you all have different expectations on how you would work with these data structures than we do, so perhaps my commentary on this topic is less useful. I think we may have made some useful progress on how to handle some of the conflicts in recent dev discussions, but I'm not sure they're really compatible with some of the ideas here so for now I'll leave that up for the other core gmso devs.
I'm closing this issue since I don't think there are any actionable items from here. Other issues like #406 can address more concrete action items that can hopefully be informed by this discussion.
This issue is an offshoot of #385 which I'm using to document a number of questions about the data structures in GMSO. I took a brief look at GMSO's core this morning, and here are my somewhat organized thoughts. I'm sure a lot of these are dumb questions that have already been considered and addressed in during the development process, but hopefully the viewpoint of someone on the outside looking in will help clarify some of the concepts. I can update this if I think of more things in the future.
Topologies
None
) and reusing that class. Is there really a meaningful difference between what could constitute a topology and what could constitute a subtopology, and are there limits to the nesting?Sites
site
to beNone
if desired. What is the most general kind of site that could be permitted? A site with name "Site" and position(0, 0, 0)
, with all other properties asNone
? If so, I'm wondering if it makes more sense to subclass site for specific types (e.g.Atom
) and put the properties there.site_positions
as a NumPy array of positions,site_types
as a 1D NumPy array of indices intounique_sites
, which would contain actualgmso.Site
objects that have all the information they currently do except positions. This would also save you on the redundancy of having e.g. hundreds of Carbon sites on a protein backbone that are basically all storing identical information other than position. You could do similar things withConnection
s so they don't have to individually own sites, but that might be trickier depending on how independent you wantConnection
s to be from theTopology
(i.e. I'm guessing it's not enough to just store indices because I assume you want to be able to work with them somewhat independently). Fwiw, I believe MDAnalysis went through a similar process where they refactored their internal data structures from using a list of Atom objects to a NumPy array of positions and separating out the atoms because they realized the immense performance differences there. As I'm typing out this suggestion, I'm realizing that this is much closer to HOOMD's organization. I'm certainly probably a bit biased because of my extensive recent usage of HOOMD, but I have worked pretty heavily with GROMACS and a bit with LAMMPS in the past, and I've dabbled with MDAnalysis and MDTraj, so hopefully this viewpoint is still somewhat engine-agnostic.IndexedSet
objects suggests to me that you want to avoid duplication. If that's the case, I would implement a custom__hash__
method to clearly indicate what constitutes a uniqueSite
for insertion into those sets (and perhaps also for the variousConnection
subclasses).