mosdef-hub / gmso

Flexible storage of chemical topology for molecular simulation
https://gmso.mosdef.org
MIT License
52 stars 32 forks source link

Hierarchy of class structures #387

Closed vyasr closed 3 years ago

vyasr commented 4 years ago

This issue is an offshoot of #385 which I'm using to document a number of questions about the data structures in GMSO. I took a brief look at GMSO's core this morning, and here are my somewhat organized thoughts. I'm sure a lot of these are dumb questions that have already been considered and addressed in during the development process, but hopefully the viewpoint of someone on the outside looking in will help clarify some of the concepts. I can update this if I think of more things in the future.

Topologies

  1. What exactly is the purpose of a subtopology? Is the goal to represent something like a chain in a protein, since for instance a PDB file can contain multiple chains? I'm trying to understand why it makes sense to have a distinct subtopology object, rather than simply allowing topologies to have a parent (that could be None) and reusing that class. Is there really a meaningful difference between what could constitute a topology and what could constitute a subtopology, and are there limits to the nesting?
  2. The properties all have to convert from an IndexedSet to a list every time they're called; is that really necessary? I don't know exactly what the use cases you foresee for these are, but would it be so bad to force the user to do the conversion since it could be kind of expensive if they put it in a loop or something?

Sites

  1. The Site class claims to "as general as possible" to support all kinds of simulations. This is handled by allowing all of the properties of a site to be None if desired. What is the most general kind of site that could be permitted? A site with name "Site" and position (0, 0, 0), with all other properties as None? If so, I'm wondering if it makes more sense to subclass site for specific types (e.g. Atom) and put the properties there.
  2. Should sites really know their positions, or are they more about the "nature" of the site (i.e. a certain kind of atom or coarse-grained bead). This point and (1) are pretty closely related to point (3) in #385 I believe.
  3. I've noticed multiple issues and conversations associated with performance, and I think some form of database normalization might substantially improve this (although I don't know enough about the actual cases where you're running into performance issues, so I could be way off). If you're unfamiliar with the concept, the idea comes from relational databases like SQL where you reduce redundancy by storing unique entities in separate tables and have everything else point to that (I'm being very rough and skipping a lot of details, but this is the core idea that would be relevant here). I think resolving this would require having a very clear idea of exactly what constitutes a site, but what I'm suggesting would be to instead (at least internally) treat sites more like "site types" as per (2). Topologies would then store separate fields: site_positions as a NumPy array of positions, site_types as a 1D NumPy array of indices into unique_sites, which would contain actual gmso.Site objects that have all the information they currently do except positions. This would also save you on the redundancy of having e.g. hundreds of Carbon sites on a protein backbone that are basically all storing identical information other than position. You could do similar things with Connections so they don't have to individually own sites, but that might be trickier depending on how independent you want Connections to be from the Topology (i.e. I'm guessing it's not enough to just store indices because I assume you want to be able to work with them somewhat independently). Fwiw, I believe MDAnalysis went through a similar process where they refactored their internal data structures from using a list of Atom objects to a NumPy array of positions and separating out the atoms because they realized the immense performance differences there. As I'm typing out this suggestion, I'm realizing that this is much closer to HOOMD's organization. I'm certainly probably a bit biased because of my extensive recent usage of HOOMD, but I have worked pretty heavily with GROMACS and a bit with LAMMPS in the past, and I've dabbled with MDAnalysis and MDTraj, so hopefully this viewpoint is still somewhat engine-agnostic.
  4. Depending on what modes of access are important, you could always provide properties that automatically construct convenience objects similar to the current API if people really want them (you could even toss this in something like a pandas DataFrame if that would be useful).
  5. Your extensive usage of IndexedSet objects suggests to me that you want to avoid duplication. If that's the case, I would implement a custom __hash__ method to clearly indicate what constitutes a unique Site for insertion into those sets (and perhaps also for the various Connection subclasses).
umesh-timalsina commented 4 years ago

@vyasr Thank you for review of GMSO repository. I will be able to answer the discussion around performance and hashing:

  1. The justification for Sites and Connections not being hashable, is because they are mutable and hence this will cause a problem of ghost elements, unintended index changes etc. We deal with with hashable Potential and AtomType(which are also mubtable) objects by tracking changes on them once they within the realms of a Topology. The Connection and Site object represent a physical entity and hence two sites or two connections will never be the same. I agree that the properties could use some form of optimization, because forming objects on the fly is not good.
  2. If you're unfamiliar with the concept, the idea comes from relational databases like SQL where you reduce redundancy by storing unique entities in separate tables and have everything else point to that (I'm being very rough and skipping a lot of details, but this is the core idea that would be relevant here.

This is done for AtomTypes, BondTypes, AngleTypes, DihedralTypes, ImproperTypes. And Sites and Connections were meant to be redundant. One thing that could improve performance is to define a proper hierarchy such that updates to every lower member in the hierarchy wouldn't need an orchestration of updates from Topology.

vyasr commented 4 years ago

@umesh-timalsina thanks for the response! Here are my thoughts.

  1. Wanting Sites and Connections to be mutable makes sense for some use cases, but in that case I'm not sure what the point of using IndexedSets to store them in the Topology is. If the goal is just to avoid users from putting the exact same Site object into the Topology but the user could create two identical but distinct Site objects (i.e. same name, position, etc), then I would argue that this is a very weak protection and not worth using such a data type. In that case, I would just use a simple list to store these instead. Is that unviable for a different reason?
  2. What is the rationale for Sites and Connections being redundant in the way that you describe? This is an area where my lack of experience makes it hard to know what the right suggestion is, but ultimately I'm thinking that (assuming the user needs an API to access a Topology's sites through an API of Site instances) there's a tradeoff here. One option is to do what is currently being done, which is literally storing a set of Site objects in an internal collection and present those to the user whenever they query the relevant property. This is the easiest method to implement. The alternative is to parse provided sites into some internal data structures (perhaps what I proposed, perhaps something else), but provide convenience APIs for the user to be able to interact with that data structure as though it was a list of Sites. This method is harder, but would allow you to use internal data structures that are much more amenable to fast batch processing when it's needed. This method would also be facilitated by ducktyping inputs to add_site to allow either gmso.Site objects or other inputs, for instance tuples like (position, element) or so, but that could be a nice feature in any case. There are details to work out such as how you would create the underlying gmso.Site object, but I don't think that kind of thing is an insurmountable barrier.
umesh-timalsina commented 4 years ago

@vyasr The use of IndexedSet is primarily done for O(1) access for __contains__, which I believe is possible with a collection like set/dictionary and I agree that it will not solve the redundancy problem. But, in terms of redundancy, there is no meaningful insight I could provide because I am in the same boat as you when it comes to experience in this domain. Once, we decide the hierarchy of objects(#385) or a free-form implementation (#381), I believe performance hits we get can be resolved in a meaningful way by the use of proper data-structures/ and or design patterns.

vyasr commented 4 years ago

I see, if that's the primary reason then that makes sense. I don't know enough to know whether x in l is a common enough need to justify that, but it seems reasonable.

I agree that those decisions need to be made before considering performance, optimization should always be done at the end. However, I think those design decisions should be at least partially based on the common usage patterns. Realistically, if you allow as much flexibility as gmso currently plans to, there will be some ways to use the package that are faster than others. Abstract concerns about proper design can be helpful guides, but IMO choosing the design that is most amenable to the most common usage patterns should be weighted more heavily. For instance, relating to #381 I definitely see the argument but I'd like to play devil's advocate and question the premise: is this degree of flexibility really necessary in all cases, or can we place some constraints (at least in certain contexts such as Sites that are members of Topology objects) in order to simplify and optimize the implementation for common use cases where that flexibility is typically not needed? I'm not the right person to answer that question, but now seems like the right time to ask it :)

uppittu11 commented 4 years ago

To answer some of the questions in the top comment: Topologies:

  1. In my view Subtopologies are supposed to define a subset of Sites within a Topology object. They were originally intended to mirror the hierarchy of Compounds in mbuild. I like the idea of getting rid of the Subtopology and having a hierarchy of Topologies within Topologies since this is directly comparable to Compounds within Compounds in mbuild, but I'm not entirely sure how this would be implemented.
  2. I agree that converting the IndexedSet to a tuple every time we access topology.atom_types is super inefficient. I think we discussed this here (https://github.com/mosdef-hub/gmso/issues/364#issuecomment-601956770) when we found that looking up the indices of sites was a huge bottleneck. The issue with returning the original IndexedSet (topology._atom_types) is because we don't want the user to be able to access and modify that IndexedSet since it could mess up the bookkeeping (for example if someone added an connection to topology._connections, but didn't add the site that are part of the connection to the topology). One possible way to get around this (and I prefer this method) is returning a generator object from topology.atom_types() (See https://github.com/mosdef-hub/gmso/issues/372).

Sites:

  1. What other "types" of sites would we have besides Atom and Bead? Do you envision a way to be able to set up some of the HPMC shape simulations with GMSO? If so, that would be super cool.
  2. In the typical workflow, we place all of our molecules into their initial configuration with mBuild, then we convert to GMSO and apply forcefield parameters to the sites and connections, and then we save out the GMSO topology to disk. The reason we want to separate Sites and AtomTypes is because there are many sites that can share the same chemical information (AtomType). So having those sites point to the same AtomType rather than actually storing that information themselves would save a lot of the memory cost. So to answer your question, Sites contain its positions and point to the atom type, which is all the info you need to save the Topology to disk.
  3. Regarding performance and optimization, I think most of the operations we will be doing with topology are lookups (e.g. looking up the AtomType of a Site), so I think dictionaries are going to be the go-to data structure for those. Array operations might be useful for modifying positions, but I would argue that the positions of sites shouldn't change at all in a Topology, since that is the job of mBuild. It definitely makes sense to separate positions into a speedy np.Array for mdtraj/mdanalysis and hoomd since the positions will be regularly updated and operated on. But it's not really necessary for GMSO since positions should mostly be constant after the Topology is created.
  4. I think this is a really good idea. Pandas dataframes would be helpful for visualizing the forcefield parameters and comparing forcefields and that sort of thing. Also, I think having some sort of networkX graph visualization might be useful for visualizing connectivity in the system.
  5. In my view, the hash for Sites should just be its object ID since every Site instance is unique. For Connections it might be useful to define a hash. I may include that in #384 if this is something everyone is onboard with.

I hope this answers your questions @vyasr.

vyasr commented 4 years ago

@uppittu11 thanks for the answers, let me try and follow up.

Topologies

  1. I'm happy to help if there are technical roadblocks to achieving this, if you (and others) agree that this makes sense conceptually.
  2. Is the main reason for using IndexedSet is to support constant time __contains__? I can imagine that's important for when you do topology.add_connection to make sure that sites exist, otherwise that could be very slow. I don't actually see anywhere that takes advantage of the indexing property, though, so is there a reason you can't just use a normal set? It's not clear to me exactly what characteristics each of the data stores in a topology need to have, so I'm trying to get a grasp on that to figure out the right data structure. Let me try and make a list, let me know if I'm missing anything:
    • Fast existence tests so that avoiding duplicates on insertion is fast.
    • Uniqueness. It's unclear exactly what makes different objects unique though. For Sites, it sounds like you're OK with having two identical Sites as long as they're distinct objects, but distinct Connection objects that connect the same Sites are not allowed. Is that right?
    • Possibility for returning immutable copies to client code fast.

Sites

  1. Yeah I could imagine ways that you could parameterize an HPMC "force field" based on a specific type of Site. Moreover, "Bead" is pretty generic; you could definitely have different types of Beads that support specific additional fields. So at the moment I'm, just trying to make sure I understand exactly what constitutes a Site so that I understand where you might draw that line.
  2. I agree that it makes sense to do it that way. What confuses me then is why a Site would have fields like mass and charge. Specifically regarding position, what I'm getting at is the following: Is the position of a Site meaningful outside of a topology? I understanding wanting to be able to work with some object like a Site outside the context of a topology because it allows you to create something with a certain AtomType (for instance) and then assign it to multiple topologies, but I'm not convinced that it's location in physical space is meaningful without the coordinate system as defined by the topology.
  3. I think here is where my lack of knowledge w.r.t the exact goals of GMSO makes it hard to know how things should be designed. If positions really shouldn't change once a topology is created, my question would be: what should change? Deciding what's important to be mutable at this stage would help a lot in terms of deciding what data structures are really necessary, for instance this suggests to me that Sites really don't need to be that mutable at this stage (except maybe the atom_type).
  4. networkX stuff could definitely be really cool.
  5. I think I'll be able to comment more intelligently here based on your thoughts on some of the other points.
rmatsum836 commented 4 years ago

Addressing parts of points 2 and 3 regarding sites:

The reason for having a charge attribute for Site is that charges sometimes get tweaked after the atom typing step. An example of this is when we want to model a charged surface by adjusting the partial charges of the surface atoms. I think the current idea is that Site.charge would override the charge in the Atomtype, which is discussed in #225.

As for mass, I'm not really aware of any cases where an atom's mass differs from the mass defined by the atom type. If this holds true, maybe we don't need to store mass as an attribute of Site.

Other than that, I also agree that positions shouldn't change in Topology.

vyasr commented 4 years ago

Got it, I understand wanting to tweak charges. However, it's never quite that ad hoc, is it? If you want to assign different charges to surface atoms or ligands chelating some metal or whatever, you're usually dealing with essentially another class of particles. In other words, you're defining a different AtomType that applies to some Subtopology of your system (notwithstanding our discussion yesterday about whether Subtopology should exist or not). mbuild and foyer are designed to make it easier to create and run simulations across various simulation engines, but they're also designed to facilitate TRUE simulations, which in this case means making it easier to disseminate force fields. That being the case, it seems to me that the feature you really want is a way to perform post hoc modifications of a force field.

In principle (I don't know if SMARTS are actually flexible enough for this to be true), any change you want to make to specific charges should be possible to encode in a force field file itself. Then you could actually disseminate the specific force field file used for a given simulation and someone could apply it to a different molecule that they want to model the same way. If that's true, then maybe the real feature you need is to simplify the process of generating customized force fields, since I assume that the real barrier to entry here is actually writing XML files with the appropriate SMART strings for people who aren't as familiar with the nomenclature.

uppittu11 commented 4 years ago

Maybe @rsdefever could explain this better, but I think the main reason we wanted to make it easier and more efficient to calculate and assign partial charges to a system. For most forcefields (like GAFF), nonstandard residues are assigned partial charges from ab initio/semi-empirical calculations, which means that even though a group of atoms are of the same atom type, they may all have slightly different charges. So, we made charge a Site property so that the redundant atom type information (like vdW eqn+parameters) is not repeated for every atom.

There may be a better way to be able to customize the charges for every atom in a molecule, so we're open to suggestions on that.

Regarding mass as a site property, I guess the only use case is CG systems where the atoms -> bead mapping isn't always the same. For hexane, for example

CH3-CH2-CH2-CH2-CH2-CH3 -> "C2"-"C2"-"C2"

would have different masses for the middle C2 vs the two terminal C2s, even if the atom type is the same.

rsdefever commented 4 years ago

I don't have much to add here. I think @uppittu11 and @rmatsum836 have covered things pretty well. AFAIK charges are pretty frequently modified. Its not necessarily totally "ad hoc", -- e.g., GAFF does not provide any partial charges, but rather points to a procedure that should be followed for their calculation. Even in other FFs, modifying partial charges is sometimes used as a first pass for modifying/improving force field parameters for specific molecules. We can debate the merits of whether this approach is in line with TRUE simulations, but its a common use case. If we want GMSO to become a standard (thereby nudging everyone towards more reproducible science), we need to support common workflows. IMO we should support applying some forcefield for bonded params and non-charge non-bonded params, and then running a partial charge calculation and storing those charges in a GMSO object. Even consider the use case of literal ad-hoc modifications to charges. As long as that procedure can be well encapsulated in a script, I don't think its per-se in conflict with TRUE simulations.

From a design standpoint, I think mass and charge attributes are reasonable for a Site. They are properties of a point-particle. IMO they are in some ways less of an interaction parameter than other non-bonded params; there isn't much debate about the proper functional form for two point charges (or point masses, in case of gravity) interacting with each other [under classical mechanics]. Since we don't normally have gravity, the role of mass is actually in the EOMs anyway. Either way I think we argue that mass and charge are Site properties. There is probably a counter to that argument, but it seems reasonably defensible.

From a practical standpoint, @rmatsum836 I'm not aware of any cases where the mass would be decoupled from the atom type.

vyasr commented 4 years ago

Got it. I think I'm reasonably convinced both that Sites should know their masses and charges, and that modification of these (particularly charges) after the fact is both acceptable usage and something that GMSO should support. In that case I'm wondering if maybe I should go in the opposite direction on one of my original suggestions, and maybe wonder if the solution is less normalization instead of more. Once a force field is applied, perhaps the mass and charge of the Site should be the sole source of truth, so you never look at the atom type. If these fields are going to exist anyway, practically speaking it might make more sense to always populate them once the force field is applied and then never refer to the atom type for those again.

Also, still curious about whether it really make sense for Sites to know their positions. Does a Site's absolute position have any real meaning outside of a Topology?

uppittu11 commented 4 years ago

I don't think a Site's position has any meaning outside of a topology. Do you think that the positions for each Site should be stored in the Topology object instead? The only downside is that this would add a little bit more bookkeeping in the Topology to make sure that Sites and their positions are mapped correctly. It's definitely doable and I'm open to that idea if it will be useful.

One practical use case is if you can select Sites that belong to a specific region of your simulation box (like "z-coordinate < 10 nm" or something). That would probably be more easily done if positions were stored in the Topology as a np.Array.

vyasr commented 4 years ago

I think at this stage the discussion is more about making conceptually the "right" design decisions as much as possible, since during early development some choices are inevitably made more out of a desire for convenience and rapid development. Given our discussion above on why mass and charge should be properties of a Site, I think in that same vein it makes sense to remove position and put it in topology, since I would define a topology as "a collection of sites at specific positions with some connectivity". A site should know enough to define itself, and a connection should also know enough to define itself, but a lot of the information relating them (such as where they are and how they're connected) are really properties of the topology, not the individual classes. As far as usefulness, I think it would dramatically simplify and speed up any operations that require acting on all the positions. It doesn't sound like that's a performance issue right now, but the positional filtering example in @uppittu11's last post would definitely be a practical benefit.

I guess that naturally leads to questions about how connections work. Connections seem like the trickiest aspect to me, because conceptually you could argue that they have meaning outside a topology, and clearly gmso devs seem to think there is value in being able to work with connections through the kind of API that's more than just a pair of indices mapping to another list of Site instances (correct me if anyone disagrees with that). My understanding is that you would want users to be able to do something like modify the specific site that's a member of a connection through an api like topology.bonds[i].connection_members[0] = Site(...), is that right? However, there's significant overhead associated with maintaining one list of Connection objects for each type of Connection, as opposed to something like a 2d numpy array of unsigned ints that index into a topology's list of sites. If it's important to both enable the kind of API that I wrote, but also operate on connections efficiently for more internal bookkeeping operations, then you want to create some intermediate representation that transparently supports both numpy-like indexing/slicing while also providing access to individual sites. Does that sound like something that gmso needs, or am I missing the desired behavior here?

Also, so it doesn't get lost, any thoughts on my discussion of IndexedSet above?

mattwthompson commented 4 years ago

I'm no longer officially a stakeholder in this project, but I hope some of my mad ramblings can be of value [~:

Topologies

  1. What exactly is the purpose of a subtopology?

In most simulation codes, there is no data structure that concretely and intentionally represents a molecule. Everything has a high-level structure and low-level atoms, but things get murky in between. Biophysics simulations have infrastructure to keep track of different parts of bipolymers (a single peptide/base pair is a residue, residues make up chains, chains make up segments .... or something like that). For as long as we've been using ParmEd as our backend, we've hacked residues to represent molecules (loosely defined as a collection of bonded atoms). But what if you wanted to keep track of a group of atoms, like grouping the solvent phase together, or group together all atoms in a nanoparticle? Residues can't "contain" other residues, we this sort of this has been impossible. Building something up from scratch, it makes no sense to fence ourselves in like this, so you could think of having a distinct Molecule or MoleculeCollection class. But what if you wanted to collect all atoms in a single nanoparticle in a group, but also be able to keep track of all of your 100 nanoparticles? This is the sort of thing that mBuild's hierarchy (everything is a Compound, Compounds can contain other Compounds recursivly) is well-suited for. So, even though it's not really implemented yet, the goal of SubTopology was to be flexible in a similar way, so that it can arbitrarily represent these sorts of collections.

The reason that I don't envision looping Topologys on top of each other is that, in my opinion, there will inevitably be some information that must be stored at the highest level, and I think it's important to carry that distinction through, i.e. forbid some metadata from being stored in a SubTopology. The easy examples can be argued either way, i.e. a Topology is the place to store box information, but it's not crazy to imagine box information being valuable to a SubTopology (i.e. a droplet of water on a gold crystal, the entire system is going to need to know its box but maybe it would be useful to keep track of the "box" that represents the gold). It gets hairier when thinking about storing other things, i.e. can two SubTopologys have different force fields? Maybe, but what happens with things like combining rules and non-bonded cutoffs? These sorts of conflicts are only going to pop up more down the line. Perhaps this could be implemented as a "None" special case of Topology, but it seems like that could involve turning off a lot of stuff.

  1. I don't remember this detail, or if we noticed it along the way. I don't see the harm in trying a patch that returns those IndexedSets as themselves instead of tuples. I will defer to those who are more recently familiar with the implementation here; I'm not sure there's a huge impact on the design philosophy.

Sites

  1. If so, I'm wondering if it makes more sense to subclass site for specific types (e.g. Atom) and put the properties there

I arrived at this conclusion from a different path (which probably means it's a good idea!) but I do think that refactoring our current use of Site as an atom would be beneficial. This would involve a refactor (potentially significant) to create an Atom class that subclasses out of Site and does things more specific to what an atom represents. This should also involve some other classes that represent UA/CG beads, virtual sites, etc., even if they are unimplemented at first. Polygons like the ones the Glotzer lab spends their days with probably shouldn't be implemented as Sites since, as I understand them, they represent distinct 3-D shapes with volume and not discrete points in space (whether you are doing atomistic or coarse-grained simulations, the atoms/beads don't really have "volume," inherently, they are effectively just points no matter the potentials applied to them). I have not thought as much about these last points, but: I am inclined to suggest that the Site class should never be used directly and should contain a pretty broad set of information, even if a lot of it is un-used in some of its child classes.

  1. Yes, Sites should know their positions, for reasons stated above.

  2. I pretty much agree with @uppittu11 above; I view the modifications we want to support are going to be a handful of one-off modifications for which performance shouldn't be an issue, i.e. translating everything by [0, 0, 10] * u.nm, although even simple things like that should arguably be handled by mBuild (I wonder if it's worth thinking about functions here that internally call mBuild to do rotations/translations/etc., or if it's better to just allow assignment of position values). However, there may be some value in an array view for convenience, for example if you just wanted to do a histogram of the Z positions of every atom in your system (although arguably that should be sent off to post-processing entirely ....)

  3. This sounds like a useful feature, especially if things could also be used as a launching off point for serialization (i.e. if you can get something into a dict/JSON/pandas, you're either pretty close to safe serialization or you already have it).

  4. I do think that we want to avoid duplication but I'm burnt out from many hashing-related heaches last year, so I will defer to those working on the implementation.

vyasr commented 4 years ago

@mattwthompson thanks for the response! That provides a lot of extra useful information.

Thanks for the explanation of subtopologies, that gives me a better idea of the motivation. It is important to be able to deal with some intermediate level of structure. Here I'm not completely sure where you envision the division between mbuild and gmso, but my understanding is that you want gmso to support building up topologies by constructing pieces of them as subtopologies and then adding them together via Topology.add_subtopology. However, the way it's currently set up the subtopology has no knowledge of its connectivity, because a subtopology is just a collection of sites. Conversely, I would expect to be able to construct e.g. different residues of a protein with appropriate connectivity independently of one another and then add them to a topology and have those bonds get combined in some way. With no thought given to implementation questions, is this the intuitive behavior that you would want? My understanding of the current implementation is that you would add a subtopology but then have to add desired connections to the topology after the fact.

I think applying different force fields to different parts of a topology is actually a use case that was discussed as something gmso should support during one of the recent dev calls, but I could be wrong. Box is actually a weird case; tbh I'm not sure it really fits in a topology. I would almost want to create a class (System, Frame, Universe, Snapshot... pick your favorite term from the various existing packages that do this) that contains a topology and a box (and possibly other info if needed). In the example you gave of the gold crystal I imagine (correct me if I'm wrong) that what really might be of interest is the bounding box, which could be computed as a simple np.min and np.max of the site positions. But to me a box isn't a property of a topology or a subtopology, and in fact I could imagine a system composed of a box and multiple topologies if, for instance, topologies were required to be connected graphs, which is not a requirement currently imposed by gmso.

All of this is to say, to me it seems like the goal of representing intermediate levels of organization like molecules is actually better served by nesting topologies than by creating a separate subtopology class, especially since in cases like protein quaternary structures it might really make sense to treat the individual chains independently.

As far as Sites, I think you all have different expectations on how you would work with these data structures than we do, so perhaps my commentary on this topic is less useful. I think we may have made some useful progress on how to handle some of the conflicts in recent dev discussions, but I'm not sure they're really compatible with some of the ideas here so for now I'll leave that up for the other core gmso devs.

vyasr commented 3 years ago

I'm closing this issue since I don't think there are any actionable items from here. Other issues like #406 can address more concrete action items that can hopefully be informed by this discussion.