shinzlet / atomCAD

Mozilla Public License 2.0
0 stars 0 forks source link

Edit history & Feature tree #5

Open shinzlet opened 1 year ago

shinzlet commented 1 year ago

We plan for AtomCAD to use a relatively standard feature tree and edit history in order to allow manipulating a complex design after it is made (and also simpler things, like undo/redo). Demanding this introduces several considerations:

What is the actual data structure of the edit history?

The most obvious implementation is Vec<impl Feature> + a usize to indicate where in the feature vec we "currently are" (i.e. the rollback bar position overtop of the edit history). Feature is a trait that describes some robustly specified (i.e. it should avoid breaking when it is moved through the timeline) operation on a molecule.

How do features actually work?

One way of implementing a feature could be to specify some update function that takes a mutable reference to a Molecule and applies the operation to it. This is fully functional, but it forces a full rebuild every time you go back in history, even just by one step.

There are two main optimizations that I can think of to work around this (and they can be composed together). One is to record the full molecule state at several spaced points in the edit history, and then, when moving back or forth, apply features starting from the nearest checkpoint. This improves speed significantly, but would require a significant increase in used memory (unless these checkpoints were stored on disk).

Another optimization is to require features to be more complicated: force them to be able to both apply and undo a transformation. This would allow a single step back to be done by applying one feature in reverse, rather than by rebuilding from the beginning (or the nearest checkpoint). The issue with this is that reverse transformations are extremely nontrivial. For example, imagine that one feature is to delete an entire branch from a central atom. The forward operation is easy to specify - only a few node indexes need to be stored.

The reverse operation, however, requires adding the branch back on - so the feature must store every atom and position and bond that it had deleted, and recall how to reinstall them. These are entirely different functions, one that only needs a small amount of data and one which requires a lot.

On further consideration, it seems that even if features are reversible, checkpoints would still be needed: it should not be computationally intensive to go from viewing the final step in the edit history to the first or vice versa, and without checkpoints you have to apply or un-apply all the features.

Based on this analysis, I think the best approach is to implement forward feature application (leaving the possibility for reverse application) AND on-disk checkpoint storage. For practicality's sake, it might be most effective to keep the checkpoints in memory behind some adaptor, and then only implement the disk storage later down the line when file saving and serialization are implemented.

Rendering?

Rebuilding from features will probably require a full push of the atoms buffer each time. I don't think there's a way around this unless we shard Atoms and add a ton of logic to make sure features report which atoms they effect.

Concurrency and Thread Safety

Although it is not on the horizon, it may eventually be desirable to apply isolated features concurrently during a rebuild (for example, if the parts of the molecule they effect are very far apart / noninteracting). This is a rare case but could provide speedups in certain cases. I'm comfortable ignoring this as being out-of-scope for the time being.

Feature Dependencies & Topological Naming

The current implementation of Molecule uses a petgraph StableGraph. The node indexes that the stable graph produces are deterministic (i.e. the same operations will give the same atoms the same indices), but this is not enough. Assume you have a feature A, which produces an atom (give it index 1), and a feature B, which adds a child to atom 1. If we roll back before feature A, and add another atom somewhere unrelated, that atom will be called atom 1. Then, when feature A runs, it will add atom 2, and so feature B will change its target if it's only instruction is "do this to atom 1".

Instead, feature application has to be specified with respect to previous features. In the previous example, feature B cannot refer to atom 1 - it must instead refer to "the atom produced by Feature A". This is more complicated, but it is much more robust. Additionally, when something is done in the edit history that compromises feature A, that error must also propagate to feature B and so on.

These considerations lead me to believe that a stable graph may not even be needed. The reason I initially opted for it was so that each atom could pick a specific bond as it's primary axis, and then place other atoms about that axis following VSEPR rules. If all features depend on other features, then there's really no need for one atom to refer to another (the feature dependenty system would already provide a way of stably referring to atoms).

As a final note, I think that feature dependency is much more complicated than I am implying. Features will not just introduce a single atom - they will often produce many (patterning, mirroring, etc). Some method will be needed to refer to these, and I'm not sure exactly how ("Feature A/atom 1"? "Pattern Feature/First copy of Feature A/First atom in Feature A"?). Ultimately, every atom that exists needs to be traceable through every feature that created it!

User Interface and Experience

Error Handling and Recovery

We need to ensure that if something goes wrong in a rebuild, that it does not:

shinzlet commented 1 year ago

Currently, the Feature API exposes a method called get_atoms which returns all of the atoms created by a feature (using their AtomIndex). This requires further consideration. Are these recorded when the feature is applied, and undefined beforehand? They likely would need to be, as many features will create a number of atoms which depends on the features before them in the edit history (i.e. a mirror operation). It is also theoretically possible to make this method compute the atom indexes which will be used at invocation time, but this requires a reference to the feature tree, a deterministic index for every node (I think petgraph provides this already), and will have to return a result anyway in case the computation fails due to a dependency error.

If the atom indexes are recorded during application, then there needs to be a mechanism for the editor to invalidate those applications once the molecule is rolled back in the edit history (because those atoms will no longer exist). That or the FeatureList API stores the current position in edit history and prevents its user from accessing the atom indexes for a feature that may or may not resolve.

For now I am leaving this alone, but it will need work. I think the latter approach regarding FeatureList API design is best.

shinzlet commented 1 year ago

I'm struggling to figure out the architecture for dependencies in a way that satisfies all the requirements. My current implementation allows a feature to figure out how it should be applied - i.e. it makes it easy to figure out every atom that would be created by applying the feature. But this is not yet two-way: given an atom in the feature, it is nontrivial to figure out what feature originally created it. To be more explicit, if you create a mirror or pattern feature, and then click on one of the derived atoms, atomCAD cannot trace that atom back to its source.

Another complication is whether a feature that copies others should just copy their atoms or create ghost features that copy them fully. To clarify, imagine the following process: 0: create a central carbon (C) 1: attach a nitrogen to C (C-N) 2: attach a phosphorus to N (C-N-P) 3: mirror the molecule (the bonding between the original and mirrored copy is not important for this example: C-N-P P-N-C) 4: attach a F to the P on the mirrored copy (C-N-P F-P-N-C) 5: roll back in the edit history to after step 1 (just C-N) 6: attach a sulfur after the N (C-N-S) 7: reparent the phosphorus onto the sulfur (C-N-S-P) 8: bring the rollback bar to the end. With a working feature tree, we'd expect the fluorine to still be attached to the phosphorus (C-N-S-P F-P-S-N-C)

If the mirror operation worked by naively copying atoms, it would initially have 3 child atoms after step 3 (P-N-C). So, the AtomSpecifier that anchors the fluorine to the phosphorus in step 4 would refer to MirrorFeature.atom3. After we rollback, though, we have added a fourth atom to the molecule which gets mirrored! So after the mirror operation in the updated future, MirrorFeature.atom3 is a sulfur, not a phosphorus: we would get (C-N-S-P P-S(-F)-N-C), which is different.

So, the mirror operation cannot just copy atoms - it must in some way keep track of which child atoms belong to which historical features, so that future features which reference the mirrored copy can anchor themselves (via AtomSpecifier) onto mirrored features, rather than an unstable index into the mirrored atom collection.

shinzlet commented 1 year ago

Having encountered these issues, I get the feeling that I need to more explicitly plan the feature design. Features are the core of modern CAD workflows:

Requirements:

The last point and its children are extremely important, and is the related to the issue I was encountering before. When copying a feature, the copied atoms should be referred to using feature specifiers, not an index into some atom buffer. Otherwise, the copied atom data is separate from the entire feature paradigm, and references to the copied atoms are brittle.

Implementation

Naming Features

I have already found a satisfactory solution for naming features. Features are stored in a FeatureList container, which is composed of a Vec, a counter, and a HashMap. The Vec stores a list of feature IDs (which are size). The order of the feature IDs in the Vec describes the order in which the features are applied to the molecule. The HashMap maps feature IDs to owned Features (Box), and is used to look up each feature. Every time a feature is created, it's ID is taken from the current value of the counter (which is then incremented by one). When a feature is removed, the counter is not changed. This system ensures that reordering, adding, and removing features will not have any effect on the identifiers of other features.

Referencing Features

To reference a feature, a Feature can just store its identifier (as discussed, this is stable over time).

Referencing Molecules

A similar system to FeatureList could be used to provide molecules with stable identifiers, but more likely they would be referenced by a filename or string identifier that can be used to lookup a specific version of the molecule. To specify a feature or atom in the target molecule, the same mechanisms described here for local features and atoms could be used.

Reparenting Features

A feature can be reparented just by changing the atom / feature references that it stores.

Referencing Atoms

Imagine a simple (not chemically valid) part that is created by the following features:

Feature ID Description visualization
0 create molecule C
1 create & attach silicon atom C-Si
2 passivate the silicon C-SiH3

Feature 0 does not reference anything (it is the feature tree root). Feature 1 references feature 0. Because Feature 0 is primitive (it contains atom data that is not developed by any other features), it is acceptable to just directly index into it: so the silicon created could use a specifier like "feature id: 0, atom index: 0". Note that "atom index: 0" is a reference to its index in the feature's atom buffer, not a reference to its AtomIndex in the molecule graph! Feature 2 references feature 1. This is a different case than before, because feature 1 explicitly depends on feature 0: but no new mechanisms are needed to handle this in a robust way. This is because the feature 1 owns its atom - it did not reference any part of the molecule when choosing to create a silicon, it only referenced feature 0 so that it could anchor that atom somewhere. From this point of view, feature 0 is similar to feature 1 - they both own their atoms, and know where to anchor them. Thus, feature 2 can depend on feature 1 trivially: it just passivates the atom with "feature id: 1, atom index: 0" (and again the index is zero in the feature, although the AtomIndex of the Si is 1 in the molecule graph).

In a Feature with Dependent Geometry

I emphasized how each feature in the previous example owned its atom data, because that is the simplest scenario. Things get much more complicated when a feature does not own the atoms it is creating (i.e. it has geometry which depends on other features). For example:

Feature ID Description visualization
0 create molecule C
1 add a Si C-Si
2 mirror the molecule to duplicate the arrangement C-Si-Si-C
3 passivate the duplicated silicon C-Si-SiH2-C

Features 0, 1, and 2 are still trivial here: Feature 0 does not make any reference to other atoms. Feature 1 references feature 0, but both own their data. Feature 2 references both previous features at atom 0, but it does not own them. Feature 3 is where things get complicated. Assume for a moment that we ignore the earlier rule that atoms cannot be referred to by an index. Feature 3 will then reference feature 2 atom 1 (assuming feature 2 contains some buffer that says "make carbon first, then make the silicon"). If we roll back the feature history and add another atom before the mirror operation, we will change what those buffer indexes mean! Now, when we restore the timeline to after feature 3, feature 2 atom 1 will still be passivated - even though feature 2 atom 1 might not refer to the silicon!

This explains why atom indexing is illegal in most cases. To work around this, feature 3 needs to specify its target atom much more robustly. Instead of asking for "feature 2 atom 1", it must say "atom 0 in the copy of feature 1 that feature 2 made". This solves the problem entirely: feature ids are stable (as mentioned earlier), so mucking around in the timeline will never invalidate this reference. The downside is that figuring out a good way to express this in code is challenging.

The simplest way I can think of to represent this in a data structure is to label every atom with its ownership dependency path (i.e. a list of feature ids that walk you back to its source data). For example, in the C-Si-SiH2-C example before, from left to right, the labels are: C: [0, 0] (owned by feature 0: atom 0) Si: [1, 0] (owned by feature 1: atom 0) Si: [2, 1, 0] (owned by the mirror feature's copy of feature 1: atom 0) C: [2, 0, 0] (owned by the mirror feature's copy of feature 0: atom 0) first hydrogen: [3, 0] (although these reference feature 2, their atom data is owned by the passivation) first hydrogen: [3, 1] (passivation atoms are primitive and owned, its atoms are indexed at 0 and 1 in feature 3)

These atom paths can then be used like indexes. When the passivation feature is run, it knows that It references feature2.feature1.atom0. Figuring out what AtomIndex in the graph this represents is another challenge, but there are obvious solutions (the simplest is having feature 2, upon being applied, create a map that connects these atom paths to their cloned indexes).

The whole solution described here is extremely memory heavy. Some possible optimizations are:

Applying a Feature

shinzlet commented 1 year ago

Design Proposal

After mulling this over for a while, I think that the design requirements restrict us down to only one real solution. Ignore the previous hints at an implementation unless I mention them explicitly, they were mostly brainstorming.

  1. Stable identifiers must exist for every feature and every atom. The FeatureList is effectively unchanged.

  2. An AtomSpecifier type is introduced with the following fields:

    • author_id: The feature ID that actually created an atom (i.e. the feature who's apply method created the node)
    • owner_id: The feature ID that owns that atom's data (may be equal to author_id if this atom was created by a feature that owns atom data)
    • owner_atom_index: The atom index in the owner feature where this atom's source data can be found. Note that this does not imply any sort of deduplication or compression of actual graph node data: both this graph node and the owner atom will store an element, position, etc - this field is just for feature tracking.
  3. The molecule graph needs to be updated so that each node stores its atom specifier (in addition to the regular atom data).

  4. A feature has an apply method that accepts a reference to self, a reference to the FeatureList, and a mutable reference to a command queue of type impl MoleculeCommands. This is quite different than the current design:

    • apply does not mutate the feature. This means that it can always be called again on the same inputs and deliver an identical result. (barring bizarre features that involve external factors like timing delays or RNG, which are not considered)
    • The command queue is the only way for a feature to request changes to the molecule or resolve AtomSpecifiers into AtomIndexes. This means that features can be applied virtually, so earlier user actions can be replayed as a part of derived features (the mirror feature can re-run features by providing its own command queue and rerunning the features, rather than just literally mirroring atoms by their coordinates)
  5. A feature has a get_references(&self) method that returns a list of the AtomSpecifiers that it stores.

Requirements Satisfied

Requirement Implementation Details
A feature must either describe or enact a list of trivial changes to a molecule (i.e. you must be able to use a feature to realize updates on a molecule) ✅ Command queues are used to mediate molecule changes
It must be possible to figure out which features a feature depends on ✅ this can be done by interrogating get_references recursively until root features are found.
It must be possible to figure out which feature (features) that an atom depends on ✅ an atom stores an AtomSpecifier that describes what created it and what feature owns its data. The feature that created an atom can have its lineage traced as just described, so an atom's ancestor features can be found in linear time and with minimal space needs.
It must be possible to figure out which features depend on a given feature ⚠️ This can be done by iterating over every feature and looking at their dependencies, but it is a slow process. Optimizations are possible by storing two-way dependency links between features, but this is out of scope for now.
Features must have stable identifiers (i.e. adding / removing features should not change the identifiers of any other feature. Editing a feature should not change its identifier.) ✅ Implemented via the FeatureList
If a feature is removed, its identifier may be reused - removing a feature should explicitly invalidate all of its dependents 4
Feature operations should fail gracefully, describe the error, and never lead to a crash ✅ This is just a matter of good code quality and error reporting.
It should be possible to reparent a feature and successfully reapply it (e.g. moving a moiety)
It should be possible to create features that reference atoms (e.g. if you want to bond a new atom to an existing one), other features (e.g. if you want to pattern the creation of a feature), or other molecules (e.g. if you want to create a tether between two molecules) ✅ Features can contain arbitrary data (e.g. to reference molecules or files) as well as AtomSpecifiers which robustly track an atom based on its features of origin.
It must be possible to specify an atom exactly only by referencing features (without even knowing an AtomIndex). Atoms can be specified by an index ONLY if:
- you are indexing a feature: e.g. if you create a linear pattern and want to refer to the third copy of a feature, you can index with 3
- you are instantiating a set of atoms of non-atomCAD origin (like a PDB file or a base sketch)
7

TODO: Finish this post. As I was writing this, I realized that one situation that isn't accounted for is repeatedly copying the same atom. AtomSpecifiers are denoted as author_id.owner_id.owner_atom_index after the atom name feature 0. make carbon: C[0.0.0] feature 1. mirror everything: C[0.0.0]-C[1.0.0] feature 2.mirror everything: C[0.0.0]-C[1.0.0]-C[2.0.0]-C[2.0.0] After a while you start getting duplicate AtomSpecifiers, which breaks feature tracking entirely possibly by adding this to AtomSpecifier (but then it gets more confusing with multiple levels of copying)

Using an author copy index: (author_id.author_copy_index, owner_id.owner_atom_index) feature 0. make carbon: C[0.0, 0.0] feature 1. mirror everything: C[0.0, 0.0]-C[1.0, 0.0] feature 2.mirror everything: C[0.0, 0.0]-C[1.0, 0.0]-C[2.0, 0.0]-C[2.1, 0.0]

One concern I have with this is that it exposes a copy index (which does not seem particularly feature-respecting). I believe it may be possible to create conflicting atom specifiers using this copy index: feature 0. make carbon: C[0.0, 0.0] feature 1. Attach silicon: C[0.0, 0.0]-Si[1.0, 1.0] feature 2. make 3 copies: C[0.0, 0.0]-Si[1.0, 1.0]-C[2.0, 0.0]-Si[2.0, 1.0]-C[2.1, 0.0]-Si[2.1, 1.0] Now roll back: feature 3. Attach Germanium: C[0.0, 0.0]-Si[1.0, 1.0]-Ge[3.0, 3.0] Roll the timeline to the end (the feature order is 0, 1, 3, 2): C[0.0, 0.0]-Si[1.0, 1.0]-Ge[3.0, 3.0]-C[2.0, 0.0]-Si[2.0, 1.0]-Ge[2.0, 3.0]-C[2.1, 0.0]-Si[2.1, 1.0]-Ge[2.1, 3.0] Now adjust the pattern feature to only produce 2 copies! C[0.0, 0.0]-Si[1.0, 1.0]-Ge[3.0, 3.0]-C[2.0, 0.0]-Si[2.0, 1.0]-Ge[2.0, 3.0] I had expected this might cause a problem: before we started adjusting the timeline there were six atoms; now there are six different atoms. But the owner feature index mitigated this problem (and would do so even if there were multiple atoms in one feature, thanks to the owner atom index.

This scheme even works when multiple copies are made (erasing the copy index of an intermediate feature doesn't cause problems like I suspected it might) feature 0. make carbon: C[0.0, 0.0] feature 1. copy twice: C[0.0, 0.0]-C[1.0, 0.0]-C[1.1, 0.0] feature 2. Copy twice: C[0.0, 0.0]-C[1.0, 0.0]-C[1.1, 0.0]-C[2.0, 0.0]-C[2.1, 0.0]-C[2.2, 0.0] all of these atoms are still uniquely specified

feature 0. make carbon: C[0.0, 0.0] feature 1. copy once: C[0.0, 0.0]-C[1.0, 0.0] feature 2. Copy once: C[0.0, 0.0]-C[1.0, 0.0]-C[2.0, 0.0]-C[2.1, 0.0] roll back to feature 1, adjust it so that it copies twice: C[0.0, 0.0]-C[1.0, 0.0]-C[1.1, 0.0] restore to the end of feature 2: C[0.0, 0.0]-C[1.0, 0.0]-C[1.1, 0.0]-C[2.0, 0.0]-C[2.1, 0.0]-C[2.2, 0.0]

feature 0. make carbon: C[0.0, 0.0] feature 1. copy once: C[0.0, 0.0]-C[1.0, 0.0] feature 2. Copy once: C[0.0, 0.0]-C[1.0, 0.0]-C[2.0, 0.0]-C[2.1, 0.0] Note that C[2.1, 0.0] is currently the image of C[1.0, 0.0]. roll back before 1: feature 3. copy once: C[0.0, 0.0]-C[3.0, 0.0] Now, reapply feature 1: C[0.0, 0.0]-C[3.0, 0.0]-C[1.0, 0.0]-C[1.1, 0.0] Now, reapply feature 2: C[0.0, 0.0]-C[3.0, 0.0]-C[1.0, 0.0]-C[1.1, 0.0]-C[2.0, 0.0]-C[2.1, 0.0]-C[2.2, 0.0]-C[2.3, 0.0] After this timeline change, C[2.1, 0.0] is actually the image of C[3.0, 0.0]! This means that the 4-field atom specifier diverges from a full atom lineage specification.

By contrast to the above, if we specify atoms with the tedious full lineage (C[0.3 1.4 2.5] means the sixth copy (index 5) of the third feature's (index 3) copy of (the fifth copy of the second feature's copy of (the fourth atom in primitive feature 0, the first feature))":

Feature 0: C[0.0] Feature 1: C[0.0]-C[0.0 1.0] Feature 2: C[0.0]-C[0.0 1.0]-C[0.0 2.0]-C[0.0 1.0 2.0] Here, C[0.0 1.0 2.0] is the image of C[0.0 1.0] Roll back and add feature 3: C[0.0]-C[0.0 3.0] Move back to end: C[0.0]-C[0.0 3.0]-C[0.0 1.0]-C[0.0 3.0 1.0]-C[0.0 2.0]-C[0.0 3.0 2.0]-C[0.0 1.0 2.0]-C[0.0 3.0 1.0 2.0] C[0.0 1.0 2.0] is still the image of C[0.0 1.0]: all is good :)

I think we can prove information theoretically that there is no way to robustly shorten an atom lineage. Instead, we must make use of heavy caching and use small integer sizes to prevent excessive memory use (this introduces limitations like "a feature cannot make more than 255 - i.e a u8 - copies of something). For now I will not do this.

shinzlet commented 1 year ago

I have started implementing a full-lineage based feature system.