glotzerlab / signac

Manage large and heterogeneous data spaces on the file system.
https://signac.io/
BSD 3-Clause "New" or "Revised" License

Proposal: Unify dict classes and improve buffering and synchronization #249

Closed: vyasr closed this issue 3 years ago

vyasr commented 4 years ago

Tl;dr: We need to improve synchronization and caching logic, and I think the first step is to combine the _SyncedDict, SyncedAttrDict, and JSONDict classes.

I apologize in advance for the lengthy nature of this issue. This issue will serve as a pseudo-signac Enhancement Proposal; I'll try to document it very thoroughly, and it can be a test case for the utility of such proposals :)

In view of our recent push for deprecations and our discussion of reorganizing namespaces and subpackages to prepare for signac 2.0, I'd like to also revisit discussion of the different dict classes. We have various open bugs and features (#234, #196, #239, #238, #198) that are related to improving our synchronization and caching processes. Our synchronization clearly has some holes in it, and in the process of making #239 @bdice has raised concerns about inconsistencies with respect to cache correctness and cache coherence, e.g. the fact that a job that exists and is cached will still exist in the cache after it is deleted (Bradley, feel free to add more information).

Fixing all of these is a complex problem, in part due to fragmentation in our implementation of various parts of the logic. I'd like to use this issue to broadly discuss the various problems that we need to fix, and we can spawn off related issues as needed once we have more of a plan of attack to address our problems. Planning this development more thoroughly is critical since the bugs that may arise touch on pretty much all critical code paths in signac. I think that a good first step is looking into simplifying the logic associated with our various dictionary classes. That change should make it easier to improve #198 since synchronization will be in one place. After that, I think it will be easier to consider the various levels of caching and properly define the invariants we want to preserve.

With respect to the various dictionary classes, I think we need to reassess and simplify our hierarchy:

@csadorf @bdice @mikemhenry any commentary on this is welcome, also please tag any other devs who might have enough knowledge of these topics to provide useful feedback.

csadorf commented 4 years ago

The class hierarchy reflects the subtly different responsibilities needed to implement the job-document and job-statepoint interfaces. While I don't think that the hierarchy is wrong, I understand that the organization with multiple and diamond inheritance is hard to penetrate and that refactoring toward simpler relationships would likely be beneficial.

I am therefore in principle not against a refactoring attempt; however, I caution that this would be a non-trivial undertaking and would require very careful implementation and review, since this is an extremely sensitive area with respect to data integrity and performance. A lot of effort has been put into the current code base to ensure both of these aspects, and refactoring or possibly even replacing this code will require reconstructing many of the considerations that went into the current implementation.

Before we can move forward, I would like to see a class diagram of how the revised code base would be organized. I would also like to see a detailed plan for how caching would be achieved. If we are already planning a major revision for this part of the code base, we should revisit the design for signac 2.0.

Here are my recommendations for concrete next steps:

  1. Review the design for signac 2.0. It already contains similar ideas for simplification.
  2. Create a class diagram of what the revised hierarchy would look like.
  3. Ensure that the revised hierarchy allows different caching backends to be plugged in. For signac 2.0 I envisioned that we would use a simple in-memory cache by default, but also allow users to, for example, employ a Redis database for slightly less volatile caching if the data space grows larger.

Please have a look at the signac 2.0 design and the prototype implementation. I am confident that a lot of ideas for simplification and improved caching have already been prototyped there.

Finally, I would recommend that we implement these core data classes in separate packages. This would help us with a clear separation of concerns, but also enable us to release these packages separately on PyPI if we are so inclined. I would think that a lot of users could make use of carefully implemented JSONDict and H5Store data classes without needing to use signac and its namespace.

vyasr commented 4 years ago

I'm fine with implementing these new versions in a separate package, I agree that it would be beneficial.

I completely agree that some of this is addressed by the signac 2.0 prototype. My primary concern with signac 2.0 was that since the changes were dramatic enough that we were considering rewriting the codebase from the ground up and then introducing compatibility layers, we ran the risk of ending up with a partial solution that wouldn't be a drop-in replacement. I think that revisions to our hierarchy of synced collections is self-contained enough and high leverage enough that we could accomplish it within the existing framework, and reap the benefits even if we made no further progress on the other components of the 2.0 proposal. My suggestion for 2.0 would be to identify similar changes that we could make to 1.x in place, and once we've completed such changes we can reevaluate the magnitude of further changes required for 2.0.

That being said, I think one of the major changes we should make that is orthogonal to the concerns 2.0 addressed is that we need to generalize our syncing to other data structures. In particular, #196 and #198 suggest the need for a more generic SyncedCollection or so. That change would be part of the rewrite proposed here.

Unless I'm very mistaken, I don't think there is any multiple or diamond inheritance. I think that a lot of what makes things confusing is really that the logic for the separation of concerns is not clear, particularly with respect to the buffering and caching logic. In particular, the fact that buffering is implemented at the JSONDict level because that class is used for documents rather than statepoints is not clear without some careful reading of the code base. Additionally, the fact that some of the json file writing logic is in job.py rather than in the dict classes makes it quite difficult to track down bugs or manage synchronization and caching. While some of that separation may be unavoidable, I think it would help us a lot if we can isolate such logic to the dict classes as much as possible, and that's part of what I'd like to achieve here.

I'll post again when I have a class diagram, that's an important first step.

csadorf commented 4 years ago

I'm fine with implementing these new versions in a separate package, I agree that it would be beneficial.

I've talked to @bdice about this and he had some concerns that it would increase our maintenance burden too much. I'm coming around to his viewpoint and also think it's not a huge problem to delay such a split, because we can always do that in subsequent releases without harm.

I completely agree that some of this is addressed by the signac 2.0 prototype. My primary concern with signac 2.0 was that since the changes were dramatic enough that we were considering rewriting the codebase from the ground up and then introducing compatibility layers, we ran the risk of ending up with a partial solution that wouldn't be a drop-in replacement. I think that revisions to our hierarchy of synced collections is self-contained enough and high leverage enough that we could accomplish it within the existing framework, and reap the benefits even if we made no further progress on the other components of the 2.0 proposal. My suggestion for 2.0 would be to identify similar changes that we could make to 1.x in place, and once we've completed such changes we can reevaluate the magnitude of further changes required for 2.0.

I agree that a complete rewrite of the code base would require too much effort, so implementing these improvements in stages is totally reasonable. I just want to make sure that we take the design for signac 2.0 into consideration when we develop these incremental changes.

That being said, I think one of the major changes we should make that is orthogonal to the concerns 2.0 addressed is that we need to generalize our syncing to other data structures. In particular, #196 and #198 suggest the need for a more generic SyncedCollection or so. That change would be part of the rewrite proposed here.

I'm not sure that a generic SyncedCollection would help in achieving the general goal of simplifying the code base, but I am happy to discuss that as part of a more concrete design.

Unless I'm very mistaken, I don't think there is any multiple or diamond inheritance. I think that a lot of what makes things confusing is really that the logic for the separation of concerns is not clear, particularly with respect to the buffering and caching logic. In particular, the fact that buffering is implemented at the JSONDict level because that class is used for documents rather than statepoints is not clear without some careful reading of the code base. Additionally, the fact that some of the json file writing logic is in job.py rather than in the dict classes makes it quite difficult to track down bugs or manage synchronization and caching. While some of that separation may be unavoidable, I think it would help us a lot if we can isolate such logic to the dict classes as much as possible, and that's part of what I'd like to achieve here.

I'll post again when I have a class diagram, that's an important first step.

👍

vyasr commented 4 years ago

I've finally put together a rough class diagram that describes how I think this should be structured:

[class diagram: simplified]

A few notes for those who aren't that familiar with UML class diagram syntax:

The core idea here is to separate what I see as two distinct functionalities: 1) synchronization with an underlying file, and 2) recursively keeping a potentially nested collection-like data structure up-to-date. A SyncedCollection encapsulates both of these functionalities, since they are both required, but they can be implemented in separate subtrees in the class hierarchy. Separating these concerns appropriately allows us to ensure that different data types (e.g. lists and dicts) can share synchronization logic for a given backend (e.g. JSON) to whatever extent possible, and different backends can share file-writing logic without worrying about the synchronization. The to_base and from_base functions would be recursive functions that take the place of current methods like _dfs_convert and _dfs_update, which are called when SyncedDicts are modified.

In an ideal world, this would mean that classes at the bottom level of the hierarchy would be completely empty. For example, a SyncedJSONDict would simply inherit from the two parents, which implement all necessary abstract methods and therefore make it a concrete class. Implementing other backends would also be simpler, since adding one class would allow the usage of that backend for all types of collections. In practice certain backends may require special serialization logic for different data types, so we could implement that on an as-needed basis. In general, we should only implement the bare minimum subset to start with, but this structure should make the code more extensible if we want to enable something like an SQL backend (for simple, homogeneous schemas).
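To make the intended decoupling concrete, here is a minimal structural sketch. The class names echo the diagram and the later discussion in this thread, but the exact method signatures are illustrative assumptions, not a finalized API:

from abc import abstractmethod
from collections.abc import Collection


class SyncedCollection(Collection):
    """Abstract base: a Collection that stays synchronized with a backend."""

    @abstractmethod
    def _load(self):
        """Read the data from the backend (a file, a database, ...)."""

    @abstractmethod
    def _save(self):
        """Write the data back to the backend."""

    @abstractmethod
    def to_base(self):
        """Recursively convert to a plain Python object (dict, list, ...)."""


class JSONCollection(SyncedCollection):
    """Backend branch: would implement _load/_save against a JSON file."""


class SyncedDict(SyncedCollection):
    """Data-structure branch: would implement the dict interface and to_base/from_base."""


class SyncedJSONDict(JSONCollection, SyncedDict):
    """Concrete class: ideally needs no code of its own."""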

We should define a clear protocol for what a cache "looks like" (in the sense of ducktyping), and allow users to provide any cache supporting this API. We can of course support some caches like Redis. I took into account the way that the signac-2 prototype was structured, but I prefer keeping the cache as a separate argument rather than having users specify an open function as the constructor argument because I think that will be less clear to users.

Possible changes and important concerns:

  1. This does not really simplify or reduce the hierarchy (in fact it probably makes it more complicated by creating a diamond); it just introduces more structure. We could simplify it, but in my opinion the biggest problem right now isn't the complexity of the infrastructure but rather the organization (and probably the naming). I think the added structure will help clarify things enough to justify the increased number of classes.
  2. We could use the Python UserList and UserDict classes instead of the abc collections. My concern with that is that we would have to be quite careful to avoid relying on any of the existing behaviors. However, this is also true for using something like MutableMapping. We could use that after implementing just the core abstract methods, but then we run the risk of the mixin methods requiring too many nested calls to sync and load, so we might have to override those anyway. Since we'd be overriding most methods, I think the collections.abc classes are a better choice, but that's very much up for debate.
  3. I have not addressed one of my original concerns, the attrdict. I still feel the same way that I originally did, which is that I understand why it's separate but I'm not sure that level of purity is beneficial. I'm open to leaving it as is for now, because I don't think it's central to the core refactoring I'm proposing here. We can always implement it later, either as a mixin or just by adding methods to SyncedDict.

Here's a more complex class diagram including multiple backends (JSON and pickle) and multiple collections (dict and list) to show how that might look. I'm not advocating Pickle as a choice of backend; I just picked it as an obvious example.

[class diagram: extended]

Here's the raw UML for both diagrams so we can modify them easily later.

jsondict.txt expanded_example.txt

vyasr commented 4 years ago

@glotzerlab/signac-committers all comments on this would be welcome. If this is your first time reading the thread, feel free to ask any clarifying questions.

bdice commented 4 years ago

This is great! I think this architecture is clear and well-founded. A few other ideas/asides:

  1. In the bigger picture of signac, this architecture seems to assume that Job objects own their own data storage mechanisms (e.g. a Job owns its statepoint, which is a synced dict). In a SQL or other centralized backend, however, it might make more sense for Jobs to route their storage through a single storage backend that is owned by the Project. I am not sure if changes would be needed to support such a backend.
  2. Getting the "attribute dictionary" functionality through a mixin sounds like a good approach to me.
  3. From some reading, I am convinced that using MutableMapping (the abstract base class) is the correct approach, not UserDict. We want to implement the interface of the respective collections, not override specific methods of dict or list.
vyasr commented 4 years ago
  1. This is a really good point. I'm not sure what the best way to support this is. One major question to think about is, what operation should we conceptualize as atomic? In my opinion, an individual SyncedCollection should be able to write atomically. If that is the case, then while you could use a database back-end, you would still need to perform individual transactions for each job. If we agree with that concept, then I think it would be possible to write a SQLCollection that would fit within this framework with save and load appropriately implemented. If we want to be able to collect multiple writes into a single transaction, then things get a lot more complicated.
  2. Mixin is how it's implemented now, so that would be a simple copy-paste.
  3. I'm glad you agree. I think consensus on this point is important before moving forward with an implementation.
vyasr commented 4 years ago

Had a discussion with @csadorf about this; he's largely on board, with the stipulation that the focus of the refactoring needs to be on enabling new features (better caching, interchangeable backends) and not just on reworking code that already exists. Here are a few of the specific points raised:

csadorf commented 4 years ago

You need the hook because you must be able to react to changes of the collection. If you change the SyncedDict, it must inform the Job class that it has changed. Otherwise you can't propagate that information upwards. So in essence, the SyncedCollection class must have an on_change() callback function.
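A minimal sketch of what such a hook could look like (the names SyncedDictSketch, on_change, and job_hook are illustrative, not an agreed-upon API):

class SyncedDictSketch:
    """Toy dict-like object that notifies its owner whenever it changes."""

    def __init__(self, on_change=None):
        self._on_change = on_change
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value
        if self._on_change is not None:
            self._on_change(self)  # propagate the change notification upward


def job_hook(collection):
    # A Job could react here, e.g. by recomputing its id and moving its directory.
    print("state point changed")


sp = SyncedDictSketch(on_change=job_hook)
sp["a"] = 1  # triggers job_hook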

csadorf commented 4 years ago

In addition, I'd recommend that all classes with file access should have a URI that can be used as a namespace for the global cache. So for example, the URI for a SyncedJSONDict would be a UUID5 with namespace signac:SyncedJSONDict and the file URL as the name.
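For example (a sketch only; deriving the intermediate namespace from uuid.NAMESPACE_URL is an assumption for illustration, not an agreed-upon convention):

import uuid

# Namespace shared by all SyncedJSONDict instances, built from the label suggested above.
SYNCED_JSON_DICT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "signac:SyncedJSONDict")


def cache_uri(filename):
    """Return a stable, backend-namespaced key for the global cache."""
    return uuid.uuid5(SYNCED_JSON_DICT_NAMESPACE, str(filename))


print(cache_uri("/path/to/workspace/abc123/signac_statepoint.json"))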

csadorf commented 4 years ago

See also: #189

vyasr commented 4 years ago
  1. In the bigger picture of signac, this architecture seems to assume that Job objects own their own data storage mechanisms (e.g. a Job owns its statepoint, which is a synced dict). In a SQL or other centralized backend, however, it might make more sense for Jobs to route their storage through a single storage backend that is owned by the Project. I am not sure if changes would be needed to support such a backend.

On further thought, I think the proposed infrastructure does support this, but part of the responsibility would be at the signac level, not the SyncedCollection level. Jobs currently own a SyncedAttrDict that is their statepoint. I think the way to support a centralized storage would be to make Project configurable so that (if desired) it could store a centralized SyncedCollection, perhaps with an SQL backend. The Job would then no longer own its own statepoint, but would have to call through to the Project. This method would require tightening the linkage between Job and Project, but we've talked about wanting to do this for signac 2.0 anyway.

vyasr commented 4 years ago

This list should contain all tasks that need to be completed before this issue can be considered closed. We can modify this list if needed, but please cross things out rather than remove them so that we have a documented list of things we considered but chose not to do.

vyasr commented 3 years ago

There are a couple of points that I'd like some discussion on. @glotzerlab/signac-committers I'd appreciate any feedback you have on these.

Pickle/Copy/Deepcopy semantics

Shallow copy

Shallow copies are straightforward, since all attributes are copied by reference.

Deep copying

Currently, some SyncedCollection types support deepcopying, while others do not. The primary distinction made at present is based on whether the backend supports deepcopying its objects (for instance, a pymongo.collection object). In #364, @vishav1771 attempted to implement deepcopy operations for testing, and to make this work for backends where true deep copies are not possible, he implemented a pseudo-deepcopy operation. This method is internal, so at present its only real use is in testing. The result is that some SyncedCollections can be deepcopied, while others cannot.

I would like to revisit the appropriate copy (and therefore pickling) semantics for SyncedCollection objects. Currently, signac just puts a warning in the JSONDict class indicating that even deepcopying does not result in a true deep copy because it points to the same file (it also actually advises using the internal _as_dict method rather than the call operator, something we should fix in the documentation). The implementation of the _pseudo_deepcopy method for SyncedCollection objects was precipitated by a discussion in which @csadorf pointed out that we should not introduce a flawed deepcopy operation when it is not possible to define a proper deep copy.

By the same logic, I believe that we should disable deep copying for all currently implemented backends, since their synchronization with some external persistent data intrinsically prevents a true deep copy. If we add a backend that involves synchronizing a Python dict with some other dict-like object (see for instance glotzerlab/hoomd-blue#776), then those backends may choose to implement a proper deep copy. I don't think that any of our current backends make sense for this, though. For signac, a deepcopied JSONDict will always point to the same signac_statepoint.json file, so I would prefer that we adhere to the Zen of Python (explicit is better than implicit) and require that any object that owns a SyncedCollection (e.g. a signac.Job) either raise an error when a deep copy is attempted, or override its own __deepcopy__ to define the appropriate semantics.
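A minimal sketch of what disabling deep copies could look like (FileBackedDict is a placeholder name, not the actual class):

import copy


class FileBackedDict:
    """Placeholder for a collection synchronized with a file."""

    def __init__(self, filename):
        self._filename = filename
        self._data = {}

    def __deepcopy__(self, memo):
        raise TypeError(
            "FileBackedDict cannot be deep-copied: any copy would still be "
            "synchronized with the same file. Deep-copy a plain dict of its "
            "contents instead."
        )


d = FileBackedDict("signac_statepoint.json")
try:
    copy.deepcopy(d)
except TypeError as error:
    print(error)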

Pickling

Pickling is something of a middle ground. jd2 = pickle.loads(pickle.dumps(jd)) should return essentially the equivalent of jd2 = JSONDict(jd.filename): a new object where all nested objects are also new (and any nontrivial resources like pymongo.collection instances have been reinitialized), but pointing to the same underlying data, so that all changes made to one object are still reflected in the other via the synchronization mechanism.
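A sketch of those semantics, assuming the object can be fully reconstructed from its filename (the class and attribute names here are illustrative, not the actual implementation):

import pickle


class JSONDictSketch:
    """Toy stand-in for a file-backed dict that pickles by filename."""

    def __init__(self, filename):
        self._filename = filename

    def __reduce__(self):
        # Unpickling re-runs the constructor, so nontrivial resources are
        # reinitialized, but the new object still points at the same file.
        return (JSONDictSketch, (self._filename,))


jd = JSONDictSketch("signac_statepoint.json")
jd2 = pickle.loads(pickle.dumps(jd))
assert jd2 is not jd
assert jd2._filename == jd._filename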

Statepoint cache

In #239 @bdice identified some issues with the statepoint cache in the process of trying to implement lazy loading. In general, the current implementation of the statepoint cache is somewhat fragile since it is only modified by open_job and _get_statepoint, so other modifications like reset_statepoint or remove can result in cache states that might be considered invalid (depending on what assumptions other parts of the code make about the cache). The introduction of the SyncedCollection will also break the current mode of operation of the cache, because the abstraction layer the new infrastructure introduces means that the collection manages its data internally. It therefore doesn't really make sense for some external object to create a new JSONDict and pass it the data, since that breaks the abstraction by providing an independent (and possibly invalid) source of truth for the underlying data. For the same reason, I'd be somewhat uncomfortable with doing something like JSONDict(filename, data), since the only way to validate the input data is to load the file.

I have some ideas on how we could address this issue cleanly. For instance, we could add a CollectionCache object that could optionally be registered to a SyncedCollection. Since I plan to implement lazy loading inside SyncedCollection objects, the first load could then know to look into the cache. Cache consistency could then be built directly into the SyncedCollection at the highest level, within the save/load methods. Of course we'd need to introduce a mechanism for cache lookup, but that is pretty simple.
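A rough sketch of that idea (the CollectionCache name and its lookup/update API are hypothetical, as is the CachedJSONDictSketch class):

import json


class CollectionCache:
    """Hypothetical in-memory cache that a SyncedCollection could register with."""

    def __init__(self):
        self._store = {}

    def lookup(self, key):
        return self._store.get(key)

    def update(self, key, data):
        self._store[key] = data


class CachedJSONDictSketch:
    """Toy file-backed dict whose first (lazy) load consults the cache."""

    def __init__(self, filename, cache=None):
        self._filename = filename
        self._cache = cache
        self._data = None  # lazy: nothing loaded until first access

    def _load(self):
        if self._cache is not None:
            cached = self._cache.lookup(self._filename)
            if cached is not None:
                return cached  # served from the cache, no disk access
        with open(self._filename) as fh:
            data = json.load(fh)
        if self._cache is not None:
            self._cache.update(self._filename, data)  # keep the cache consistent
        return data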

Before attempting to make any changes to this part of the code, though, I'd like to get an idea of what others think about this idea or something similar, or more generally what people think about the statepoint cache and how we want to handle it going forward. Getting rid of it would be an unacceptable performance hit, so we need to find some way to fix issues with cache fidelity while retaining the same basic functionality.

bdice commented 3 years ago

Deep copying

I don't see the issue here. In my understanding, a deep copy simply means that nested containers in the deep copy aren't references to the original container's values, but are separate objects in memory. Unless there's some pattern I'm unaware of that is producing a "singleton pattern per filename", there should be no semantic confusion over what a deep copy should mean. Shallow copy is "create a new parent object and share references to all children," while deep copy is "create a new parent object and copy all children recursively by value." The issue here is not with our implementation of synced collections or Python value/reference semantics, it's that we have to set appropriate expectations for what a deep copy should mean to users as in the warning @vyasr linked above. This is because the in-memory representation of data is fundamentally linked to the synced source. That's the entire purpose of this class, so I think the resulting semantics (specifically that modifying a deep copy also modifies the original object via the synchronization with the "source of truth" on disk) are reasonable.

Suppose some_file.json contains {"a": {"b": "signac rocks"}}. Here are the semantics I expect, in a pseudocode example (note that id in CPython is related to the underlying memory address):

d1 = SyncedDict('some_file.json')
shallow_copy_d1 = copy.copy(d1)
deep_copy_d1 = copy.deepcopy(d1)
assert shallow_copy_d1 == d1
assert deep_copy_d1 == d1
assert id(shallow_copy_d1) != id(d1)
assert id(shallow_copy_d1.a) == id(d1.a)  # Nested data is shared by reference in a shallow copy
assert id(deep_copy_d1.a) != id(d1.a)  # Nested data is NOT shared by reference in a deep copy
d2 = SyncedDict('some_file.json')  # A newly constructed object should have the same semantics as a deep copy
assert d1 == d2
assert id(d1) != id(d2)
assert id(d1.a) != id(d2.a)  # Nested data is NOT shared.

Is there anything more nuanced that I'm missing?

bdice commented 3 years ago

State point cache

In the current implementation of #239, the state point cache in Project._sp_cache is purely a lookup table that maps from id to state point. This can't be invalidated, and it is not used as a way to test if a state point or id is valid on disk. In particular, job.reset_statepoint(data) and job.remove() cannot invalidate the data in Project._sp_cache. The "extra" keys belonging to jobs that no longer exist on disk are perfectly fine, as @csadorf explained in this comment and I re-summarized here. I don't think there is a need to create any higher-level caching infrastructure than what already exists in Project._sp_cache.

vyasr commented 3 years ago

Deep copying

I don't see the issue here. In my understanding, a deep copy simply means that nested containers in the deep copy aren't references to the original container's values, but are separate objects in memory. Unless there's some pattern I'm unaware of that is producing a "singleton pattern per filename", there should be no semantic confusion over what a deep copy should mean. Shallow copy is "create a new parent object and share references to all children," while deep copy is "create a new parent object and copy all children recursively by value." The issue here is not with our implementation of synced collections or Python value/reference semantics, it's that we have to set appropriate expectations for what a deep copy should mean to users as in the warning @vyasr linked above. This is because the in-memory representation of data is fundamentally linked to the synced source. That's the entire purpose of this class, so I think the resulting semantics (specifically that modifying a deep copy also modifies the original object via the synchronization with the "source of truth" on disk) are reasonable.

Suppose some_file.json contains {"a": {"b": "signac rocks"}}. Here are the semantics I expect, in a pseudocode example (note that id in CPython is related to the underlying memory address):

d1 = SyncedDict('some_file.json')
shallow_copy_d1 = copy.copy(d1)
deep_copy_d1 = copy.deepcopy(d1)
assert shallow_copy_d1 == d1
assert deep_copy_d1 == d1
assert id(shallow_copy_d1) != id(d1)
assert id(shallow_copy_d1.a) == id(d1.a)  # Nested data is shared by reference in a shallow copy
assert id(deep_copy_d1.a) != id(d1.a)  # Nested data is NOT shared by reference in a deep copy
d2 = SyncedDict('some_file.json')  # A newly constructed object should have the same semantics as a deep copy
assert d1 == d2
assert id(d1) != id(d2)
assert id(d1.a) != id(d2.a)  # Nested data is NOT shared.

Is there anything more nuanced that I'm missing?

@bdice not really, aside from the minor oddities of copying an object that's already nested: d_a = deepcopy(d1['a']) also results in a copy of the object's parent, even though the object shouldn't really own its parent (i.e. you have d1 == d_a._parent and d1 is not d_a._parent). Depending on the level of nesting and size you can end up accidentally copying something large while trying to copy something small, or other similar issues.

I'm mostly discussing this from the perspective that I expect most users who want to deep copy these objects to just do it, see that it succeeds, and then experience unexpected side effects, rather than read the docstring. On second thought I no longer feel so strongly about this; I'm still certain that this is going to happen, but I'm OK letting it happen since deep copying is a relatively infrequently used feature anyway. We can hope that a user who knows enough to try to deep copy these objects would also recognize the problem fairly quickly, or at least know to read the docstring. Neither of these issues is really a dealbreaker (although these semantics make a deep copy pretty useless); we're just providing more ways for users to shoot themselves in the foot. Trying too hard to prevent that isn't very Pythonic, so it's just a question of where to draw the line. I prefer to play it safer with operations like this; IIRC we originally only recognized this problem after a relatively long attempt at debugging an issue that was caused by an unrecognized reference between a Job and a deep copy of it.

For what it's worth, Zarr.Array supports deepcopy, while h5py.File and Python file handles do not.

vyasr commented 3 years ago

State point cache

In the current implementation of #239, the state point cache in Project._sp_cache is purely a lookup table that maps from id to state point. This can't be invalidated, and it is not used as a way to test if a state point or id is valid on disk. In particular, job.reset_statepoint(data) and job.remove() cannot invalidate the data in Project._sp_cache. The "extra" keys belonging to jobs that no longer exist on disk are perfectly fine, as @csadorf explained in this comment and I re-summarized here. I don't think there is a need to create any higher-level caching infrastructure than what already exists in Project._sp_cache.

Yes, I'm familiar with that discussion and the design decisions involved in the current cache, and I agree that the statepoint cache can contain extra jobs that have since been (re)moved, but my understanding was that these statements were no longer true once you started implementing lazy loading. Have you managed to rewrite your implementation so that you no longer have these problems? Or are you simply working around the performance hits that lazy loading incurs when performing validation by optimizing other parts of your code so that the net result is faster?

The main questions that we'll need to address have more to do with the fact that at present, the statepoint in signac is managed by a SyncedAttrDict, which does not control its own data. What I mean by that is that while the JSONDict (which is used for documents) does its own saving to and loading from disk, the statepoint relies on Job.reset_statepoint to do file writing. As a result, it's possible to initialize it with some data, say from the statepoint cache stored in the project, and just promise it that the data is valid and it won't try and read the file. In fact, by setting _sp_save_hook as the parent for all SyncedCollection instances signac circumvents a lot of machinery that conceptually fits better inside the SyncedAttrDict. The statepoint cache is built around this same set of assumptions with respect to when and how the data is actually loaded from disk.

I think this discussion is probably best to revisit once we get to the point of integrating the SyncedCollection classes into signac. At that point we can see to what extent the existing logic needs to be rewritten to take advantage of the new classes, and whether any new functionality is necessary in the new classes to avoid regressions. The suggestions that I made above are IMO the safest way to implement this, but it may be possible to use something much simpler that's good enough.

bdice commented 3 years ago

Yes, I'm familiar with that discussion and the design decisions involved in the current cache, and I agree that the statepoint cache can contain extra jobs that have since been (re)moved, but my understanding was that these statements were no longer true once you started implementing lazy loading. Have you managed to rewrite your implementation so that you no longer have these problems? Or are you simply working around the performance hits that lazy loading incurs when performing validation by optimizing other parts of your code so that the net result is faster?

In both lazy loading and the previous implementation, the project would check its own _sp_cache when attempting to open a job by id, and additionally provide the state point information (thereby avoiding disk access). The only new feature in lazy loading is that jobs won't automatically load their state point data when initialized, instead waiting until the user accesses the job.statepoint property or calls job.init(). There is also a delayed registration with the owning project's _sp_cache that occurs when lazy-loading and validating the state point from disk. For a simple iteration like [job.sp() for job in project], the performance of both lazy loading and the previous implementation should be identical (the same amount of work has to be done). (Note: initially there was a performance regression in #239, but I fixed that in #451, and the speed is now slightly faster than before because I also added other unrelated optimizations in #451.)
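A conceptual sketch of the lazy-loading behavior described above (an illustration under stated assumptions, not the actual #239 implementation; the class name is hypothetical):

import json


class LazyStatepoint:
    """Defer reading the state point file until the data is first requested."""

    def __init__(self, filename, cached=None):
        self._filename = filename
        self._data = cached  # may be pre-seeded, e.g. from a project-level cache

    @property
    def data(self):
        if self._data is None:
            # First access triggers the disk read.
            with open(self._filename) as fh:
                self._data = json.load(fh)
        return self._data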

The main questions that we'll need to address have more to do with the fact that at present, the statepoint in signac is managed by a SyncedAttrDict, which does not control its own data. What I mean by that is that while the JSONDict (which is used for documents) does its own saving to and loading from disk, the statepoint relies on Job.reset_statepoint to do file writing. As a result, it's possible to initialize it with some data, say from the statepoint cache stored in the project, and just promise it that the data is valid and it won't try and read the file. In fact, by setting _sp_save_hook as the parent for all SyncedCollection instances signac circumvents a lot of machinery that conceptually fits better inside the SyncedAttrDict. The statepoint cache is built around this same set of assumptions with respect to when and how the data is actually loaded from disk.

I think this discussion is probably best to revisit once we get to the point of integrating the SyncedCollection classes into signac. At that point we can see to what extent the existing logic needs to be rewritten to take advantage of the new classes, and whether any new functionality is necessary in the new classes to avoid regressions. The suggestions that I made above are IMO the safest way to implement this, but it may be possible to use something much simpler that's good enough.

I agree that we should re-visit this when finalizing the integration with the Job and Project classes. The state point just needs two things that are currently somewhat hacked-in:

cbkerr commented 3 years ago

Along the way to reviewing #453, I’m reading through more conceptual discussions to try to understand SyncedCollection. I like @csadorf’s idea that refactoring should be in the service of new features for users, and I’m trying to understand the benefits of these changes to them.

I see in signac/core/synced_collections/__init__.py the note that all of this is transparent to users, so that’s probably why I’m having trouble seeing how these changes fit in.

# __init__.py
Define a framework for synchronized objects implementing the Collection interface.

Synchronization of standard Python data structures with a persistent data store is important for a number of applications. While tools like `h5py` and `zarr` offer dict-like interfaces to underlying files, these APIs serve to provide a familiar wrapper around access patterns specific to these backends. Moreover, these formats are primarily geared towards the provision of high-performance storage for large array-like data. Storage of simpler data types, while possible, is generally more difficult and requires additional work from the user.

Synced collections fills this gap, introducing a new abstract base class that extends `collections.abc.Collection` to add transparent synchronization protocols. The package implements its own versions of standard data structures like dicts and lists, and it offers support for storing these data structures into various data formats. The synchronization mechanism is completely transparent to the user; for example, a `JSONDict` initialized pointing to a particular file can be modified like a normal dict, and all changes will be automatically persisted to a JSON file.

Clarifying my understanding: “synchronization” basically means “keeping the manifest files on disk up to date”?

Is the naming of SyncedCollection meant to imply a relation to signac.Collection?

I need help understanding why users need SyncedCollection. I couldn’t find the genesis of the discussion on it; several issues talk about “improving” (#336, #454) without further detail that I can find. I’m not clear on what the improvements discussed here will bring to my daily use of signac (and by extension, to users who are not developers). @vyasr mentions better caching and interchangeable backends as benefits. Are the user-facing improvements in performance, or something more? Could you expand on this? The answer may be that I would be equally confused about the current implementations involving _SyncedDict, SyncedAttrDict and JSONDict and that SyncedCollection would make all of this less confusing.

From what I understand of the discussion between @vyasr and @bdice (especially the talk of users “shooting themselves in the foot”), I also support removing deep copying as it currently exists, partly because I don’t know how to answer the question: what would deep copying bring to users who are not developers?

I completely agree that some of this is addressed by the signac 2.0 prototype.

I initially didn’t know where to look but I found it: https://github.com/glotzerlab/signac-2-prototype

vyasr commented 3 years ago

@cbkerr take a deep breath, because there's a lot to unpack here :) "The answer may be that I would be equally confused about the current implementations involving _SyncedDict, SyncedAttrDict and JSONDict and that SyncedCollection would make all of this less confusing." I think this sentence is probably pretty representative of how most signac users and developers (including most of the committer team) feel about these classes right now, so let's orient ourselves by starting with what's currently in signac. Hopefully this explanation will prove useful to other @glotzerlab/signac-committers as well when it comes time to integrate the changes proposed by this issue.

Current behavior of signac

The job statepoint and document are dictionary-like objects that keep a file up-to-date when they are modified. This process of constant synchronization is where the name _SyncedDict comes from in signac. Since it's convenient to be able to do job.sp.foo instead of job.sp['foo'], we implemented the SyncedAttrDict class, which is a very small class that inherits from _SyncedDict and just adds one feature: attribute based access to elements of a _SyncedDict. The statepoint is an instance of SyncedAttrDict.

However, the requirements for the statepoint and the document are slightly different. The statepoint is not supposed to change too much once it is created, so it doesn't need to be super fast, but it does need to be able to "tell" signac to move the job directory in order to maintain the invariant that hash(job.sp) == os.path.basename(job.workspace()) == job.id. On the other hand, the job document could change a lot, either because of the user making lots of changes or because of something like flow using the document for status updates, so it needs to be faster. Conversely, it doesn't require any communication with signac.Job when it changes.

To understand how these distinctions are handled, let's look at the _SyncedDict class. All of the methods that implement dictionary-like behavior (e.g. __getitem__, __setitem__, and so forth) are basically implemented using the same pattern:

def op(self, ...):
    self._synced_load()   # refresh self._data from the synchronized source
    self._data.op(...)    # perform the actual operation on the in-memory dict
    self._synced_save()   # push self._data back out (only needed if the data changed)
    return ...

Hopefully it's clear from this what's happening: the data of the object is stored in an internal dictionary _data, and every operation proceeds by loading the file from disk into that _data dict, doing whatever operation is requested, then saving the dict back to disk (if necessary; saving is only required if you change the data). If you follow down the chain of calls, you'll see that _synced_load and _synced_save trivially call through to load and save, respectively. These two methods perform a traversal up a hierarchy of "parents", eventually calling the _load and _save methods. The goal of this traversal is simple: if a _SyncedDict is nested in another one (and so on...), only the top-level one actually has all the contents, so it's the only one that actually loads from or saves to the file. Therefore, the _load and _save methods are what are really responsible for synchronizing with a file.
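In simplified form, the parent traversal looks roughly like this (a sketch of the pattern, not the exact signac source; the class name is illustrative):

class SyncedDictNode:
    """Node in a nested synced dict; only the root talks to the backend."""

    def __init__(self, parent=None):
        self._parent = parent
        self._data = {}

    def load(self):
        if self._parent is not None:
            self._parent.load()  # delegate upward until we reach the root
        else:
            self._load()  # root: actually read from the backend

    def save(self):
        if self._parent is not None:
            self._parent.save()
        else:
            self._save()

    def _load(self):
        pass  # left unimplemented here; supplied by subclasses or the owner

    def _save(self):
        pass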

Now here's the catch: if you actually look at _SyncedDict, you'll notice that the class doesn't actually implement its _load and _save methods! So how does this class ever actually save anything? The key is to look at how the statepoint is created in a signac.Job: it is always initialized as a SyncedAttrDict with a parent, where the parent is this innocent-looking _sp_save_hook class in signac.Job. If you look at the hook, it basically says that whenever the statepoint is reset, the job should reset its statepoint. As it turns out, statepoints never actually read from or write to disk: they just ask the job to do it for them.

Now, since documents have different requirements (potentially lots of I/O, but no need to move the job), they need to behave differently. To make job documents faster, we enable buffering: if you ask them to, they'll hold off writing files to disk until after you tell them you've made all the changes that you intend to make. This allows us to make many changes and then tell the document to write them all at once, rather than writing after every small change. This logic is implemented in the JSONDict, which supports buffering of I/O and handles its own saving and loading. The JSONDict is a SyncedAttrDict, but it implements the _save and _load methods in order to control its own I/O. It also implements a buffered mode (controlled by the function buffer_reads_writes) in which all JSONDict instances write to an in-memory cache in anticipation of eventually writing all the data to disk in one fell swoop.
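Conceptually, the buffered mode amounts to something like this (a sketch only, not the actual signac buffering API; all names here are illustrative):

import json

_BUFFER = {}
_BUFFERING = False


def set_buffered(enabled):
    """Turn the in-memory buffering mode on or off."""
    global _BUFFERING
    _BUFFERING = enabled


def write_json(filename, data):
    """Write immediately, unless buffering is active."""
    if _BUFFERING:
        _BUFFER[filename] = data  # defer the disk write
    else:
        with open(filename, "w") as fh:
            json.dump(data, fh)


def flush_buffer():
    """Write all deferred changes to disk in one pass."""
    for filename, data in _BUFFER.items():
        with open(filename, "w") as fh:
            json.dump(data, fh)
    _BUFFER.clear()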

You mentioned these three classes in your comment, but there's one that you left out: the _SyncedList class, which is quietly hiding in signac/core/synceddict.py. This class is exactly what you might expect: a synchronized version of a list. However, that's not quite true, because in signac we have no use for a truly synchronized list: we just need the ability to put lists into _SyncedDict objects and have the modifications propagate upward using the parent logic. For example, job.sp.a.b[10] = 20 needs to change the underlying file. The _SyncedList is a minimal class that makes this possible. It is a normal list (it inherits from list), but it modifies a few key methods to enable the appropriate syncing behavior.

Background on and requirements for Collections

If all of this seems confusing, that's because it is. Most of these features were added on an as-needed basis, so things just grew organically. However, over time we started identifying lots of issues. Many of them are smaller bugs that are linked somewhere in this thread. The different places that implement different parts of the logic are hard to keep track of, making it very likely for bugs to be fixed in some places and not others; for example, problems with synced lists need to be addressed in very different ways from synced dicts. We also identified limitations that make optimizations difficult, for instance the lazy loading that @bdice has been working to implement. Scalability beyond a certain point is also simply infeasible using a purely file-based approach, so we'd like to be able to use a true database backend, but that's extremely difficult to fit into the existing data model of signac.

There is a Python module collections.abc that encodes various Abstract Base Classes: standard "types" of objects that behave a certain way, irrespective of specifics. For example, lists and tuples are both of type Sequence, because they are ordered sets of things that you can iterate over. However, lists are MutableSequences, whereas tuples are not: x = [1, 2, 3]; x[0] = 1 is valid, while x = (1, 2, 3); x[0] = 1 will raise an Exception because you can't modify tuples. Meanwhile, a dict is a MutableMapping, where a mapping is basically something that behaves like a dict. All of these are examples of a collections.abc.Collection, which is basically just any object that contains other stuff (you can see the exact definitions on the page that I linked).
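For reference, a class only has to supply a handful of abstract methods to count as a MutableMapping; collections.abc then fills in the rest (get, pop, update, ...) as mixin methods. A minimal, self-contained example:

from collections.abc import MutableMapping


class MinimalDict(MutableMapping):
    """Smallest possible MutableMapping: five abstract methods, nothing more."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


m = MinimalDict()
m.update(a=1, b=2)  # update() comes for free from the MutableMapping mixins
assert m.get("a") == 1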

The proposal in this issue is to try to reduce the duplication and create a clearer separation of responsibilities between different classes, so that we avoid repeating ourselves (which makes us more bug-prone) while also creating easier entry points for changing specific behaviors without needing to rewrite the entire functionality from scratch. The basic idea is that what we are trying to implement are various types of collections, all of which share the property of being synchronized with some other resource. Historically this has been a JSON file, but there's no reason that it can't be something else like a SQL database. The abcs described above basically just define specific abstract methods: methods that subclasses have to implement in order to expose necessary functionality. For instance, any Mapping has keys and values that can be accessed via data[key]. This issue defines a new SyncedCollection class that is a collections.abc.Collection that is always synchronized with something else.

The conflict with MongoDB collections is an unfortunate coincidence, but under the circumstances both names are appropriate and make more sense than any alternative that avoids using the word Collection.

How SyncedCollections work

Keeping any collection-like object in sync with something else adds two main twists to the standard behavior of collection-like objects. First, we need a way to actually read and write the other data, e.g. by reading a file. Second, we need a way to update the in-memory object based on what's in the file. This second aspect is more complicated than you might think for performance reasons. Recall that we have this complex parent-child relationship set up to make sure that the top-level dictionary in a nested set of them is what syncs to a file. That also means that every time we read from or write to the file, we have to traverse our data structure to keep these linkages intact. Additionally, it's very expensive to just recreate the whole object every single time anything happens, so we need a faster way to update it in place.

The core idea with SyncedCollections is that these two problems are orthogonal: the way that you read and write depends on the resource you're synchronizing with (a file, a database, another in-memory object, etc.), while the way that you perform in-place updates depends on the type of the data (a dict, a list, etc.). Therefore, we should in principle be able to decouple these in a way that makes it possible to mix and match freely. If you go to the feature/synced_collections branch and navigate to signac/core/synced_collections, you'll see how this works in practice. The SyncedCollection class defines a handful of abstract methods that children have to implement. Then, while one set of subclasses implements data structures (synced dicts, synced lists, etc.), another set of subclasses implements different backends (JSON file, MongoDB database, etc.). Each of these is still abstract, because they only have half the picture. However, they can be freely combined to make fully-fledged classes using inheritance! For example:

class JSONDict(JSONCollection, SyncedAttrDict):
    pass

is effectively enough to define a new class (there's a tiny bit of extra logic I'm hiding that makes constructors play nice with multiple inheritance, but it's trivial code). Replacing JSONCollection with MongoDBCollection gives you a dict that synchronizes with a MongoDB database, and so on. Moreover, it turns out to also be possible to implement different types of buffering behavior that is similarly interchangeable, although I won't go into that in this post since it's already a novel.

Summary

Hopefully this super long explanation helps you understand both where we've been and what we're trying to achieve. The classes currently in signac for statepoints and documents have had various features tacked onto them over time to improve performance and functionality, but they are built around JSON and, in the case of the statepoint, are also deeply intertwined with the functionality of other core signac classes. This structure made it pretty difficult to keep up with complex bugs that could manifest differently in statepoints and job documents, or bugs in lists that didn't occur in dicts, and it also made it very difficult to improve performance, either using different caching methods or by employing something more performant than JSON files as a backend. The new classes aim to solve this by separating these different components into different classes that can be easily mixed and matched to suit our needs in different scenarios, hopefully while also making it easier to maintain by making it obvious where to look when something breaks.

bdice commented 3 years ago

@vyasr Now that #472 is nearly done, it would be helpful if you could open a tracking PR for feature/synced_collections that documents what's left to do and is linked to this issue and all other related open issues. Even if it just remains a draft PR for the near future, it would help with project management and wrapping this up.

vyasr commented 3 years ago

@bdice I merged #472. There is only one more PR I plan to make for this, which will reorganize the package structure, add some docs, and do some misc cleanup. After that I'll open the PR you requested, since at that point I don't think it will even need to be a draft.

vyasr commented 3 years ago

@bdice PR is open at #484 as requested.