
Incremental synchronization of elements to files #66

Open joshsh opened 1 year ago

joshsh commented 1 year ago

There are currently several ways to serialize SmSn graphs to the file system, but the most important for everyday use is the so-called VCS serialization: every atom corresponds to a file, in a directory corresponding to the logical data source associated with the atom. Because every representation in SmSn is a set of atoms, this results in a huge number of files.

A bigger problem, however, is that the entire graph must be synced with the file system at once. This takes significant time (several minutes) and requires the user to stop what they are doing and consciously attend to the synchronization process. It is a major barrier to adoption, especially compared with solutions like Org-mode, which sync to the file system directly. Using Neo4j as the source of truth for SmSn data between sync operations is also something of a liability: I have personally experienced major data loss and corruption when Neo4j silently failed for one reason or another and too much time had elapsed since the last sync.
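For concreteness, here is a minimal sketch of the per-atom layout described above. The names (`AtomFileLayout`, `fileFor`, the `.smsn` extension) are illustrative only; SmSn's actual serialization code differs in its details.

```java
import java.nio.file.Path;

// Illustrative sketch of the VCS serialization layout: one file per
// atom, grouped under the directory of the atom's logical data source.
final class AtomFileLayout {
    private final Path root;

    AtomFileLayout(Path root) {
        this.root = root;
    }

    // e.g. <root>/<dataSource>/<atomId>.smsn
    Path fileFor(String dataSource, String atomId) {
        return root.resolve(dataSource).resolve(atomId + ".smsn");
    }
}
```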

Fixing this problem should actually result in a much simpler solution than the current one. Going forward, there will be a configurable source of truth (SoT), which could be Neo4j or another TinkerPop-enabled graph DB, but which could also be another data store such as the file system. The latter will be the default; the former might be added back later (SmSn is not up to date with recent versions of Neo4j). No bulk sync operations will be necessary when the file system is the SoT, and the user will be free to place the data directory under version control using a solution of their choice. Much as we do now, we will provide a starter kit using Git as the version control solution.
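A rough sketch of what a file-system SoT could look like, assuming hypothetical names (`SourceOfTruth`, `FileSystemSoT`); the eventual design may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pluggable source-of-truth abstraction.
interface SourceOfTruth {
    String read(String atomId) throws IOException;
    void write(String atomId, String content) throws IOException;
}

// File-system implementation: every write goes straight to disk, so
// there is no bulk sync step, and the data directory can sit under
// version control (e.g. Git) managed by the user.
final class FileSystemSoT implements SourceOfTruth {
    private final Path dataDir;

    FileSystemSoT(Path dataDir) {
        this.dataDir = dataDir;
    }

    public String read(String atomId) throws IOException {
        return Files.readString(fileFor(atomId));
    }

    public void write(String atomId, String content) throws IOException {
        Files.createDirectories(dataDir);
        Files.writeString(fileFor(atomId), content);
    }

    private Path fileFor(String atomId) {
        return dataDir.resolve(atomId + ".smsn");
    }
}
```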

cc @jmatsushita

joshsh commented 1 year ago

Note: as SmSn graphs are typically small enough to fit in memory, an in-memory cache of the user's graph will be available on the server. When elements of the graph are updated, the update will be placed in a queue to be written to the file system, but subsequent reads will not block on the file-system update (see the sketch below). The usual concurrency and consistency considerations apply. A bulk refresh-from-disk operation will still need to be supported, but it can be more efficient than the current bulk read if we store a timestamp with each element (and read an element's file only if its timestamp has changed).
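A rough sketch of the write-behind cache and the timestamp-based refresh, with hypothetical names (`ElementCache`, `needsReload`); the hard parts (update ordering, coalescing duplicate writes, crash recovery) are deliberately left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

final class ElementCache {
    private final Map<String, String> elements = new ConcurrentHashMap<>();
    private final Map<String, FileTime> lastSeen = new ConcurrentHashMap<>();
    private final BlockingQueue<String> dirty = new LinkedBlockingQueue<>();

    // Reads are served from memory and never block on disk I/O.
    String get(String id) {
        return elements.get(id);
    }

    // Writes update the cache and enqueue the id for a background
    // worker to persist; the caller returns immediately.
    void put(String id, String value) {
        elements.put(id, value);
        dirty.add(id);
    }

    // A daemon thread drains the queue, writing each dirty element
    // to the file system (the actual file write is elided here).
    void startFlushWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String id = dirty.take();
                    // elements.get(id) would be written to its file here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // For bulk refresh-from-disk: re-read an element's file only if
    // its last-modified time is newer than the timestamp we cached.
    boolean needsReload(String id, Path file) throws IOException {
        FileTime mtime = Files.getLastModifiedTime(file);
        FileTime seen = lastSeen.put(id, mtime);
        return seen == null || mtime.compareTo(seen) > 0;
    }
}
```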