EmergentBehavior opened this issue 6 years ago (status: Open)
For your first paragraph, yes, editscript is designed to do just that. `(get-edits e)`
returns a vector. These vectors can be concatenated to represent a larger change. BTW, I added a `combine` function.
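A rough sketch of what that looks like (this assumes `diff`, `patch`, `get-edits`, `combine`, and an `edits->script` helper to turn a raw edit vector back into a script are all exposed from `editscript.core`; exact namespaces may differ by version):

```clojure
(require '[editscript.core :as e])

(def a {:x 1 :y [1 2 3]})
(def b {:x 2 :y [1 2 3]})
(def c {:x 2 :y [1 2 3 4]})

(def d1 (e/diff a b))            ; a -> b
(def d2 (e/diff b c))            ; b -> c

(e/get-edits d1)                 ; a plain vector of edits, e.g. [[[:x] :r 2]]

;; concatenate the raw edit vectors and wrap them back into a script,
;; or combine the scripts themselves
(def d-all  (e/edits->script (into (e/get-edits d1) (e/get-edits d2))))
(def d-all' (e/combine d1 d2))

(= c (e/patch a d-all) (e/patch a d-all'))  ; => true
```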
For the second, it is a very interesting question. I have not encountered cases where the patching process takes too long. When such cases do appear, I will think about an optimizer.
On the other hand, editscript is designed with stream processing in mind. An editscript should be conceptualized as a chunk in a potentially endless stream of changes. So it is more meaningful to worry about data integrity, compression, windowing, etc., rather than the sizes of individual editscripts. Optimizers in these contexts are indeed what I am very interested in.
Basically, I consider editscript a part of the data-oriented effort of Clojure, which tries to elevate the level of abstraction of data from the level of characters or bytes to that of maps, sets, vectors, and lists. So instead of talking about byte streams, we can talk about change streams in terms of these data structures.
Do I make sense?
I haven't had a chance to try editscript yet, but I think it will play nicely with Specter. It seems to me they have a similar view of the data.
@huahaiy Thanks for the answer. My latter paragraph was describing a scenario in event streaming where I rebuild the "present" version of an entity by composing all historical mutations over its entire history of existence (if checkpointing or other strategies weren't used).
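Concretely, something like this sketch is what I have in mind (the names are made up; it just assumes `editscript.core/patch` applies one editscript at a time):

```clojure
(require '[editscript.core :as e])

(defn present-state
  "Rebuild the latest version of an entity from its initial snapshot and
  the ordered history of per-step editscripts, applying them oldest first."
  [initial-snapshot history]
  (reduce e/patch initial-snapshot history))

;; e.g. (present-state A-t0 [e-0->1 e-1->2 e-2->3]) should yield A-t3
```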
@EmergentBehavior Your scenario sounds similar to mine.
Given an editscript, there are indeed some opportunities to optimize, e.g. if a sub-tree will later be deleted, all edits that happened inside that sub-tree could be safely removed without affecting the end result.
Such optimization may require the editscript to record some kind of identifiers for internal nodes. I will think about these.
Meanwhile, my current focus is to further improve the diffing speed. I am working on fingerprinting the data to avoid drilling down into sub-trees that have the same content.
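As a rough illustration of the fingerprinting idea (not editscript's actual code; it just uses Clojure's built-in `hash` as the fingerprint):

```clojure
(defn changed-keys
  "Keys of two maps whose sub-trees appear to differ, judged by comparing
  cheap per-sub-tree fingerprints instead of walking the whole sub-trees.
  A real implementation would also guard against hash collisions, since
  equal hashes do not strictly guarantee equal content."
  [a b]
  (->> (into (set (keys a)) (keys b))
       (remove #(= (hash (get a %)) (hash (get b %))))))

;; (changed-keys {:x 1 :y [1 2 3]} {:x 2 :y [1 2 3]}) => (:x)
```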
Implementing some obvious optimizations should be a good starting point.
First, I think this library is pretty interesting. I was wondering about one use case though: let's say you have entity `A_t0` (where `t` is analogous to a time step) and you have an editscript `e_0->1` to describe the transformation needed to get `A_t0` to `A_t1`. If you capture an editscript for the transformation at each time step (if there is a change), you'd have a collection of `e`, right? Then if you want to get the present state of `A` you could just concatenate all those editscripts together (to describe the changes between `t0` and `tN`). Have you tried this use case? I wonder if at some point, though, if the editscript gets large enough, the patching process would slow down and it would be helpful to have some sort of editscript optimizer to reduce to the minimal editscript needed to get from `A_t0` to `A_tN`.