juji-io / editscript

A library to diff and patch Clojure/ClojureScript data structures
Eclipse Public License 1.0
472 stars 23 forks source link

Long runtime for the quick algorithm on some large data structures #29

Closed sigvesn closed 2 years ago

sigvesn commented 2 years ago

At microdata.no we use editscript with the :quick diff algorithm to generate diff's between potentially large edn structures.

This usually works very well, but i have encountered a special case where the editscript/diff with the :quick algorithm is very slow.

It is hard to pinpoint exactly what it is in the data that causes this, as the compared files are about 2.4MB each, but here is a repo with two example files (which also contains code to run the diff):

https://github.com/sigvesn/editscript-quickdiff-example

diff using either the "non-quick" algorithm (or removing the :active field from the maps in old) runs in ~0.2-1 second

Using :quick true results in a runtime of ~7.5 minutes

huahaiy commented 2 years ago

Interesting, I will take a look.

sigvesn commented 2 years ago

Can the problem be connected to the fact that editscript seems to not be able to generate a diff at all, but instead returns the whole file as a replace action?

This is seen in the example output file from the diff generated by https://github.com/sigvesn/editscript-quickdiff-example/blob/main/src/main.clj#L19 at https://github.com/sigvesn/editscript-quickdiff-example/blob/main/out/diff.json

I compared this against a diff created by a different tool: https://www.npmjs.com/package/small-json-diff created for the same files: https://github.com/sigvesn/editscript-quickdiff-example/blob/main/out/diff_small_json.json . This diff shows the expected changes as i described above

huahaiy commented 2 years ago

I can tell you what I know so far:

A piece of advise: unless strictly necessary, I advise against a list or a vector of maps as a data structure. The reason is that a list or a vector implies order, but order is an expensive thing to keep when diffing. If the use case does not call for order, why not use a set? EDN introduces sets for good reason.

huahaiy commented 2 years ago

There's not much I can do at this point. The quick algorithm compare things blindly so it attempted to compare two huge strings, which is going to cost a lot of time. My advise is to use the default and do not compare strings in your case. Because you have some huge strings. Since your use case is to sync client/server state, I really don't see why you would like to compare huge strings like that. It costs a lot more to compare strings than just sending the whole thing over the network.

The A* algorithm did the right thing. If you insist on getting what you think you should be getting, you will have to use something else. Editscript is not designed for such huge changes, we intend it to extract small changes that are much smaller than the original data. In the example you show, the changes will be bigger than the original data using our diff format. In such case, overwriting the data is the right thing to do.

Closing.

huahaiy commented 2 years ago

On a second thought, I can introduce a time-out for vector diff, so we can skip the computation if time is out.

Another option is to diff strings by lines instead of by characters. That should speed up things significantly.

These changes will be introduced in 0.6.0 release.

sigvesn commented 2 years ago

Not using string diffing seems like a good solution for us. A time-out would also be welcome.

Thank you very much for the help and advice!

huahaiy commented 2 years ago

0.6.1 now allows controlling string diffing, and added a timeout option.

Using :str-diff :line option, running quick diff algorithm on your data took only 570ms on my machine. Using :str-diff :word instead, some string diff will time out, but the total time took 4.8s on my machine.

sigvesn commented 2 years ago

The new :str-diff options and timeout seems to be working great. Some small notices:

huahaiy commented 2 years ago

Thank you for testing. These issues are addressed in 0.6.2.