Closed robertknight closed 7 years ago
For reference, the documentation explaining what match_main
does is at https://hypothes.is/a/AVKCMpsevTW_3w8LyjD2 .
Let me know if you want to discuss any time. I'm very familiar with the sources of slowness in diff-match-patch, potential mitigations, and some ideas for replacements.
@tilgovi - Yes, I'd appreciate your thoughts. I haven't researched the problem and current solution in much depth yet. In addition to the performance issues, diff-match-patch is a large, non-modular library and we're currently only using a very small part of it.
This seems to be much better now:
Is it worth to still keep this open?
Up to you all, but just so I don't forget I'll write this down briefly:
The diff-match-patch bitap implementation performs a synchronous pass over the whole text and returns the best match for a single search pattern. A better algorithm could increase performance dramatically for cases where several annotations fail to match using a simpler method.
There are several performance enhancements I've considered were I to re-implement it:
The first and last of those may be sort of mutually exclusive, unfortunately.
Another change that would almost certainly help Hypothesis would be if the algorithm treated distance and similarity independently. The way the Google implementation works it can be hard to find a sweet spot that tolerates big moves without tolerating big content changes.
I'm going to close this issue, not because all our performance issues are "fixed" -- they're very definitely not -- but because things are empirically "good enough" in most circumstances that this issue (despite being labelled P2) isn't actually resulting in anything happening.
When viewing this ClimateFeedback page, my browser ends up becoming minimally responsive for nearly 40s during attempts to anchor quotes: http://www.wsj.com/articles/the-climate-snow-job-1453664732 . Most of the time is spent in diff-match-patch functions in the
toPositionAnchor()
implementation:On this page most of the text is hidden because I'm not signed in, so most of the quotes will fail to anchor.
Archive of web page content as I see it plus the full Firefox profiler JSON output: https://www.dropbox.com/s/6jqfrm666cfvt4c/wsj-page-content-plus-Firefox-dev-tools-profile.tar.gz?dl=0