hypothesis / h

Annotate with anyone, anywhere.
https://hypothes.is/
BSD 2-Clause "Simplified" License
2.94k stars 427 forks source link

Very poor performance anchoring text quotes #2895

Closed robertknight closed 7 years ago

robertknight commented 8 years ago

When viewing this ClimateFeedback page, my browser ends up becoming minimally responsive for nearly 40s during attempts to anchor quotes: http://www.wsj.com/articles/the-climate-snow-job-1453664732 . Most of the time is spent in diff-match-patch functions in the toPositionAnchor() implementation:

firefox-dev-tools-profiler

On this page most of the text is hidden because I'm not signed in, so most of the quotes will fail to anchor.

Archive of web page content as I see it plus the full Firefox profiler JSON output: https://www.dropbox.com/s/6jqfrm666cfvt4c/wsj-page-content-plus-Firefox-dev-tools-profile.tar.gz?dl=0

robertknight commented 8 years ago

For reference, the documentation explaining what match_main does is at https://hypothes.is/a/AVKCMpsevTW_3w8LyjD2 .

tilgovi commented 8 years ago

Let me know if you want to discuss any time. I'm very familiar with the sources of slowness in diff-match-patch, potential mitigations, and some ideas for replacements.

robertknight commented 8 years ago

@tilgovi - Yes, I'd appreciate your thoughts. I haven't researched the problem and current solution in much depth yet. In addition to the performance issues, diff-match-patch is a large, non-modular library and we're currently only using a very small part of it.

chdorner commented 8 years ago

This seems to be much better now: 2016-08-05-171307_937x501_scrot

Is it worth to still keep this open?

tilgovi commented 8 years ago

Up to you all, but just so I don't forget I'll write this down briefly:

The diff-match-patch bitap implementation performs a synchronous pass over the whole text and returns the best match for a single search pattern. A better algorithm could increase performance dramatically for cases where several annotations fail to match using a simpler method.

There are several performance enhancements I've considered were I to re-implement it:

The first and last of those may be sort of mutually exclusive, unfortunately.

Another change that would almost certainly help Hypothesis would be if the algorithm treated distance and similarity independently. The way the Google implementation works it can be hard to find a sweet spot that tolerates big moves without tolerating big content changes.

nickstenning commented 7 years ago

I'm going to close this issue, not because all our performance issues are "fixed" -- they're very definitely not -- but because things are empirically "good enough" in most circumstances that this issue (despite being labelled P2) isn't actually resulting in anything happening.