Quote anchoring blocks execution for a long time

sneakers-the-rat commented 2 years ago

Summary of investigation (2021-11-22) - fuzzy quote anchoring can be very inefficient in long documents for short, generic quotes. In this case the quotes include single punctuation marks and whitespace characters. See https://github.com/hypothesis/client/issues/3919#issuecomment-974871178 - @robertknight

Steps to reproduce

this is on a semi-private page that I can share if it would help, but i think this would happen on any page that has a) a lot of text, b) a lot of annotations, and c) the text has been edited so the anchors don't have exact matches.

Expected behaviour

hypothes.is should load asynchronously and not block page execution while it is loading

Actual behaviour

The page will partially load and the exact matches will be drawn, but then every other annotation without an exact match for an anchor will be resolved serially while blocking page execution. This means that the page will render the first screen, but then be completely unresponsive for >10 seconds.

20% of the time is network requests, and 60% seems to be the function to resolve imperfect matches.

The problem seems to be here: https://github.com/hypothesis/client/blob/63545fd3c056740c393b4aa14d27cbf0bd8ec12d/src/annotator/anchoring/html.js#L83

I don't know much about promises/async, and I haven't dug through the code to say for sure, but it looks like the problem might be what look like nested promises, and it also doesn't look like the async querySelector is being awaited? In any case something is not working with the async implementation there and i unfortunately have to disable it on my page until i find a workaround :(

Browser/system information

Hypothesis version 1.911.0 Browser: Firefox 94.0.1 and Chrome 95.0.4638.69 OS: macOS 11.6

Thanks in advance, love y'all

robertknight commented 2 years ago

If you can share the text of the page and the annotations (or just the target field from them) that would be very helpful. The function names have been lost in the trace above, but by feeding the Hypothesis v1.911.0 annotator.bundle.js source through a code formatter I can recognize the De function as the findMatchEnds function from the approx-string-match library. This is indeed involved in resolving imperfect matches.

We have tested this with pages that have a lot of annotations and a lot of edits (eg. http://www.americanyawp.com/text/01-the-new-world/), so we'd need to dive into the specifics to find out exactly what is happening.

sneakers-the-rat commented 2 years ago

well, i guess it's getting close enough to public that I might as well share the page: https://jon-e.net/infrastructure/versions/a9e9970/

(definitely don't doubt the depth of testing)

edit: updated with permalink

robertknight commented 2 years ago

Can you confirm group were you looking at when you recorded the trace above? I'm guessing it is the Public group but I want to make sure.

In the Public group, where there are 76 annotations (of which 3 fail to anchor), I see the following result on my system when starting the JavaScript Profiler in Chrome 97 just after refreshing that page and stopping once the indicator showing the location/number of annotations on the page appears and the count stabilizes (at ~72):

The subjective user experience I see fairly closely matches this in Chrome. Shortly after the page loads the main thread is blocked for a couple of seconds.

In Firefox 94 the duration of the lag is much longer:

In Safari 15.1 the duration I see is similar to Chrome:

The overall number of annotations here is not that large, though the text is quite long (~271K chars, ~41K words).

sneakers-the-rat commented 2 years ago

Yes, looking at the public group! Thanks for taking the time to take a look. Supporting Firefox will be important because a number of the people that will be interested in this document will be very much "free internet" people. I'm happy to do a workaround if this is an edge case, but i tried to wrap the hypothes.is load in a separate async method and couldn't quite get it to work (bad at js).

fwiw there are a number of "false positive" anchors that just pick some matching text since the original text has been edited/removed -- which is completely understandable of course since anchoring in changing documents is hard! Maybe put that in the "nice to have" pile, being able to mark annotations as having lost their anchors, but realize that's part of a larger question about 'who gets to moderate hypothes.is'

robertknight commented 2 years ago

I'm happy to do a workaround if this is an edge case, but i tried to wrap the hypothes.is load in a separate async method and couldn't quite get it to work (bad at js).

Unfortunately deferring the loading of Hypothesis won't really help here as the problem is just that too much continuous work is being done on the main thread. The fix will likely involve some combination of:

Making the search process more efficient
Dividing the search into smaller steps with pauses in between
Moving parts of the search into a background thread so it doesn't block the UI

sneakers-the-rat commented 2 years ago

I'd be happy to try and help with maybe a few pointers to where one might implement those strategies^

seems technically challenging, as, presumably you have already done the optimizations that are near at hand
makes sense to me, especially if it's possible to prioritize anchors that are near to the current viewing frame of the reader. I think this would just mean putting a sleep statement somewhere in the loop?
I don't know much about threading in javascript but this seems like the best option to me depending on how difficult it is to make background threads.

I also wonder if it would be possible to add a configuration option that lets embedding sites choose which kind of selectors to use, as i see all three are stored for each annotation? For example, in a page that is undergoing lots of expected editing it might be good to disable TextQuoteSelector to purposefully set a lower threshold to orphaning. In the case of this document, for example, a pretty sizeable proportion of the anchors that are placed are incorrect and would be better off orphaned anyway (eg. one reader making lots of annotations on single words, and when those are moved/edited the annotation just goes and latches onto another time that word is used)

It also seems reasonable to let people parameterize the search, I see a few points that look straightforward to implement:

setting the maximum number of errors: https://github.com/hypothesis/client/blob/63545fd3c056740c393b4aa14d27cbf0bd8ec12d/src/annotator/anchoring/match-quote.js#L98
disabling the approximate search altogether: https://github.com/hypothesis/client/blob/63545fd3c056740c393b4aa14d27cbf0bd8ec12d/src/annotator/anchoring/match-quote.js#L46
it seems like this could also be refactored to terminate search (at least the exact search) upon reaching some minimum score?: https://github.com/hypothesis/client/blob/63545fd3c056740c393b4aa14d27cbf0bd8ec12d/src/annotator/anchoring/match-quote.js#L155

It seems like that would be user-overridable if they really wanted to, right? They could always click the bookmarklet to reload the original overlay?

sneakers-the-rat commented 2 years ago

confirmed that doing this:

https://github.com/sneakers-the-rat/client/commit/91604ef257e8fe363414e2c45e6a8f8bb2224647

and excluding the approx search avoids maxing out the cpu for ~5s (on chrome), there is no detectable hang while browsing the page. More annotations are correctly orphaned, but there are some that are near-matches that should be there, so it seems like letting the user provide maxErrors (maybe either as an int or as an anonymous function that takes the quote, as is done there?) would be a good idea.

sneakers-the-rat commented 2 years ago

If you let me know if this is something y'all would be interested in and point me to maybe an example of how a comparable configuration field is implemented i'd be happy to draft a PR. at the moment I'm having to message people individually and ask them to delete their old comments that don't load, it would be nice to be able to have some workaround soon bc i'm going to launch this paper soon and otherwise would have to disable hypothes.is bc it's too laggy :(

sneakers-the-rat commented 2 years ago

I think the lowest impact fix might be to wrap https://github.com/hypothesis/client/blob/0c0a639b88d26da564a1d55a0aee85f14ec647a1/src/annotator/anchoring/html.js#L28 in a setTimeout call like

export function anchor(root, selectors, options = {}) {
setTimeout(() => {
 // body of anchor function
}, 0)

which should put each call within the map function at the bottom of execution queue and allow time between each execution to re-render the page? haven't tested yet, but can if it would be an acceptable fix if it worked?

robertknight commented 2 years ago

Looking at the matches for the 78 current annotations on this page, it looks like part of the problem relates to annotations which have short, generic quotes that occur many times in the text.

Below are the quotes and match counts, assuming a maximum of quote.length / 2 errors, for the 20 quotes with the largest number of matches:

[
  [ ' ', 41271 ],         [ '.', 2274 ],
  [ '.', 2274 ],          [ '.', 2274 ],
  [ '.', 2274 ],          [ '.', 2274 ],
  [ 'I', 590 ],           [ '!!', 87 ],
  [ '!! ', 78 ],          [ 'This', 70 ],
  [ 'both', 25 ],         [ 'some of', 19 ],
  [ 'IP', 18 ],           [ 'don’t exist', 14 ],
  [ 'semantic web', 12 ], [ 'export', 9 ],
  [ 'as much', 8 ],       [ 'implemented', 7 ],
  [ 'nightmarish', 6 ],   [ 'capitalism', 6 ]
]

Ranking the quote matches involves computing a score for the similarity of the prefix and suffix of each candidate to the expected prefix/suffix. I believe this is what generates most of the approximate-string matching calls, not the initial search for the small number of quotes. This is going to be pretty expensive compared to if we had included the context of the quote in the initial search.

If all the above is correct, the most widely applicable improvement we could make here would be to handle these short, generic quotes in a better way.

robertknight commented 2 years ago

Aside from fixing the performance problems, there are also some general UX issues with very short quotes (eg. punctuation marks) at the moment:

You can't tell what the annotation in the sidebar is about without the document available. It would be helpful if the context (ie. surrounding characters) were displayed in the annotation quote, with some emphasis to make it clear which text was annotated and which text is just context. For debugging I added the prefix and suffix to the quote and added an underline below the actual quoted text:
When scrolling to a short quote, such as a punctuation mark, it is hard to see exactly which part of the document is being referred to because the highlighted quote is so small. A more visible cue would be helpful in this case.

sneakers-the-rat commented 2 years ago

checking in again, asking what the best way for me to help would be, am totally happy to draft a PR if you let me know which would be acceptable, several ideas listed above fix the problem^^.

getting ready to publish, there's a whole section on hypothes.is so would love to have it present on the document, but currently unusable :( https://jon-e.net/infrastructure/#annotation--overlays

edit: seems like there are now so many unanchorable annotations that the client just crashes on mobile. definitely need either a way of disabling/increasing threshold of fuzzy matches or else ability to delete comments. I've asked folks to delete them but they have other stuff to do, would love to contribute a PR if you'd let me know which strategies would be acceptable -- I think adding config option for fuzzy matches would be the most straightforward way and would only affect explicit embeds

robertknight commented 1 year ago

Related to this issue, we're about to make a change so that the client won't even allow the creation of annotations where the quote consists entirely of whitespace characters. This would have prevented the creation of the annotation with the largest number of quote matches in https://github.com/hypothesis/client/issues/3919#issuecomment-974871178, although such annotations can still be created via the API.

sneakers-the-rat commented 1 year ago

This is fair, but it also is still the case that it doesn't address the general problem of how the execution is synchronous and blocks the rendering of the page in the case those annotations do exist - and so beyond whitespace, it should be generally true that some documents somewhere will have a lot of badly matched annotations and seems like something worth addressing directly. So something like creating workers in new threads as mentioned in https://github.com/hypothesis/client/issues/3919#issuecomment-970073630 , or even something that seems like a low-rent way of executing asynchronously might be doing this: https://github.com/hypothesis/client/issues/3919#issuecomment-974761117 .

That, and the matching is also still too eager and incorrectly matches (rather than correctly orphaning) too many annotations, which seems like something that could be resolved by exposing some of the configuration parameters for matching would be a code-light way to let people resolve those problems as needed as in https://github.com/hypothesis/client/issues/3919#issuecomment-970714508

Want to reiterate that i'm more than happy to help draft the PR if y'all would let me know which of those strategies you would accept!

robertknight commented 1 year ago

it should be generally true that some documents somewhere will have a lot of badly matched annotations and seems like something worth addressing directly. So something like creating workers in new threads as mentioned in https://github.com/hypothesis/client/issues/3919#issuecomment-970073630 , or even something that seems like a low-rent way of executing asynchronously might be doing this: https://github.com/hypothesis/client/issues/3919#issuecomment-974761117 .

The problem you ran into is pathological behavior with short quotes that have a very large number (thousands) of exact matches, as explained in https://github.com/hypothesis/client/issues/3919#issuecomment-974871178. I'm going to scope this issue to resolving that problem specifically. That doesn't preclude more general changes to make anchoring non-blocking as part of other work.

mattdricker commented 1 year ago

Per request by @robertknight in https://github.com/hypothesis/support/issues/37 adding here another example of this issue at https://ofce.github.io/logsoc/synthese.html

sneakers-the-rat commented 1 year ago

again suggesting that it seems like there are several strategies to resolving this bug by exposing additional configuration options that shouldn't degrade service in other ways: https://github.com/hypothesis/client/issues/3919#issuecomment-970714508

making the approximate search non-blocking might be complicated, but exposing an option to disable it or tweak its thresholds wouldn't be! this would expand possible uses for the tool where approximate matches might not be desirable at all.

I appreciate how y'all have a roadmap and are working on a related problem elsewhere, but in this case the bug is now making 3 documents of mine unusable where if i got the OK i could have submitted a 10-line patch a year and a half ago and fixed it.

For the benefit of others that might be experiencing this bug, doing this at least makes it so the page does an initial draw before freezing: https://github.com/sneakers-the-rat/surveillance-graphs/blob/main/assets/js/hypothesis.js

hypothesis / client