PDF quote selectors can capture text that is not part of the PDF

robertknight commented 3 years ago

I noticed some TextQuoteSelector selectors in my local h database that had "Loading annotations..." in the prefix or suffix fields. This text is not part of the document but a placeholder DOM element that the client creates in un-rendered PDF pages to serve as a temporary location to attach highlights to until the page is actually rendered.

Looking at the way that quote selectors are created for PDFs, it is possible that they can include text which is not part of a rendered PDF page: https://github.com/hypothesis/client/blob/b1163c4f7e4eb19c5af0f37beaadc7c886551f69/src/annotator/anchoring/pdf.js#L569

It seems that there is a "buffer" of whitespace in the DOM before and after the text layer elements in the PDF, so that is what gets captured as the prefix if you annotate text at the start of the PDF or the suffix if you annotate text at the end. You can see this if you annotate the "Dummy PDF file" text on https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf.

robertknight commented 3 years ago

We could fix this by limiting the quote selector context to only contain text from the same page as the quote. This would avoid capturing empty whitespace as context if annotating the start or end of the PDF. However it could mean that less useful quotes are captured for pages in the middle of the document. For example, if the user annotates text at the end of a page, where the page ends in the middle of a sentence, it would be ideal if the context captured the continuation of the sentence from the next page.

I think ideally we still want quote selectors to be able to capture context from pages before or after, but not capture any text from elsewhere in the DOM. For un-rendered pages we want to capture the page text (requested from PDF.js) rather than the placeholder content in the text layer.

robertknight commented 3 years ago

For example, if the user annotates text at the end of a page, where the page ends in the middle of a sentence, it would be ideal if the context captured the continuation of the sentence from the next page.

In practice, although quotes captured from the start or end of a page sometimes capture the text that logically comes before or after the selection from the previous or next page, this often doesn't happen for several reasons:

Pages may contain headers, footers etc. which appear in-between the body of the text
The order of text items in a page returned by PDF.js often does not match what a reader might expect. My guess is that the order will be whatever order the text items appeared in the source document. The ordering is usually what you'd expect within a sentence or paragraph, but less so across distinct parts of a page (eg. between paragraphs of content and figure captions or headers).

On this basis it might be better to restrict the context captured for quotes to the same page as the quote. This would also make quote selector generation symmetric with anchoring, since when anchoring we only search the text of one page at a time.

hypothesis / client

PDF quote selectors can capture text that is not part of the PDF #3737