hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

PDF text layer garbled in PDF.js #1277

Closed by mattdricker 2 years ago

mattdricker commented 3 years ago

From user ticket: https://app.hubspot.com/contacts/6291320/ticket/573368454/

User reported that the quoted text was garbled when selecting content from a specific PDF to annotate.

[Screenshot: Screen Shot 2021-09-24 at 10 55 19 AM]

Confirmed that using PDF.js to select text in the document -- both with the Hypothesis client and natively in Firefox -- results in garbled output.

Selecting the same text in other PDF viewers (macOS Preview, Adobe Acrobat, Chrome browser) copies it cleanly, with no garbling.

Running the PDF through OCR again fixes the issue.

This may be a fluke or an extremely rare edge case, but it may be worth investigating why PDF.js behaves differently from other PDF viewers when rendering the text layer.
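
For anyone triaging a similar report, here is a minimal sketch of how the comparison above could be made less subjective. It assumes the same passage has been copied by hand from PDF.js and from a viewer that produces clean text. The function names and the 0.8 threshold are hypothetical and are not part of PDF.js or the Hypothesis client.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs so line-wrapping differences are ignored."""
    return " ".join(text.split())


def similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 similarity ratio between two copied selections."""
    return SequenceMatcher(None, normalize(reference), normalize(candidate)).ratio()


def looks_garbled(reference: str, candidate: str, threshold: float = 0.8) -> bool:
    """Flag the candidate when it falls below the (assumed) similarity threshold.

    `reference` is the text copied from a viewer known to be clean;
    `candidate` is the text copied from the viewer under suspicion.
    """
    return similarity(reference, candidate) < threshold
```

Paste the selection from Preview or Acrobat as `reference` and the PDF.js selection as `candidate`. A low ratio suggests the text layer itself differs, rather than just the line wrapping.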

Example PDF (Hypothesis staff only): https://drive.google.com/file/d/1qG9Ea5D3lVGNUrhsLCn_fB9BpX8MFnoN/view?usp=sharing

mattdricker commented 3 years ago

Here is a one-page excerpt PDF of the problem text:

Freakonomics_page10.pdf

mattdricker commented 2 years ago

Investigating this again a year later, I find that PDF.js no longer garbles the text layer. This was likely fixed in an updated version of PDF.js.

Closing issue.