hypothesis / support-legacy

a place for tracking support-related work and projects

Spike: PDF has selectable text in PDF.js in Firefox and works as expected, but has very little selectable text in the LMS app #188

Open mkdir-washington-edu opened 3 years ago

mkdir-washington-edu commented 3 years ago

Desired outcome of the Spike is:

Note: it is possible to run a version of pdf.js locally which matches the version of pdf.js that we serve with the LMS app. It's beyond the capabilities of the support team, though.


Describe the bug

A user has provided a PDF that is fully selectable in PDF.js in Firefox, but has a broken text layer (very little selectable text, sporadically arranged throughout the page) when viewed in the LMS app.

Select all in Firefox on the first page: image

Select all in the LMS app on the first page: image

To Reproduce

Steps to reproduce the behavior:

  1. Download PDF attached to this issue
  2. Open in Firefox
  3. Select all text
  4. Log in to the Hypothesis Canvas instance and navigate to https://hypothesis.instructure.com/courses/92/assignments/1583
  5. Open assignment
  6. Select all text
  7. Note difference in selectable text.
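
To make the difference in step 7 easier to report, the two "select all" results can be pasted into a small script that estimates how much of the Firefox text is also selectable in the LMS app. This is only a rough sketch, not part of the Hypothesis tooling: the function names (`words`, `text_layer_coverage`) and the sample strings are hypothetical, and the word-overlap measure is an assumption about what counts as a useful comparison.

```python
import re
from collections import Counter


def words(text: str) -> Counter:
    """Lowercase word counts; whitespace and punctuation are ignored."""
    return Counter(re.findall(r"\w+", text.lower()))


def text_layer_coverage(reference: str, candidate: str) -> float:
    """Fraction of the reference words that also appear in the candidate.

    1.0 means every word selected in the reference viewer was also
    selectable in the candidate viewer; values near 0.0 suggest a
    broken text layer in the candidate viewer.
    """
    ref = words(reference)
    if not ref:
        return 0.0
    cand = words(candidate)
    shared = sum((ref & cand).values())  # multiset intersection
    return shared / sum(ref.values())


# Hypothetical pasted selections, NOT taken from the attached PDF.
firefox_text = "The quick brown fox jumps over the lazy dog"
lms_text = "quick   fox dog"
coverage = text_layer_coverage(firefox_text, lms_text)
print(f"{coverage:.0%} of the Firefox words are selectable in the LMS app")
```

Word counts are compared rather than raw strings because a damaged text layer often returns fragments in a different order, which would make a character-by-character diff hard to read.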

Expected behavior

While the selectable text occasionally differs between Firefox and Chrome, Firefox has in the past been a good way for instructors to test a PDF before trying it out in the LMS app.

Screenshots

Firefox console: image

LMS app Console in Chrome: image


PDF file

Clare Goll & DH Lawrence combined for Hypothesis.pdf

mkdir-washington-edu commented 3 years ago

I tried to force OCR on this document using the DocDrop OCR tool to see whether the resulting document had similar issues. It's been 30 minutes and DocDrop hasn't finished processing the file.

I will add the file and the results of the tests in a comment once I am able.

mkdir-washington-edu commented 3 years ago

There seems to be no selectable text in Safari 13.0.5.

All text is selectable in Chrome 89.0. FYI we've seen PDFs in the past that had selectable text in Chrome but not in PDF.js (both Firefox and the LMS app), which is why we typically test PDFs in Firefox.

Chrome selectable text: image

mkdir-washington-edu commented 3 years ago

The DocDrop Force OCR option doesn't work on this file.

Exporting the file to image files, recombining them into a PDF, and then OCRing does work; the selectable text is present in Firefox and the LMS app. However, this isn't a useful solution for users.

klemay commented 3 years ago

Added to our bug & product backlog as a Spike - a good outcome of that Spike would be:

mkdir-washington-edu commented 3 years ago

Here is another problem PDF from the same instructor, should you need more examples: Offen and Steinbach combined for hypothesis assignment.pdf

mkdir-washington-edu commented 3 years ago

And here's an example with the added "Read here" text in red that does work properly in both Firefox and the LMS app, in case a comparison is needed: Combined pr sources - nuremberg & mass shooting.pdf