Indicate on document selection screen that PDFs must have selectable text

klemay commented 4 years ago

Background

In a discussion of edu/LMS support tickets that have come in since September, PDF issues (specifically, instructors trying to upload and annotate PDFs that don't have selectable text) were among the most common.

We have a KB article that explains how to ensure PDFs have selectable text and we do mention this while onboarding partners. However, we still get a high volume of support requests around this. In addition to a "cultural" fix, we'd like to employ a product-based solution.

User story

As an instructor using the Hypothesis LMS app with my students, I'm not familiar with the concept of OCR for PDFs, and I don't necessarily know to check whether my PDFs have selectable text before creating a Hypothesis-enabled assignment.

Brainstorming

One way to approach this could be to add something to our assignment creation screen:

Something to the effect of: "PDFs must have selectable text..." with a link to our KB article.

This would cut down on the number of instructors creating assignments with non-annotate-able PDFs.

dwhly commented 4 years ago

I think a message to that effect is definitely appropriate.

However, I'd suggest going one step further: once they upload the PDF, we download it and run a quick test on it to see whether there is any text at all in the semantic layer of the PDF. This is a very quick and easy test to do. docdrop.org does it, for instance, as a way to determine whether it should auto-OCR a given PDF. If the answer is "no", then we should return (maybe even immediately) an alert to the user that there is no selectable text, with advice on how to correct it.
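To make the idea concrete, here is a minimal sketch of such a check using only the Python standard library. It is a rough heuristic and not how docdrop.org or the LMS app does it. The function name `has_text_layer` is hypothetical. It assumes content streams are either uncompressed or Flate-compressed, and it will miss encrypted PDFs and other stream filters. A production check would more likely use a real PDF library.

```python
import re
import zlib

# Raw bytes between each "stream" / "endstream" keyword pair.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# A string or array operand followed by a text-showing operator (Tj or TJ).
TEXT_OP_RE = re.compile(rb"[)\]>]\s*(?:Tj|TJ)\b")


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text.

    Assumption: streams are uncompressed or Flate-compressed.
    Other filters and encrypted files are not handled.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing or slightly truncated data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate data; scan the bytes as they are
        # Text is drawn inside BT ... ET blocks with Tj/TJ operators.
        if b"BT" in data and TEXT_OP_RE.search(data):
            return True
    return False
```

A caller could run `has_text_layer(uploaded_bytes)` right after the upload and show the "no selectable text" alert when it returns `False`. Since invisible OCR text is still drawn with the same operators, PDFs that have already been OCR'd should pass this check.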

One benefit of this is that we'd instantly be able to gather data on the prevalence of PDFs w/ no text layer.

It's possible that we could do what docdrop does and use Tesseract to auto-OCR, but unfortunately, the quality of Tesseract (the most popular open source OCR package) is quite poor. Both Acrobat Pro and ABBYY do much better. We might create more problems for ourselves short term if we were to automatically do a crappy OCR job, though that's possibly debatable. Long term, we should probably implement ABBYY or another similar programmatic solution, even if it's closed source. That would definitely be something that would add to our value chain.

klemay commented 4 years ago

@dwhly I really like the idea of scanning a PDF upon upload. Even without an auto-OCR process to follow, it'd be really useful.
