hypothesis / support-legacy

a place for tracking support-related work and projects

PDF.js does not recognize semantic markup in PDFs #214

Closed mattdricker closed 2 years ago

mattdricker commented 3 years ago

As reported to us during a meeting with the Ohio State University accessibility team, Hypothesis, which uses PDF.js as its PDF viewer, does not recognize or expose any semantic markup or tagging (the tag tree) that the PDF author may have added. As a result, any such tagging is opaque to screen readers and other adaptive technology tools.

This is a large barrier to meeting the accessibility requirements at OSU, and a considerable gap in our efforts to meet the accessibility needs of all our users.

mattdricker commented 3 years ago

PDF.js appears to be making recent improvements to read the tag tree:

Corey at OSU has emailed us to report that using the latest PDF.js pre-release v2.9.359 may work quite well.

robertknight commented 3 years ago

Thanks for the update Matt. Per the release notes (https://github.com/mozilla/pdf.js/releases/tag/v2.9.359), there are some significant changes to rendering of the hidden text layer in this release:

This release features improved text layer rendering (so words and whitespace better match the rendered page)

This has the potential to impact anchoring existing annotations made with Hypothesis, so we need to test this carefully before we can ship this change.
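As a rough illustration of the kind of check this testing could involve, here is a minimal sketch. It assumes each annotation stores the text it quotes and that a page's text layer is available as a plain string. The helper names (`stripWhitespace`, `quoteStillAnchors`, `findOrphans`) are hypothetical and are not part of Hypothesis or PDF.js. It only flags quotes that can no longer be found once whitespace differences are ignored.

```typescript
// Hypothetical helpers for illustration only; not Hypothesis or PDF.js APIs.

/** Remove all whitespace so spacing changes in the text layer are ignored. */
function stripWhitespace(text: string): string {
  return text.replace(/\s+/g, "");
}

/** True if the quote can still be found in the page text, ignoring whitespace. */
function quoteStillAnchors(quote: string, pageText: string): boolean {
  const needle = stripWhitespace(quote);
  if (needle.length === 0) {
    return false; // an empty quote cannot be anchored
  }
  return stripWhitespace(pageText).includes(needle);
}

/** Return the quotes that cannot be found in the new text layer. */
function findOrphans(quotes: string[], newPageText: string): string[] {
  return quotes.filter((quote) => !quoteStillAnchors(quote, newPageText));
}

// Example: the first quote was captured from an older text layer with odd spacing.
const newText = "Semantic markup in PDFs";
const quotes = ["markup in PDF s", "tag tree"];
console.log(findOrphans(quotes, newText));
```

Running a check like this over a sample of existing annotations, against text extracted by both the old and new PDF.js versions, would give a first estimate of how many annotations might fail to reanchor.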

dwhly commented 3 years ago

This release features improved text layer rendering

This has been an issue for so long. Completely awesome if this really is a substantial improvement.

Obviously we need to understand the impact any changes would have.

However, assuming that:

I think the decision should probably be to proceed anyway (assuming there isn't some magic solution, needing implementation, that would allow us both to proceed and to be able to successfully reanchor historic annotations).

We're still in a kind of happy early state where the large majority of annotations are freshly made on documents each semester, and neither students nor teachers are able to return to the ones they've made earlier in a prior course. That will soon change w/ course copy functionality (at some point) allowing teachers to copy forward annotations made as scaffolding on documents they teach regularly, and also any features that allow students to claim and preserve annotations they make during courses.

Obviously w/ > 25 million annotations now, made over the course of 7 years or so, there may be some pain, but moving towards better tech for the billions of annotations that will follow probably gets the vote.

mattdricker commented 3 years ago

Internal Slack convos for reference: https://hypothes-is.slack.com/archives/C8TPC8XMK/p1622039652008500 https://hypothes-is.slack.com/archives/C8TPC8XMK/p1625076568000700

mattdricker commented 2 years ago

Solved with update to latest PDF.js https://github.com/hypothesis/pdf.js-hypothes.is/commit/0fc20ea86774dee228f6474502e3038915401d94