psu-libraries / scholarsphere

Penn State's next generation institutional repository
MIT License

Add accessibility checking to deposit workflow #1580

Open binkylush opened 2 weeks ago

binkylush commented 2 weeks ago

API info for Siteimprove: https://help.siteimprove.com/support/solutions/articles/80000448206-how-to-connect-to-the-siteimprove-api

binkylush commented 2 weeks ago

Sensus Access API info: https://www.sensusaccess.com/integrating-sensusaccess/

ajkiessl commented 1 week ago

We have access to Siteimprove, so let's test out their API. We still haven't heard back from Sensus.

EricDurante commented 1 week ago

Here's my initial assessment of the tools just based on the documentation that I can find:

Siteimprove

Siteimprove provides a specification for their API and a few notes about each of the endpoints, but no real overall guidance on how to use it (that I can find). Based on the specification, this tool looks like it's far from ideal for our use case. The API appears to support two different methods for checking documents for accessibility:

  1. You can feed it a URL for a website. It will crawl the public portions of the entire site (presumably by following links on pages within the domain, though it doesn't specifically say) and report on the accessibility of each page, including both HTML and PDF documents. This feature is probably useless for what we're trying to accomplish, since it doesn't give synchronous, per-document feedback in real time, and it can only access resources that are publicly available on the web.
  2. You can upload "content" directly for accessibility analysis. According to the API specification, the accepted content types for this method are text/plain, text/html and application/zip. I have no idea what that actually means in terms of what types of documents can be uploaded for analysis, but the list notably doesn't include application/pdf or any other kind of binary or marked-up text document. To understand what this feature actually does, I think that we're going to have to experiment with it. This would be the method that we'd want to use if it works with the right kinds of documents, but based on what I've seen so far, I don't have a lot of hope that it does.
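
To make the limitation in item 2 concrete, here is a minimal Ruby sketch of the check a deposit workflow would need before sending a file to the content-upload method. Only the three accepted content types come from the API specification as described above; the helper name `siteimprove_uploadable?` and the extension-to-type table are hypothetical and exist purely for illustration.

```ruby
# Content types that the Siteimprove spec lists for direct upload,
# as described in item 2 above.
SITEIMPROVE_ACCEPTED_TYPES = %w[text/plain text/html application/zip].freeze

# Hypothetical lookup table for illustration only. A real app would
# probably use the content type already stored with the deposited file.
EXTENSION_TYPES = {
  '.txt'  => 'text/plain',
  '.html' => 'text/html',
  '.zip'  => 'application/zip',
  '.pdf'  => 'application/pdf'
}.freeze

# Returns true only when the file's (assumed) content type is one that
# the upload method claims to accept. Unknown extensions return false.
def siteimprove_uploadable?(filename)
  type = EXTENSION_TYPES[File.extname(filename).downcase]
  SITEIMPROVE_ACCEPTED_TYPES.include?(type)
end
```

Under these assumptions, `siteimprove_uploadable?('thesis.pdf')` is false, which is the core concern: PDFs, likely the most common deposit type, would be filtered out unless experimentation shows the API accepts more than the specification says.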

The bottom line is that we should do a little experimentation to be sure, but it doesn't look like Siteimprove is going to be a useful tool for our proposed workflow.

Sensus Access

I can't find any publicly available documentation whatsoever about the Sensus Access API. I tried to sign up for an account to see if that would allow me to view any API docs, but the account sign-up process is apparently manual, and I haven't yet received a response to my request for an account. At this point, I have no idea if Sensus Access provides what we need.

Other tools

I've been searching for other tools that might be a better fit. In particular, I think that it would be ideal if we could find a tool that we can install and run locally alongside the ScholarSphere app rather than requiring us to upload documents to a third-party web service. So far I haven't found anything that looks promising, but I'll continue looking.

EricDurante commented 1 week ago

This may be useful for analyzing PDFs.

EricDurante commented 8 hours ago

It looks like Adobe's PDF Services API is probably going to be our best bet for this. I signed up for a free-tier account and was able to try out the accessibility checker by obtaining an API key and following the general documentation and the specific documentation for the PDF Accessibility Checker.

Based on this documentation, it appears that it may be possible to process documents directly from their location in ScholarSphere's AWS S3 storage.

I looked around for a Ruby wrapper for this REST API, but didn't find anything that looked useful. Since we're only interested in performing one type of operation, I think it will be pretty trivial to implement our own client. For the test that I did, I simply used curl.
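
As a rough idea of how small such a client could be, here is a sketch using only Ruby's standard library. The base URL, the operation path, the header names, and the `assetID` body field are assumptions based on my reading of Adobe's REST documentation and should be verified before use; the method names `build_checker_request` and `start_check` are hypothetical. It also assumes the PDF has already been uploaded as an asset and that an access token has been obtained separately.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Assumed values; check them against Adobe's current API documentation.
ADOBE_BASE_URL = 'https://pdf-services.adobe.io'
CHECKER_PATH = '/operation/accessibilitychecker'

# Builds (but does not send) the request that would start an
# accessibility check for an already-uploaded asset.
def build_checker_request(access_token:, client_id:, asset_id:)
  uri = URI.join(ADOBE_BASE_URL, CHECKER_PATH)
  request = Net::HTTP::Post.new(uri)
  request['Authorization'] = "Bearer #{access_token}"
  request['x-api-key'] = client_id
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ 'assetID' => asset_id })
  request
end

# Sends a request built above and returns the raw response. The job is
# assumed to be asynchronous, so the caller would still need to poll
# for the result using whatever status URL the response provides.
def start_check(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

Separating request construction from sending keeps the part that depends on Adobe's API shape easy to unit test without network access.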

I'm not sure if there's an existing Adobe subscription that we can use, but a free-tier account gives us "500 free Document Transactions per month". I'm not yet sure what exactly counts as a "document transaction" since I don't currently see any activity when I check my API key usage after running the test.