Exhibits stats needed for SDR PDFs, video, audio (+ Youtube embeds?) to facilitate a11y remediation prioritization

caaster commented 11 months ago

In June 2023 Tom asked me the following:

"We will need some sort of census of media and PDF objects in all published Spotlight exhibits. Is there a smart way to do this? If not, we should add creating such a census to the list of tasks that will need some hand work / brute forcing."

I believe this might be best broken down into the following tasks:

Determine how to query/extract a list of all druids that have been added to all published exhibits
When this step is completed, it will be important to keep track of when it occurred, so that Cathy can monitor any exhibits published after this date (work with exhibit creators to fully understand their uploaded content & if any media files or PDFs are present)
Be sure to filter out all druids that have been set to private by exhibit creators
Determine which druids are collection druids, and expand the list so that all druids in every collection will be considered in the next task (perhaps this step might not be necessary, depending upon how the step below is carried out?)
Determine how to run a query against all druids to create 2 separate lists: one list for media objects, and one list for PDFs
In addition, would it be possible to construct a query that would result in a list of all exhibits that use the video widget, alongside a link for the exhibit page where the video widget is used? We have a hopefully small number of Youtube (or Vimeo) videos that are embedded in exhibits, and these will all need hand-corrected captions to be accessible.

caaster commented 11 months ago

@dinahhandel - is it ok to have all media objects lumped together? Or would you prefer that video and audio are differentiated?

corylown commented 9 months ago

@caaster here's a first pass at the report: https://docs.google.com/spreadsheets/d/1uMCMp1qItXx6LSL0DOakJJKVHrU3Kg3t0HAkwCGmA8Y/edit?usp=sharing

caaster commented 9 months ago

@corylown Thank you for this work. I have a question. For the tab entitled "Druid Report, Nov 22, 2023", is there any way to determine which druids of the type "file" in column D might be PDFs (this is the info that Tom is asking for)? And I confess I don't know what file type the designation "document" pertains to. I worry that the accuracy of the type designation depends on a human (and therefore fallible) decision by each SDR accessioneer. Is there a specific type that should be used for PDFs? I am flagging @andrewjbtw here for any wisdom he can impart.

andrewjbtw commented 9 months ago

If the "content type" in the report is the content type from the contentMetadata/structural metadata, then that designation should be pretty accurate. If it's from a MODS description field, it's more likely to be subject to differences in user categorization.

In SDR terminology, "document" is the type that was given to what's essentially a PDF viewer. There seems to have been some ambition to support multiple file types in addition to PDF (like .DOCX) but only PDFs have ever been supported and only PDFs are likely to be supported in the near and medium-term future.

That said, the really difficult part of identifying PDFs in SDR is that they can appear in multiple content types:

book/image (a PDF of each page or a PDF of the whole document may be present, and usually is if the item went through Goobi and ABBYY OCR)
media (a PDF may be an accompanying file, such as a transcript or program)
file (PDFs may be in the list of files)

I think the only reliable way to identify PDFs is to parse the file metadata. Content type will only get you so far.

caaster commented 9 months ago

This is incredibly helpful, @andrewjbtw. Thank you.

caaster commented 9 months ago

@corylown in light of Andrew's very clear explanation above, I propose that I will:

Crib from Andrew's explanations to enhance the spreadsheet readme.
Then, share the results with Tom and ask him if this is sufficient for the needs he has in mind. I also want to revalidate how he wants us to use this info to help determine our accessibility prioritization needs.

When I find out the answers to these questions I'll get back to you. Sound ok?

P.S. Please let me know where you pulled the content type from for the spreadsheet -- refer to Andrew's initial comments:

"If the "content type" in the report is the content type from the contentMetadata/structural metadata, then that designation should be pretty accurate. If it's from a MODS description field it's more likely to subject to differences in user categorization."

corylown commented 9 months ago

@caaster the content type field value is from the public XML's contentMetadata.

Your proposed plan sounds good to me.

corylown commented 9 months ago

@caaster I excluded images from this report, but given Andrew's comment about book/image types possibly being a PDF I'm wondering if image druids should be included. This will greatly increase the number of druids in the report and many of them are likely not to be PDFs/documents.

Caster commented 9 months ago

I guess you meant to tag @caaster, @corylown 🙈

caaster commented 9 months ago

Hi @corylown. I'm doing some double checking of the spreadsheet. I have made a mistake in the information I provided to you. I stated that all videos would be embedded using the video widget. For at least one exhibit (https://exhibits.stanford.edu/exhibits-documentation), that is not always the case. Sometimes the iframe widget was used to display a Youtube video instead, as is the case with 3 Youtube videos on this page: https://exhibits.stanford.edu/exhibits-documentation/feature/iframe/edit. I apologize because I know this represented a lot of tedious work on your end. I wonder how much time it would take to run a query & produce a list of all instances of the iframe widget as well?

Secondly, because I know Tom will ask as soon as I am able to get the spreadsheet on his radar: in light of Andrew's explanations, is it possible, without days of work, to parse the file metadata for exhibit items of content type document(?), book, image, file, and media to identify the list of PDFs for each druid?

P.S. Please note I have added columns to the spreadsheet (specifically the video widget tab) to help me manage future remediation work, just fyi.

corylown commented 9 months ago

@caaster I added a new tab to the google sheet named "iframe Widget Report Dec 04, 2023" with a report of the youtube and vimeo videos embedded using the iframe widget.

I can generate a report of the PDF files, but each druid has to be looked up individually (via a script), which will take a while for several hundred thousand druids. Will report back when I have something.
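For illustration only, a per-druid lookup loop might be sketched in Python as below. This is not the script referred to above; the function names, the CSV columns, and the fake lookup table are all hypothetical. The lookup itself is passed in as a function so the loop can be tried without any network access, and failures are recorded per druid so that one bad lookup does not stop a long run.

```python
import csv
import io
import time
from typing import Callable, Iterable, TextIO


def build_pdf_report(
    druids: Iterable[str],
    fetch_pdfs: Callable[[str], list[str]],
    out: TextIO,
    delay_seconds: float = 0.0,
) -> int:
    """Look up each druid individually and write one CSV row per druid."""
    writer = csv.writer(out)
    writer.writerow(["druid", "pdf_count", "pdf_files", "error"])
    processed = 0
    for druid in druids:
        try:
            pdfs = fetch_pdfs(druid)
            writer.writerow([druid, len(pdfs), ";".join(pdfs), ""])
        except Exception as exc:  # record the failure and keep going
            writer.writerow([druid, "", "", str(exc)])
        processed += 1
        if delay_seconds:
            time.sleep(delay_seconds)  # optional pause between lookups
    return processed


# Hypothetical stand-in for a real lookup; the druids and file names are made up.
FAKE_RESULTS = {"bb000aa0000": ["transcript.pdf"], "cc111dd2222": []}


def fake_fetch(druid: str) -> list[str]:
    return FAKE_RESULTS[druid]  # raises KeyError for unknown druids


if __name__ == "__main__":
    buffer = io.StringIO()
    build_pdf_report(["bb000aa0000", "cc111dd2222", "zz999zz9999"], fake_fetch, buffer)
    print(buffer.getvalue())
```

With several hundred thousand druids, even a short pause per lookup adds up to many hours, so a real run would probably also want to save progress and resume.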

caaster commented 6 months ago

Closing. Can re-open this ticket if Tom has additional requirements.

caaster commented 4 months ago

Hi @corylown. I am regretfully re-opening this ticket.

The good news is that Hannah and I are using a bunch of the data you generated to identify and prioritize captioning remediation for SDR, Youtube, & Vimeo videos in Spotlight exhibits that are shown on the home, feature, or about pages. Yay!

The not good news is that I made a mistake. When I said that the only widgets used to show videos on a page are the video widget and the iframe widget, I was wrong. It turns out, the item embed widget is also used sometimes. My question is, is it possible to create a query for use of the item embed widget for media files only (audio or video)? We don't care about any other file type right now.

When you have time, let me know. We can start remediating the other files first, but it would be helpful to have this data if it could be created during the next maintenance break?

caaster commented 4 months ago

P.S. I just stumbled across an unexplained missing data point on the current spreadsheet that I wanted to run by you. I'm looking at the tab for the iframe widget. I don't see any PURLs listed, but I have just found one on this page -- it is the iframe widget used for an SDR PURL. Do you know why the query for the use of the iframe widget missed this item? I am concerned there might be other items missed.

corylown commented 4 months ago

@caaster looking back over how I gathered the list of videos in the iframe and embed widgets I may have misinterpreted your instructions because I only looked for youtube or vimeo videos in these widgets. Since I keep everything :-) it's easy enough for me to modify this script to look for PURL embeds too. It should also not be too difficult to expand the script to search the Embed widget. I'll mark this as something for me to look at during our unscheduled week(s) in mid June.
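To make the idea concrete, here is a minimal Python sketch of classifying iframe embeds by host so that PURL embeds are reported alongside Youtube and Vimeo ones. It is not the script mentioned above. The provider table, the function names, and the assumption that PURL embeds point at purl.stanford.edu are all illustrative and would need adjusting to how the widgets actually store their content.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed host names per provider; extend or correct as needed.
PROVIDERS = {
    "youtube": ("youtube.com", "youtube-nocookie.com", "youtu.be"),
    "vimeo": ("vimeo.com",),
    "purl": ("purl.stanford.edu",),
}


def classify(url: str) -> str:
    """Return the provider name for a URL, or "other" if no provider matches."""
    host = (urlparse(url).hostname or "").lower()
    for provider, domains in PROVIDERS.items():
        # Match the domain itself or any subdomain (e.g. player.vimeo.com).
        if any(host == d or host.endswith("." + d) for d in domains):
            return provider
    return "other"


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every iframe tag in a block of HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def report_embeds(html: str) -> list[tuple[str, str]]:
    """Return (src, provider) pairs for each iframe found in the HTML."""
    collector = IframeSrcCollector()
    collector.feed(html)
    return [(src, classify(src)) for src in collector.sources]


if __name__ == "__main__":
    sample = (
        '<iframe src="https://www.youtube.com/embed/abc"></iframe>'
        '<iframe src="https://purl.stanford.edu/bb000aa0000/iframe"></iframe>'
    )
    for src, provider in report_embeds(sample):
        print(provider, src)
```

Reporting everything and labelling unmatched hosts as "other", rather than filtering them out, makes it easier to spot embeds that were not anticipated.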

caaster commented 4 months ago

Thank you @corylown - sounds good. I apologize for my lack of clarity on the original task! I wasn't thinking expansively enough when we discussed it, so my mistake :(

corylown commented 2 months ago

@caaster I've added three new tabs to the Exhibit Druid and Video Report spreadsheet:

Video Widget Report June 10, 2024 (appears unchanged from the November 2023 report)
iframe Widget Report June 10, 2024 (Now includes all embeds. I did not filter anything out so it includes google docs and maps among other things)
oembed Widget Report June 10, 2024 (New report of videos embedded using the oembed widget)

caaster commented 2 months ago

@corylown This is excellent, thank you so much! Hannah and I are working on an important remediation project to ensure that every video included on an exhibit home, feature, or about page has corrected captions available. This is so that we can meet an important a11y requirement being enforced by SODA. So this information is essential! I am closing this ticket again now, fingers crossed.

sul-dlss / exhibits #2360