Derivatives for PDF Extracted Text files are not created

bseeger commented 2 years ago

This is not really a bug so much as something outside of what the system is designed to do out of the box.

Right now derivative generation is focused on Original Files, so any other type of file may not get a derivative made.

For example, the oral history records have the mp3 as the Original and Service file, along with a PDF transcript marked as Extracted Text. The system will not extract data from the PDF file automatically, as it is only set up to trigger extraction when the PDF is an Original File. (There should only be one Original File per media set.)

This could work by creating more Drupal Contexts and Actions to support the behavior. All the material is there, at least in some form, and with some tweaking and new wiring we could create a Context that triggers when a PDF file is added as Extracted Text to an Audio/Video/Image Node. A new Action would be needed to pull in the right Media Use type as the source of the derivative (right now the Action pulls the Original File).
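To make the gap concrete, here is a minimal, language-neutral sketch of the source-selection logic in Python. It is not Drupal or Islandora code: the `Media` class, the `find_source` helper, and the file names are hypothetical, and the only things taken from the real system are the PCDM use URIs behind the Media Use terms. It only models why a lookup keyed on Original File misses the transcript, and how a lookup keyed on Extracted Text would find it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# PCDM "use" URIs that back the Media Use taxonomy terms.
ORIGINAL_FILE = "http://pcdm.org/use#OriginalFile"
SERVICE_FILE = "http://pcdm.org/use#ServiceFile"
EXTRACTED_TEXT = "http://pcdm.org/use#ExtractedText"


@dataclass(frozen=True)
class Media:
    """Hypothetical stand-in for a media entity attached to a node."""
    name: str
    mime_type: str
    media_use: frozenset


def find_source(media_set: Iterable[Media],
                source_use_uri: str,
                mime_type: Optional[str] = None) -> Optional[Media]:
    """Return the first media tagged with source_use_uri.

    If mime_type is given, the media must also match it. Returns None
    when nothing qualifies, i.e. no derivative would be generated.
    """
    for media in media_set:
        if source_use_uri not in media.media_use:
            continue
        if mime_type is not None and media.mime_type != mime_type:
            continue
        return media
    return None


# Hypothetical oral history record, shaped like the example above.
oral_history = [
    Media("interview.mp3", "audio/mpeg",
          frozenset({ORIGINAL_FILE, SERVICE_FILE})),
    Media("transcript.pdf", "application/pdf",
          frozenset({EXTRACTED_TEXT})),
]

# Current behavior: only the Original File is considered, so the PDF is missed.
current = find_source(oral_history, ORIGINAL_FILE, "application/pdf")

# Proposed behavior: a new Action looks up the Extracted Text media instead.
proposed = find_source(oral_history, EXTRACTED_TEXT, "application/pdf")
```

In this model `current` is `None` while `proposed` is the transcript, which mirrors the change described: the trigger and the source lookup both need to key on the Extracted Text use rather than the Original File.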

bseeger commented 2 years ago

A new Action would be added as a plugin, like so: https://github.com/Islandora/islandora/blob/2.x/modules/islandora_text_extraction/src/Plugin/Action/GenerateOCRDerivative.php

Maybe the most logical place to put it in our setup is here: https://github.com/jhu-idc/idc_defaults/tree/main/src/Plugin in an Actions folder.

noahwsmith commented 2 years ago

Bethany has the right overview; this is fairly straightforward to configure. Unfortunately, we don't have any availability in the next six weeks or so to assist with this.

jaredgalanis commented 2 years ago

Closed by https://github.com/jhu-idc/idc-isle-dc/pull/301

jhu-idc / iDC-general

Derivatives for PDF Extracted Text files are not created #472