scientist-softserv / iiif_print

A gem for Hyrax/Samvera for displaying PDF pages in a IIIF Compliant viewer
Apache License 2.0

🎁 Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo #220

Open jeremyf opened 1 year ago

jeremyf commented 1 year ago

This follows on the work of https://github.com/scientist-softserv/iiif_print/issues/219 and relates to https://github.com/scientist-softserv/adventist_knapsack/issues/406.

Given a FileSet with an original file of a PDF
And that PDF has been handled by the rodeo
When the IIIF Print gem goes to split the PDF
Then we should use the pre-processed rodeo files instead of running any inline splitting

Discussion

When we split a PDF into multiple pages, we likely do not want to fall back to the Hyrax::FileSetDerivativeService. That service is for converting original files. We instead want to utilize the image, extracted text, etc. that the Derivative::Rodeo created.

We also have existing PDF splitting and do not yet want to disrupt that processing. So the strategy is to create a new process that we use to handle split PDFs. We could, in theory, fall back to the existing IIIF Print split processing if the PDF does not have pages in the rodeo.

An assumption is that, for a given file, the rodeo will have either none or all of the constituent pages. That is to say, we should not expect that IIIF Print would create the image and handle OCR for a single page of the PDF.

By design, we could demand that the rodeo split the PDF and return the constituent pages and their derivatives.

Also consider that we may not need to wait for all of the splitting jobs. Instead we can create the child work, create a file set, and assign the rodeo files directly. We will likely not want to run the derivatives for the created file set.
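
The fall-back idea discussed above can be sketched as follows. This is a minimal illustration, not IIIF Print code: `page_images_for`, `rodeo_pages`, and `inline_splitter` are hypothetical names, and the sketch leans on the stated assumption that the rodeo holds either all of a PDF's pages or none of them.

```ruby
# Hypothetical sketch: prefer pre-processed rodeo pages, else split inline.
# `rodeo_pages` and `inline_splitter` are illustrative callables, not real APIs.
def page_images_for(pdf_path, rodeo_pages:, inline_splitter:)
  pages = rodeo_pages.call(pdf_path)
  # Assumption from the discussion: the rodeo has all pages or none,
  # so any non-empty result is treated as complete.
  return pages unless pages.empty?

  inline_splitter.call(pdf_path)
end

rodeo_hit = ->(_path) { ["abcd-1.jpeg", "abcd-2.jpeg"] }
inline = ->(path) { ["inline split of #{path}"] }
page_images_for("abcd.pdf", rodeo_pages: rodeo_hit, inline_splitter: inline)
```

Because the existing splitter is only reached when the rodeo returns nothing, the current processing stays undisturbed for PDFs the rodeo has never seen.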

2023-05-31 Notes

To leverage the Derivative Rodeo’s PdfSplitGenerator, we need to create a wrapper class in IIIF Print.

The wrapper class should have a .call method that has the following signature:

def self.call(path, file_set:)
end

That will allow us to replace the inner workings of IiifPrint::Jobs::ChildWorksFromPdfJob#split_pdf (see below)

def split_pdf(original_pdf_path, user, child_model)
  # TODO: This is the place to change out the existing service and instead use the derivative
  # rodeo; we will likely need to look at method signatures to tighten this interface.
  image_files = @parent_work.iiif_print_config.pdf_splitter_service.call(original_pdf_path)

With the file_set, we can use the IiifPrint::DerivativeRodeoService.derivative_rodeo_input_uri to create the pre_process/input_uri of the PDF, which we then pass to the PdfSplitGenerator. The output templates will also need to consider how we write the file.

##
# This method "hard-codes" some existing assumptions about the input_uri based on
# implementations for Adventist.  Those are reasonable assumptions but time will tell how
# reasonable.
#
# @param file_set [FileSet]
# @return [String]
def self.derivative_rodeo_input_uri(file_set:)
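
A rough sketch of how such a wrapper could hang together is below. Only the `.call(path, file_set:)` signature comes from the notes above; the class name `RodeoPdfSplitterSketch`, the two injected collaborators, and `FakeFileSet` are assumptions made so the example runs without Hyrax or the rodeo. The real PdfSplitGenerator interface is not shown here.

```ruby
# Hypothetical wrapper sketch. The collaborators are injected stand-ins
# (assumptions), not the actual IIIF Print or Derivative Rodeo classes.
class RodeoPdfSplitterSketch
  class << self
    # Stand-in for IiifPrint::DerivativeRodeoService.derivative_rodeo_input_uri.
    attr_accessor :input_uri_builder
    # Stand-in for the rodeo's PDF split generator.
    attr_accessor :page_generator
  end

  # `path` is kept so the wrapper matches the existing splitter interface;
  # this sketch derives the location from the file set instead.
  def self.call(path, file_set:)
    input_uri = input_uri_builder.call(file_set: file_set)
    page_generator.call(input_uri)
  end
end

FakeFileSet = Struct.new(:parent_id, :filename)
RodeoPdfSplitterSketch.input_uri_builder = lambda do |file_set:|
  "s3://host-bucket/#{file_set.parent_id}/abcd/#{file_set.filename}"
end
RodeoPdfSplitterSketch.page_generator = lambda do |uri|
  directory = uri.sub(%r{/[^/]+\z}, "") # drop the trailing file name
  [1, 2].map { |page| "#{directory}/pages/abcd-#{page}.jpeg" }
end

RodeoPdfSplitterSketch.call("/tmp/abcd.pdf", file_set: FakeFileSet.new("1234", "abcd.pdf"))
```

Injecting the collaborators keeps the wrapper testable and leaves room to tighten the interface once the method signatures are settled.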

jeremyf commented 1 year ago

I have written two sets of Gherkin-style scenarios, one for a PDF and one for a TIFF. A challenge we have is that we’re using the same SpaceStone handlers for the images of each of the scenarios. That is, the extracted page images of the PDF and the original TIFF.

This is complicated because the output files/directories are different between a PDF and a TIFF. In the case of the images for the PDF, we need to know the parent work ID, the file name, and the page number to correctly associate the generated image with its plain text, Alto XML, and word coordinates JSON. In the case of the original TIFF we are only working from the parent work ID and the file name.

At present the SpaceStone handlers and IIIF Print’s calling of the generators are responsible for correctly choosing the right location; this is done via the output and pre-processing template provided to the generators.

A fundamental challenge is that the DerivativeRodeo is agnostic about templated locations; it provides one set of functions in DerivativeRodeo::Services::ConvertUriViaTemplateService to give downstream implementations a means of assigning where we’re writing the files.

SpaceStone has resolved how it’s handling the different location templates for storing the plain text, Alto XML, and word coordinates derivatives.

Next is to resolve how IIIF Print handles this. What we will need to know is whether the given FileSet is for a page of a PDF or not; and, when it is from a PDF, what its page number is.

By convention we’ll have that page number based on how SpaceStone is writing that. That page number will be encoded in the location file name. We will likely want to consider the SpaceStone filename storage.

PDF Scenarios

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of that PDF into S3
Then it will be stored at s3://host-bucket/1234/abcd/abcd.pdf.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we split the PDF into one JPEG image per page and store in S3
Then the images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of each of the page’s images and store in S3
Then the thumbnail images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.thumbnail.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate an ALTO XML of each of the page’s images and store in S3
Then the ALTO XML will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.alto.xml

Image Scenarios

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate a thumbnail of that TIFF into S3
Then it will be stored at s3://host-bucket/1234/efgh/efgh.thumbnail.jpeg

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate an ALTO XML of that TIFF into S3
Then the ALTO XML will be stored in s3://host-bucket/1234/efgh/efgh.alto.xml
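
The difference between the two sets of scenarios can be expressed as one small helper. This is only a sketch of the storage convention described above: `derivative_location` and its keyword arguments are hypothetical names, not SpaceStone or IIIF Print APIs, and it covers only the page-level and TIFF derivatives (not the whole-PDF thumbnail). Whether page numbers start at 0 or 1 is not stated in the scenarios.

```ruby
# Hypothetical helper mirroring the scenarios' storage convention.
def derivative_location(bucket:, parent_id:, filename:, suffix:, page: nil)
  basename = File.basename(filename, ".*") # "abcd.pdf" becomes "abcd"
  prefix = "s3://#{bucket}/#{parent_id}/#{basename}"
  # An original image only needs the parent ID and the file name.
  return "#{prefix}/#{basename}.#{suffix}" if page.nil?

  # A PDF page additionally needs its page number.
  "#{prefix}/pages/#{basename}-#{page}.#{suffix}"
end

derivative_location(bucket: "host-bucket", parent_id: "1234",
                    filename: "abcd.pdf", suffix: "alto.xml", page: 1)
# => "s3://host-bucket/1234/abcd/pages/abcd-1.alto.xml"

derivative_location(bucket: "host-bucket", parent_id: "1234",
                    filename: "efgh.tiff", suffix: "thumbnail.jpeg")
# => "s3://host-bucket/1234/efgh/efgh.thumbnail.jpeg"
```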

jeremyf commented 1 year ago

Proposal:

In the DerivativeRodeo, we should be setting the output template tail for PDF pages to "#{basename}/pages/#{basename}.page-%d.#{output_extension}". This gives us higher confidence that, when we have only the filename, we can assume it to be a PDF page (and thus find all of the other files associated with that page).

https://github.com/scientist-softserv/derivative_rodeo/blob/2ca92617c29febd6be1e5c0a8c98714d4b6f482e/lib/derivative_rodeo/generators/pdf_split_generator.rb#L32-L34
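
To illustrate why the `.page-%d.` marker helps, here is a minimal sketch that builds a location tail from the proposed template and then recovers the page number from a bare filename. `page_template_tail`, `parse_page_filename`, and `PAGE_PATTERN` are hypothetical names for this example, not part of the DerivativeRodeo.

```ruby
# Matches "<basename>.page-<number>.<extension>" (the proposed convention).
PAGE_PATTERN = /\A(?<basename>.+)\.page-(?<page>\d+)\.(?<extension>.+)\z/

# Builds the proposed template tail; "%d" is filled in per page.
def page_template_tail(basename:, output_extension:)
  "#{basename}/pages/#{basename}.page-%d.#{output_extension}"
end

# Returns the parts of a PDF page filename, or nil when it is not a page.
def parse_page_filename(filename)
  match = PAGE_PATTERN.match(File.basename(filename))
  return nil unless match

  { basename: match[:basename], page: match[:page].to_i, extension: match[:extension] }
end

tail = format(page_template_tail(basename: "abcd", output_extension: "jpeg"), 2)
# tail => "abcd/pages/abcd.page-2.jpeg"
parse_page_filename(tail)            # => { basename: "abcd", page: 2, extension: "jpeg" }
parse_page_filename("efgh.alto.xml") # => nil
```

With the parsed basename and page number in hand, the sibling derivatives (thumbnail, ALTO XML, and so on) can be located by swapping the extension.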