jeremyf opened this issue 1 year ago
Sidebar: As I think about SPD, the IIIF Print gem, and SpaceStone, I’m wondering if the better approach is to move them all into IIIF Print. Then I would declare what groups to use (akin to what we do with `web` and `worker`). This does create a weird space where there could be inadvertent bleed. For now, the extraction and separation of concerns feels like the correct exploratory exercise.

As I think a bit more on this, I don’t believe merging the three gems together is the right approach, given that one is envisioned as an Engine and another as conceptual shell scripts. Conflating those three concepts seems like it would make it harder to implement towards a clean and crisp interface.
Yesterday, I brought over several classes/utilities from the IIIF Print gem. These were lower-level functions that are used by processes within the IIIF Print gem. Next, I need to bring over the remaining lower-level classes and utilities. I am presently working on bringing over PageOCR.
I have locally set up my SPD development so that I run `rubocop` and `rspec` on each commit and each push to the remote repository. I have not set up continuous integration on the remote repository.
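For reference, a minimal sketch of that local setup; the hook below is an assumption about the wiring, not part of SPD:

```ruby
#!/usr/bin/env ruby
# Hypothetical .git/hooks/pre-commit (the same idea works for pre-push):
# run the style and test suites, and abort the commit if either fails.
commands = ["bundle exec rubocop", "bundle exec rspec"]
commands.each do |command|
  next if system(command)

  warn "#{command} failed; aborting commit."
  exit 1
end
```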
I will also be bringing over the `Samvera::Derivatives` interfaces; those belong in SPD. The Hyrax-specific implementation does not. The crease I’m looking for is how to create locators for SPD and pre-processors for SPD.
The pre-processors will echo what was done in Newman Numismatic Portal. The pre-processor, in general, will receive a `derivative_type` and an `identifier`. One `derivative_type` is “original” (perhaps we should rename this). Other examples, albeit somewhat arbitrary, are `thumbnail`, `text`, and `hocr`.
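As a concrete sketch of that interface (the class name and signature below are mine, not the gem’s):

```ruby
# Hypothetical pre-processor entry point: receives a derivative_type and an
# identifier, as described above. Class and method names are assumptions.
module SpaceStone
  class PreProcessor
    # @param derivative_type [Symbol] e.g. :original, :thumbnail, :text, :hocr
    # @param identifier [String] identifies the source record
    def self.call(derivative_type:, identifier:)
      new(derivative_type: derivative_type, identifier: identifier).call
    end

    def initialize(derivative_type:, identifier:)
      @derivative_type = derivative_type
      @identifier = identifier
    end

    def call
      # fetch or generate the derivative for @identifier, then store it
    end
  end
end
```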
We will have a SpaceStone “entry”. Each project that uses SpaceStone will need to name (e.g. “thumbnail”) what derivatives it wants to generate and the function for generating that derivative from the `original`.
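As a hypothetical sketch of such an entry (the keys are my assumptions, informed by the feature description below, not the gem’s actual schema):

```ruby
# Hypothetical shape of a SpaceStone "entry": an identifier, a URL for the
# original, and URLs for any derivatives that were already generated upstream.
entry = {
  identifier: "adl:1234",
  original: "https://example.com/originals/1234.pdf",
  derivatives: {
    thumbnail: "https://example.com/derivatives/1234-thumb.png"
  }
}
```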
Below is a feature description:

```gherkin
Given an entry with the identifier
And the original (with corresponding URL)
And the "thumbnail" derivative type (with corresponding URL)
And SpaceStone is configured to generate "text" derivatives
And SpaceStone is configured to generate "thumbnail" derivatives
When the SpaceStone Lambda processes the given entry
Then SpaceStone will not generate the "thumbnail" derivative
And will fetch the "thumbnail" derivative
And store the "thumbnail" derivative
And SpaceStone will generate the "text" derivative
And SpaceStone will not fetch the "text" derivative
And SpaceStone will store the "text" derivative
```
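In code terms, that feature reduces to a fetch-or-generate decision per configured derivative type; a minimal sketch with hypothetical `fetch`/`generate`/`store` helpers:

```ruby
# Stand-in helpers for illustration only; they are not SpaceStone's API.
def fetch(url)
  "fetched #{url}"
end

def generate(type, entry)
  "generated #{type} from #{entry[:original]}"
end

def store(type, derivative)
  puts "storing #{type}: #{derivative}"
end

# For each derivative SpaceStone is configured to generate: if the entry
# already names a URL for it, fetch rather than regenerate; either way, store.
def pre_process(entry, configured_types)
  configured_types.each do |type|
    derivative =
      if (url = entry.fetch(:derivatives, {})[type])
        fetch(url)            # provided upstream; do not regenerate
      else
        generate(type, entry) # not provided; generate from the original
      end
    store(type, derivative)
  end
end
```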
I have brought in the PageOCR logic from IIIF Print. One observed problem/challenge is that it has an interface that will need reworking; the exposed public methods are a bit confusing (so I need to perform more analysis). Another challenge is that it generates three or four different derivatives that go towards text extraction.
My plan for 2023-03-29 is to disentangle the different file-creation processes so that we can use the named files provided instead of generating new ones. There is already some preliminary work.
Specific tasks are:

- `SpaceStone::Derivatives::Configuration`
- Bring over `Samvera::Derivatives` from IIIF Print (folding it into `SpaceStone::Derivatives`)

A competing priority is deploying changes to the British Library and ensuring that work is in a good state for their end-of-week priority.
In retrospect, the most critical decision I made early was to name each derivative (e.g. `:hocr`, `:text`, `:monochrome`); after all, we have a named “file” for each of those. In doing so, I have a conceptual object in which to organize my code.
Yesterday, I also turned a major corner on this project. In the morning I sat down and started writing narrative descriptions in the README of `SpaceStone::Derivatives`. This led to naming the concepts of the `SpaceStone::Derivatives::Manifest` and the `SpaceStone::Derivatives::Repository`. With those names, I had eliminated some of my mental barriers regarding the various layers of abstractions and mappings.
A second revelation came when I stepped away from the code and started verbally narrating. I had hit a minor mental block, fixating on a low-level detail and losing the thread of the larger feature requirement. The inspiration came when I named the `SpaceStone::Derivatives.pre_process_derivatives_for` method. That named method gave me a clear mental map of the process steps.
Immediately, I knew I would need to resolve the dependency graph of derivatives. I first started with a validator function that could process a hash. Then I thought about how I would perform the sequencing, which introduced the idea of the `SpaceStone::Derivatives::Chain`. I moved the validator into that class and began working on a sequencer function; likewise, the sequencer would process a hash. As I delved deeper, it became clear the Validator and Sequencer were performing duplicate logic.
I had begun stenciling in `SpaceStone::Derivatives::Types` in an effort to play with the conceptual idea. I ripped out the validator and settled on `SpaceStone::Derivatives::Chain::Sequencer`; again, something I could test with a Hash that had symbol keys and values that were arrays of symbols.
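For illustration, a minimal sketch of such a sequencer; a depth-first topological sort is my assumption here, not necessarily the gem’s actual implementation:

```ruby
# Sequence a dependency Hash (symbol keys, arrays of symbols as values,
# where each value lists a key's prerequisites) into a processing order.
def sequence(dependencies)
  sequenced = []
  visiting  = []
  visit = lambda do |type|
    next if sequenced.include?(type)
    raise "cycle detected at #{type}" if visiting.include?(type)

    visiting << type
    Array(dependencies[type]).each { |dependency| visit.call(dependency) }
    visiting.delete(type)
    sequenced << type
  end
  dependencies.each_key { |type| visit.call(type) }
  sequenced
end

sequence(original: [], monochrome: [:original], hocr: [:monochrome], text: [:hocr])
# => [:original, :monochrome, :hocr, :text]
```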
With the sequence of derivative generation resolved, I set about the conceptual `SpaceStone::Derivatives::Processor`. It began its life named “PreProcessor”, but as I was writing the documentation, I wrote “send the `pre_process!` message to each of the types in the chain.” With the word “message”, I realized I could use a dispatching strategy (e.g. `send(message, repository: repository)`).
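As a rough sketch of that dispatching strategy (the class body below is my assumption, built around the `send(message, ...)` idea):

```ruby
# Hypothetical processor: dispatch a configurable message to each type in
# the chain rather than hard-coding a method call.
class Processor
  def initialize(message: :pre_process!)
    @message = message
  end

  def call(chain:, repository:)
    chain.each { |type| type.public_send(@message, repository: repository) }
  end
end
```

Swapping the message (e.g. `:generate!`) would reuse the same traversal, which is the appeal of dispatching.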
A key design consideration has been quick and easy testing. And during the day’s development, I refactored the method signatures a few times. Each time spending a few minutes changing and running tests.
For 2023-03-29 I plan to look into the following:

- `SpaceStone::Derivatives::Types`.
- The `Manifest::FileLocationSet` as a named parameter. I’ll need to play with that a bit.
- More important is getting the entire pre-processing ready to run via SpaceStone proper (e.g. AWS Lambda).
As an added benefit, I believe that `SpaceStone::Derivatives` is almost certainly vendor-agnostic (managed instead by the yet-to-be-made repository file-storage strategy).
I continue to rely on local tests, both style guides and rspec. These run each time I commit code and each time I push code to GitHub.
Inspired by LeaAnn’s “Project and Task Breakdown” presentation, I wanted to write up the task breakdown/algorithm:
For the pre-processing in AWS:
What is the file “handle”? Perhaps the path name.
I also say “AWS” but this is really the pre-processing environment; a “loading dock” if you will.
In the above case, we only want to verify that we have a “handle”. If the “handle” does not exist, that is an error in processing the manifest. Put another way, once we’ve processed the manifest, we need to audit its integrity.
Thoughts from 2023-04-05:
I have a working proof of concept for monochrome and hocr. Now I need to look into PDF splitting. I start with an original file that is a PDF.
I want to make a thumbnail of the PDF. I also want to split the PDF. When I split the PDF, I’m probably going to create a manifest for each of the pages. And then feed those manifests to the processor.
I likely want the original PDF and the split files to be in a similar location (for easier finding). What would that look like?
```
/path/to/:parent_id/:original_file/<original>
/path/to/:parent_id/:original_file/pdf_split/:index/<image>
/path/to/:parent_id/:original_file/pdf_split/:index/<monochrome>
/path/to/:parent_id/:original_file/pdf_split/:index/<hocr>
```
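A hypothetical helper for producing those paths (the layout comes from above; the method itself is an assumption):

```ruby
require "pathname"

# Build a storage path per the layout above: split-page derivatives nest
# under pdf_split/:index beneath the parent and original file.
def derivative_path(parent_id:, original_file:, filename:, index: nil)
  base = Pathname("/path/to").join(parent_id, original_file)
  base = base.join("pdf_split", index.to_s) if index
  base.join(filename).to_s
end

derivative_path(parent_id: "123", original_file: "issue.pdf", filename: "original.pdf")
# => "/path/to/123/issue.pdf/original.pdf"
derivative_path(parent_id: "123", original_file: "issue.pdf", index: 0, filename: "page.hocr")
# => "/path/to/123/issue.pdf/pdf_split/0/page.hocr"
```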
As written, I do not have a consistent, predictable temporary-directory creation process. The input for derivatives is:
I realized that I need to lean into the `Chain` concept. Namely, because of the async nature of AWS, I need to process the chain as follows:

```gherkin
Given a chain of <A>, <B>, <C>, and <D>
When I "schedule" <A>, I need to provide the chain
Then as part of completing <A>, it "schedules" <B>
```
In the above example, let’s say that `<B>` is the `:split_pdf`. It is responsible for launching the “sub-processes” of `:ocr`. An assumption is that the given `Chain` creates the files that later links are dependent on. In other words, the sub-processes of `:ocr` are not dependencies of the above `<C>` or `<D>`. And the sibling processes of split pages are not dependent on each other.
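A sketch of that completion/scheduling behavior, under the stated assumptions (the `Scheduler` class and its in-memory queue are stand-ins, not SpaceStone’s implementation):

```ruby
# Async chain completion: finishing one link schedules the next, and
# :split_pdf fans out an independent :ocr sub-chain per page. Siblings do
# not depend on each other, and later links (<C>, <D>) do not depend on them.
class Scheduler
  def initialize
    @queue = [] # stand-in for an AWS queue
  end

  def schedule(chain:, context: {})
    @queue << { chain: chain, context: context }
  end

  # Called when the current link finishes its own work.
  def complete(link:, remaining:, pages: [])
    if link == :split_pdf
      pages.each_with_index do |page, index|
        schedule(chain: [:ocr], context: { page: page, index: index })
      end
    end
    schedule(chain: remaining) unless remaining.empty?
  end
end
```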
Ideally, we don’t need to notify the parent `<B>` that all children are done. Due to convention, `<B>` might want to write its manifest of indices that it wants to generate.
```gherkin
Given a parent process <B>
And the child chain <Ba>, <Bb>, <Bc>
And the children <0>, <1>, <2>
When I "schedule" <0>, I need to provide the chain
Then as part of completing <0>'s <Ba>, it "schedules" <Bb> for <0>
```
Critical in this is that once I start processing an “original” manifest, I need to preserve the storage “handles” for both the “local” and the “remote”. Those handles, along with the chain, help the processing locate either pre-existing files or fetch from a common remote location.
I also need the “processing queue” to provide an in-line option or to send things to AWS’s SQS.
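A sketch of what those two queue options might look like; the adapter classes are my own, and only `Aws::SQS::Client#send_message` is the real aws-sdk-sqs API:

```ruby
require "json"
require "aws-sdk-sqs" # only needed for the SQS-backed adapter

# In-line adapter: process immediately, in-band (handy for development
# and tests). The processor stands in for whatever consumes the payload.
class InlineQueue
  def initialize(processor:)
    @processor = processor
  end

  def enqueue(payload)
    @processor.call(payload)
  end
end

# SQS-backed adapter: defer the work to AWS.
class SqsQueue
  def initialize(queue_url:, client: Aws::SQS::Client.new)
    @queue_url = queue_url
    @client = client
  end

  def enqueue(payload)
    @client.send_message(queue_url: @queue_url, message_body: JSON.generate(payload))
  end
end
```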
We need to ensure that our conversation with Katharine is “encoded” in the CSV logic. We have PDF and we have Image as the original file. We can also ignore the Periodical images et al. (because those images are the representative image, which Hyku does not account for).
Our plan is to add the archival, reader, and txt as three FileSets on a work. For the archival file we will pre-process derivative generation. For the reader we only want thumbnails. And for the text we can use `Hyrax::FileSetDerivativesService` as written.

We need to somehow communicate that the reader fileset neither splits nor generates many other derivatives.
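One hypothetical way to encode that per-FileSet behavior (the keys and chains below are illustrative, not settled configuration):

```ruby
# Map each FileSet "use" to its derivative behavior, so the reader fileset
# is explicitly limited to a thumbnail and never splits.
DERIVATIVES_BY_FILE_SET_USE = {
  archival: [:split_pdf, :ocr, :thumbnail], # pre-processed out of band
  reader: [:thumbnail],                     # no splitting; thumbnail only
  text: :hyrax_default                      # Hyrax::FileSetDerivativesService as written
}.freeze
```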
This was pushed up by Rob and merged by Jeremy; he’s working through the logic and how we’ll map accordingly. We can likely re-use the thumbnail for the reader; this might mean copying the original thumbnail twice.
We need to deploy the latest SpaceStone. Jeremy needs to be able to do this, as does Kirk. We will first try Kirk, as he has successfully deployed once, and we likely won’t need more deploys of SpaceStone.
We want to review the SpaceStone S3 buckets to ensure they are structured as intended.
With the latest changes, which are pending review but will be merged by EoD Tuesday, we are prepared to update Adventist to use the Derivative Rodeo enabled IIIF Print.
We need to update Adventist’s IIIF Print gem to get the Derivative Rodeo; this will help inform how we post to SpaceStone (as there’s a configuration assumption which is marked as a TODO).
Goals
The goal of this epic is to adjust the derivative “creation” process. At present Hyrax creates a `FileSet` for each file added to a work, either via the UI or via batch processing. We then process each original file of a `FileSet`, creating derivative files that we attach to that `FileSet`. This is all done “in-band”, which can be non-performant for large imports. To speed up the imports we can:

- add more resources to the import system
- perform “out of band” processing to generate derivatives

By introducing the idea of “out of band” processing, we break the fundamental assumption of Hyrax’s derivative generation: namely, that it will take a `FileSet` and create all the necessary derivatives. Instead, with pre-processing, we are now saying “There may already exist a derivative for this `FileSet`; attach that instead.” Instead of having one conceptual “Create Derivatives” function we are looking to have three:

- **Pre-processing:** doing out-of-band work to create derivatives.
- **Locating:** finding the existing derivatives (or possibly indicating that we’d create new ones).
- **Applying:** taking the found derivative and adding it to the correct `FileSet`.

We have a “prior art” instance of *Pre-processing* in the [SpaceStone](https://github.com/scientist-softserv/space_stone) Ruby gem. That gem is code that runs in AWS Lambdas to pull data from Internet Archive (per the client’s previous storage), split apart the PDFs, create OCR derivatives of each page, and create thumbnails. Further, we have “prior art” for *Locating* and *Applying* `.hocr` files in [NNP’s Tesseract module](https://github.com/scientist-softserv/nnp/blob/78206122b9796a79f07f349ef27babc167006f6d/app/services/tesseract.rb), responsible for first looking for a remote `.hocr` and, failing that, generating the `.hocr` file. We have some logic for finding the *Pre-processing* files in [NNP’s DistributedRemoteDownloader](https://github.com/scientist-softserv/nnp/blob/78206122b9796a79f07f349ef27babc167006f6d/app/services/distributed_remote_downloader.rb). Those are specific implementations that demonstrate some of the necessary concepts. However, they differ from the immediate needs of Adventist and the general needs of other clients.
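To make the split concrete, here is a minimal sketch of *Locating* and *Applying*, echoing the NNP Tesseract pattern of checking for an existing remote derivative before generating; every name below is an illustrative assumption, not Hyrax’s API:

```ruby
# Stand-ins: locate returns a pre-processed derivative (or nil on a miss),
# generate falls back to in-band creation, apply attaches to the FileSet.
def locate(file_set, type)
  nil
end

def generate(file_set, type)
  "generated #{type}"
end

def apply(file_set, type, file)
  # attach the derivative file to the FileSet
end

# Locate first; only generate on a miss; apply whichever we ended up with.
def create_derivatives(file_set, types)
  types.each do |type|
    file = locate(file_set, type) || generate(file_set, type)
    apply(file_set, type, file)
  end
end
```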
Scenarios

In terms of *Pre-Processing* we need the following:

```gherkin
Scenario: Pre-processing
  Given an identifier for a FileSet
  And a URL to the original file
  And the file is a PDF
  When I pass the identifier and URL to an AWS lambda
  Then the Pre-processer Lambda will create one
```

Pre-Process Tasks
Adventist S3 Convention: there’s a unique AARK for each work. Write files there.
Ingest Task
Task Scratch Pad
Not all of these will be converted to tasks; they instead represent a current working understanding. Once we create a task from a checkmark, then we’re looking more at actionable tasks.
…(`hocr`, `thumbnail`, `splits`, etc.) and the identifier.