scientist-softserv / adventist_knapsack


EPIC: Import Optimization with Out of Band Processing #421

Open jeremyf opened 1 year ago

jeremyf commented 1 year ago
Goals

The goal of this epic is to adjust the derivative “creation” process. At present, Hyrax creates a `FileSet` for each file added to a work, either via the UI or via batch processing. We then process each original file of a `FileSet`, creating derivative files that we attach to that `FileSet`. This is all done “in-band”, which can be non-performant for large imports. To speed up imports we can:

- add more resources to the import system
- perform “out of band” processing to generate derivatives

By introducing the idea of “out of band” processing, we break the fundamental assumption of Hyrax’s derivative generation: namely, that it will take a `FileSet` and create all the necessary derivatives. Instead, with pre-processing, we are now saying “there may already exist a derivative for this `FileSet`; attach that instead.”

Instead of having one conceptual “Create Derivatives” function, we are looking to have three:

- **Pre-processing:** doing out-of-band work to create derivatives.
- **Locating:** finding the existing derivatives (or possibly indicating that we’d create new ones).
- **Applying:** taking the found derivative and adding it to the correct `FileSet`.

We have a “prior art” instance of *Pre-processing* in the [SpaceStone](https://github.com/scientist-softserv/space_stone) Ruby gem. That gem is code that runs in AWS Lambdas to pull data from Internet Archive (per the client’s previous storage), split apart the PDFs, create OCR derivatives of each page, and create thumbnails. Further, we have “prior art” for *Locating* and *Applying* `.hocr` files in [NNP’s Tesseract module](https://github.com/scientist-softserv/nnp/blob/78206122b9796a79f07f349ef27babc167006f6d/app/services/tesseract.rb), which is responsible for first looking for a remote `.hocr` and, failing that, generating the `.hocr` file. We have some logic for finding the *Pre-processing* files in [NNP’s DistributedRemoteDownloader](https://github.com/scientist-softserv/nnp/blob/78206122b9796a79f07f349ef27babc167006f6d/app/services/distributed_remote_downloader.rb).

Those are specific implementations that demonstrate some of the necessary concepts. However, they differ from the immediate needs of Adventist and the general needs of other clients.
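To make the split concrete, here is a minimal, self-contained Ruby sketch of the *Locating* and *Applying* halves. Every name here (the class, the `:use_default_generation` marker, the registry hash) is hypothetical and only illustrates the idea; none of it is Hyrax API.

```ruby
# Hypothetical model of the Locating/Applying split described above.
# Pre-processing is assumed to have happened elsewhere and populated the registry.
class DerivativeFlow
  def initialize(pre_processed: {})
    # e.g. { ["fs-1", :thumbnail] => "s3://bucket/fs-1/thumbnail.png" }
    @pre_processed = pre_processed
  end

  # Locating: return an existing location, or a marker that means
  # "fall back to the default Hyrax derivative generation".
  def locate(file_set_id, name)
    @pre_processed.fetch([file_set_id, name], :use_default_generation)
  end

  # Applying: attach the found derivative, or generate it in-band.
  def apply(file_set_id, name)
    location = locate(file_set_id, name)
    if location == :use_default_generation
      "generated #{name} for #{file_set_id} in-band"
    else
      "attached #{name} to #{file_set_id} from #{location}"
    end
  end
end

flow = DerivativeFlow.new(pre_processed: { ["fs-1", :thumbnail] => "s3://bucket/fs-1/thumbnail.png" })
puts flow.apply("fs-1", :thumbnail) # uses the pre-processed file
puts flow.apply("fs-1", :hocr)      # falls back to in-band generation
```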
Scenarios

In terms of *Pre-Processing* we need the following:

```gherkin
Scenario: Pre-processing
  Given an identifier for a FileSet
  And a URL to the original file
  And the file is a PDF
  When I pass the identifier and URL to an AWS lambda
  Then the Pre-processor Lambda will create one image per PDF page
  And the Pre-processor Lambda will create one HOCR per PDF page
  And the Pre-processor Lambda will create one Thumbnail per PDF page
```

In terms of *Locating* we need the following:

```gherkin
Scenario: Locating derivatives that were pre-processed
  Given an identifier for a FileSet
  And the Pre-processor Lambda has processed the FileSet
  When I attempt to locate the derivatives for the split pages
  Then I will get the location of the S3 files

Scenario: Locating derivatives that were not pre-processed
  Given an identifier for a FileSet
  And the Pre-processor Lambda has not processed the FileSet
  When I attempt to locate the derivatives for the split pages
  Then I will not get a location of the S3 files
  And I will get a “location” that will indicate to use the default Hyrax derivative generation
```

In terms of *Applying* we need the following:

```gherkin
Scenario: Applying derivatives that were pre-processed
  Given an identifier for a FileSet
  And the Pre-processor Lambda has processed the FileSet
  When I attach the located derivatives for the split images
  Then I will fetch those derivatives from S3
  And attach those derivatives to the FileSet

Scenario: Applying derivatives that were not pre-processed
  Given an identifier for a FileSet
  And the Pre-processor Lambda has not processed the FileSet
  When I attach the located derivatives for the split images
  Then I will generate those derivatives
  And attach those derivatives to the FileSet
```

One consideration is that Hyrax has implicit derivatives that are named concepts; for the above features we need to expose those named concepts. Namely, how do I locate and apply the thing called `thumbnail`?

Pre-Process Tasks

Adventist S3 Convention: there’s a unique AARK for each work. Write files there.

Ingest Task

Task Scratch Pad

Not all of these will be converted to tasks; they instead represent a current working understanding. Once we create a task from a checkmark, then we’re looking more at actionable tasks.

jeremyf commented 1 year ago

Sidebar: As I think about SPD, the IIIF Print gem, and SpaceStone, I’m wondering if the better approach is to move them all into IIIF Print. Then I would declare what groups to use (akin to what we do with web and worker). This does create a weird space where there could be inadvertent bleed. For now, the extraction and separation of concerns feels like the correct exploratory exercise.

As I think a bit more on this, I don’t believe merging the three gems together is the right approach, given that one is envisioned as an Engine and the other as conceptual shell scripts. Conflating those three concepts seems like it would make it harder to implement towards a clean and crisp interface.

Yesterday, I brought over several classes/utilities from the IIIF Print gem. These were lower-level functions that are used by processes within the IIIF Print gem. I need to bring over the remaining lower-level classes and utilities. I am presently working on bringing over PageOCR.

I have set up my local SPD development so that I run rubocop and rspec on each commit and each push to the remote repository. I have not set up continuous integration on the remote repository.

I will also be bringing over the Samvera::Derivatives interfaces; those belong in SPD. The Hyrax-specific implementation does not. The crease I’m looking for is how to create locators for SPD and pre-processors for SPD.

The pre-processors will echo what was done in the Newman Numismatic Portal. A pre-processor, in general, will receive a derivative_type and an identifier.

One derivative_type is “original” (perhaps we should rename this). Other examples, albeit somewhat arbitrary, are thumbnail, text, and hocr.

We will have a SpaceStone “entry” that looks like the following:
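As a purely illustrative sketch (the field names below are assumptions based on the feature description further down, not SpaceStone’s actual entry format), an entry might carry an identifier, the original URL, and any derivative URLs that already exist:

```ruby
# Purely illustrative; the keys are assumptions, not SpaceStone's real format.
entry = {
  identifier: "aark-12345",                         # work / FileSet identifier
  original: "https://example.org/files/issue.pdf",  # URL of the original file
  derivatives: {
    # a derivative that already exists remotely and only needs fetching:
    thumbnail: "https://example.org/derivatives/issue-thumbnail.png"
    # derivative types without a URL (e.g. :text) would be generated instead
  }
}
```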

Each project that uses SpaceStone will need to name (e.g. “thumbnail”) what derivatives it wants to generate and the function for generating that derivative from the original.

Below is a feature description:

Given an entry with the identifier
  And the original (with corresponding URL)
  And the "thumbnail" derivative type (with corresponding URL)
  And SpaceStone is configured to generate "text" derivatives
  And SpaceStone is configured to generate "thumbnail" derivatives
When the SpaceStone Lambda processes the given entry
Then SpaceStone will not generate the "thumbnail" derivative
  And will fetch the "thumbnail" derivative
  And store the "thumbnail" derivative
  And SpaceStone will generate the "text" derivative
  And SpaceStone will not fetch the "text" derivative
  And SpaceStone will store the "text" derivative

Related to:

jeremyf commented 1 year ago

I have brought in the PageOCR logic from IIIF Print. One observed problem/challenge is that it has an interface that will need reworking; the exposed public methods are a bit confusing (so I need to perform more analysis). Another challenge is that it generates three or four different derivatives that go towards text extraction.

My plan for 2023-03-29 is to disentangle the different file creation processes so that we can use the named files provided instead of generating them. There is some preliminary work.

Specific tasks are:

  1. Create a SpaceStone::Derivatives::Configuration
    • Needs additional Tesseract command-line options, used for specifying different trained data sets (see the sketch after this list).
  2. Bring in some of Samvera::Derivatives from IIIF Print (folding it into SpaceStone::Derivatives)
  3. Disentangle the “hocr” file creation process so that it can use an existing “hocr” file.
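As a hedged sketch of task 1 (the attribute and method names here are assumptions, not the gem’s actual interface), the configuration could expose the extra Tesseract options like so:

```ruby
# Hedged sketch only; SpaceStone::Derivatives::Configuration may end up with
# a different interface. The attribute name is an assumption.
module SpaceStone
  module Derivatives
    class Configuration
      # Extra command-line options handed to tesseract, e.g. to point at a
      # different trained data set.
      attr_accessor :tesseract_additional_options

      def initialize
        @tesseract_additional_options = nil
      end
    end

    def self.config
      @config ||= Configuration.new
      yield(@config) if block_given?
      @config
    end
  end
end

# Usage sketch:
SpaceStone::Derivatives.config do |c|
  c.tesseract_additional_options = "--tessdata-dir /data/tessdata_best -l eng"
end
```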

A competing priority is deploying changes to the British Library and ensuring that it is in a good state for their end-of-week priority.

jeremyf commented 1 year ago

With a bit of retrospection, the most critical decision I made early on was to name each derivative (e.g. :hocr, :text, :monochrome); after all, we have a named “file” for each of those. In doing so, I have a conceptual object in which to organize my code.

Yesterday, I also turned a major corner on this project. In the morning I sat down and started writing narrative descriptions in the README of SpaceStone::Derivatives.

This led to naming the concepts of the SpaceStone::Derivatives::Manifest and the SpaceStone::Derivatives::Repository. With those names, I had eliminated some of my mental barriers regarding the various layers of abstractions and mappings.

A second revelation came when I stepped away from the code and started verbally narrating. I had hit a minor mental block, fixating on a low-level detail and losing the thread of the larger feature requirement. The inspiration came when I named the SpaceStone::Derivatives.pre_process_derivatives_for method.

That named method gave me a clear mental map of the process steps.

Immediately, I knew I would need to resolve the dependency graph of derivatives. I first started with a validator function that could process a hash.

Then I thought about how I would perform the sequencing, which introduced the idea of the SpaceStone::Derivatives::Chain. I moved the validator into that class and began working on a sequencer function; likewise, the sequencer would process a hash. As I delved deeper, I found that the Validator and Sequencer were performing duplicate logic.

I had begun stenciling in the SpaceStone::Derivatives::Types in an effort to play with the conceptual idea.

I ripped out the validator and settled on SpaceStone::Derivatives::Chain::Sequencer; again something I could test with a Hash that had symbol keys and values that were arrays of symbols.
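A minimal sketch of that kind of sequencing, assuming a plain hash of symbol keys to arrays of dependency symbols; this illustrates the idea, it is not the actual SpaceStone::Derivatives::Chain::Sequencer:

```ruby
# Illustration only: resolve a hash of symbol keys (derivative names) to
# arrays of symbols (their dependencies) into a processing order.
def sequence(dependencies)
  ordered  = []
  visiting = []
  visit = lambda do |name|
    next if ordered.include?(name)
    raise "circular dependency at #{name}" if visiting.include?(name)
    visiting << name
    dependencies.fetch(name, []).each { |dependency| visit.call(dependency) }
    visiting.delete(name)
    ordered << name
  end
  dependencies.each_key { |name| visit.call(name) }
  ordered
end

puts sequence({ hocr: [:monochrome], monochrome: [:original], text: [:hocr], original: [] }).inspect
# => [:original, :monochrome, :hocr, :text]
```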

With the sequence of derivative generation resolved, I set about the conceptual SpaceStone::Derivative::Processor. It began its life named “PreProcessor”, but as I was writing the documentation, I wrote “send the pre_process! message to each of the types in the chain.” With the word “message”, I realized I could use a dispatching strategy (e.g. send(message, repository: repository)).
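A tiny sketch of that dispatching idea (the type classes and the repository string are hypothetical):

```ruby
# The processor sends one message to each type in the chain.
class HocrType
  def pre_process!(repository:)
    puts "pre-processing hocr against #{repository}"
  end
end

class MonochromeType
  def pre_process!(repository:)
    puts "pre-processing monochrome against #{repository}"
  end
end

message = :pre_process!
[HocrType.new, MonochromeType.new].each do |type|
  type.send(message, repository: "repository-1")
end
```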

A key design consideration has been quick and easy testing. During the day’s development, I refactored the method signatures a few times, each time spending a few minutes changing and running tests.

For 2023-03-29 I plan to look into the following:

More important is getting the entire pre-processing pipeline ready to run via SpaceStone proper (e.g. AWS Lambda).

As an added benefit, I believe that SpaceStone::Derivatives is almost certainly vendor agnostic (managed instead by the yet-to-be-made repository file storage strategy).

I continue to rely on local tests, both style guides (rubocop) and rspec. These run each time I commit code and each time I push code to GitHub.

jeremyf commented 1 year ago

Inspired by LeaAnn’s “Project and Task Breakdown” presentation, I wanted to write up the task breakdown/algorithm:

For the pre-processing in AWS:

  1. Check if the file exists in the expected AWS location. If it does, return a “handle” to it.
  2. Else, if it doesn’t and the manifest says it has a remote URL, attempt to GET it.
    • On a 404, log a warning and return “nil”
    • On a 2xx, copy it into the expected location, and return the “handle”
    • On any other status, log an error and raise an exception.
  3. Else, if it can’t be remotely fetched, attempt to Generate it.
    • On a failure to generate, log an error and raise an exception.
    • On a success but there’s no file, log an error and raise an exception.
    • On a success with a file, move the file to the expected location and return the “handle”.

What is the file “handle”? Perhaps the path name.
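A hedged sketch of that decision tree, treating the “handle” as a pathname. The method, its parameters, and the generator block are hypothetical; the HTTP handling is illustrative only.

```ruby
require "fileutils"
require "net/http"
require "uri"

# Sketch of the three steps above; the "handle" is a pathname and the
# generator is a stand-in for whatever generation step applies.
def resolve_handle(expected_path:, remote_url: nil, generator: nil)
  return expected_path if File.exist?(expected_path)        # 1. already in place

  if remote_url                                             # 2. try to GET it
    response = Net::HTTP.get_response(URI(remote_url))
    case response
    when Net::HTTPSuccess
      FileUtils.mkdir_p(File.dirname(expected_path))
      File.binwrite(expected_path, response.body)
      expected_path
    when Net::HTTPNotFound
      warn "remote file not found at #{remote_url}"
      nil
    else
      raise "unexpected response #{response.code} for #{remote_url}"
    end
  else                                                      # 3. generate it
    generated = generator&.call
    raise "generation failed for #{expected_path}" unless generated && File.exist?(generated)
    FileUtils.mkdir_p(File.dirname(expected_path))
    FileUtils.mv(generated, expected_path)
    expected_path
  end
end
```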

I also say “AWS” but this is really the pre-processing environment; a “loading dock” if you will.

In the above case, we only want to verify that we have a “handle”. If the “handle” does not exist, that is an error in processing the manifest. Put another way, once we’ve processed the manifest, we need to audit its integrity.

jeremyf commented 1 year ago

Thoughts from 2023-04-05:

I have a working proof of concept for monochrome and hocr. Now I need to look into PDF splitting. I start with an original file that is a PDF.

I want to make a thumbnail of the PDF. I also want to split the PDF. When I split the PDF, I’m probably going to create a manifest for each of the pages. And then feed those manifests to the processor.
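As a rough illustration of the per-page manifest idea (the field names and the enqueue step are assumptions, not existing code):

```ruby
# One "manifest" per split page, each handed on to a processor/queue.
split_pages = ["page-0001.tiff", "page-0002.tiff"] # stand-in for real PDF splitting output

page_manifests = split_pages.each_with_index.map do |path, index|
  { parent_id: "aark-123", original: "issue.pdf", index: index, page_path: path }
end

page_manifests.each { |manifest| puts "would enqueue #{manifest.inspect}" }
```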

I likely want the original PDF and the split files to be in a similar location (for easier finding). What would that look like?

    /path/to/:parent_id/:original_file/<original>
    /path/to/:parent_id/:original_file/pdf_split/:index/<image>
    /path/to/:parent_id/:original_file/pdf_split/:index/<monochrome>
    /path/to/:parent_id/:original_file/pdf_split/:index/<hocr>

As written, I do not have a consistent, predictable temporary directory creation process. The input for derivatives is (a path-building sketch follows this list):

  1. Parent ID
  2. Original Filename
  3. Original URL
  4. URL
  5. Working directory…if it exists, use that, otherwise, create a new one and assign.
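A sketch of a predictable path builder following the layout above; this is purely illustrative (the method name, parameters, and the use of the system temp dir as a root are all assumptions):

```ruby
require "fileutils"
require "tmpdir"

# Illustrative path builder for /:parent_id/:original_file/pdf_split/:index/<derivative>
def working_path_for(parent_id:, original_filename:, index: nil, derivative: nil, root: Dir.tmpdir)
  parts = [root, parent_id, File.basename(original_filename)]
  parts += ["pdf_split", index.to_s] unless index.nil?
  directory = File.join(*parts)
  FileUtils.mkdir_p(directory)
  derivative ? File.join(directory, derivative.to_s) : directory
end

puts working_path_for(parent_id: "aark-123", original_filename: "issue.pdf")
# e.g. /tmp/aark-123/issue.pdf
puts working_path_for(parent_id: "aark-123", original_filename: "issue.pdf", index: 2, derivative: "hocr")
# e.g. /tmp/aark-123/issue.pdf/pdf_split/2/hocr
```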
jeremyf commented 1 year ago

I realized that I need to lean into the Chain concept. Namely, because of the async nature of AWS, I need to process the chain as follows:

Given a chain of <A>, <B>, <C>, and <D>
When I “schedule” <A>, I need to provide the chain.
Then as part of completing <A>, it “schedules” <B>.

In the above example, let’s say that <B> is the :split_pdf step. It is responsible for launching the “sub-processes” of :ocr. An assumption is that the given Chain creates the files that later links depend on. In other words, the sub-processes of :ocr are not dependents of the above <C> nor <D>, and the sibling processes of split pages are not dependent on each other.

Ideally, we don’t need to notify the parent <B> that all children are done. Due to convention, <B> might want to write its manifest of the indices that it wants to generate.

Given a parent process <B>
And the child chain <Ba>, <Bb>, <Bc>
And the children <0>, <1>, <2>
When I “schedule” <0>, I need to provide the chain.
Then as part of completing <0>’s <Ba>, it “schedules” <Bb> for <0>.

Critical in this is that once I start processing an “original” manifest, I need to preserve the storage “handles” for both the “local” and the “remote”. Those handles, along with the chain, help the processing locate either pre-existing files or fetch them from a common remote location.

I also need the “processing queue” to provide an in-line option or to send things to AWS’s SQS.
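A sketch of what that swappable queue could look like, with an in-line adapter and an SQS adapter. All class names here are hypothetical; the SQS call assumes the aws-sdk-sqs gem, AWS credentials, and an existing queue URL.

```ruby
require "json"

# Hypothetical stand-in for whatever processes a manifest payload.
module Processor
  def self.call(payload)
    puts "processing #{payload.inspect}"
  end
end

# In-line adapter: process immediately, useful for local development and tests.
class InlineQueue
  def enqueue(payload)
    Processor.call(payload)
  end
end

# SQS adapter: pushes the payload onto an AWS SQS queue instead.
class SqsQueue
  def initialize(queue_url:)
    require "aws-sdk-sqs" # assumes the aws-sdk-sqs gem is available
    @client = Aws::SQS::Client.new
    @queue_url = queue_url
  end

  def enqueue(payload)
    @client.send_message(queue_url: @queue_url, message_body: JSON.generate(payload))
  end
end

InlineQueue.new.enqueue(identifier: "aark-123", derivative: :hocr)
```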

jeremyf commented 1 year ago

June 5 to June 9 Sprint Adventist Tasks

OAI -> CSV

We need to ensure that our conversation with Katharine is “encoded” in the CSV logic. We have PDF and we have Image as the original file. We can also ignore the Periodical images et al. (because those images are the representative image, which Hyku does not account for).

Our plan is to add the archival, reader, and txt files as three FileSets on a work. For the archival FileSet we will pre-process derivative generation. For the reader we only want thumbnails. And for the text we can use Hyrax::FileSetDerivativesService as written.

We need to somehow communicate that the reader FileSet does not get split and does not have many other derivatives.
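One possible way to “communicate” that, sketched as a plain lookup table; the keys and structure are assumptions for illustration, not Adventist’s actual configuration:

```ruby
# Assumed mapping from FileSet "use" to how its derivatives should be handled.
DERIVATIVE_TREATMENTS = {
  archival: { split_pdf: true,  pre_processed: true,  derivatives: %i[image monochrome hocr thumbnail] },
  reader:   { split_pdf: false, pre_processed: true,  derivatives: %i[thumbnail] },
  text:     { split_pdf: false, pre_processed: false, derivatives: :hyrax_default }
}.freeze

puts DERIVATIVE_TREATMENTS.fetch(:reader).inspect # reader: thumbnail only, no splitting
```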

CSV -> SpaceStone

This was pushed up by Rob and merged by Jeremy; he’s working through the logic and how we’ll map accordingly. We can likely re-use the thumbnail for the reader; this might mean copying the original thumbnail twice.

SpaceStone does its thing

We need to deploy the latest SpaceStone. Jeremy needs to be able to do this, as does Kirk. First we’ll try Kirk, as he has successfully deployed once, and we likely won’t need more deploys of SpaceStone.

We want to review the SpaceStone S3 buckets to ensure they are structured as intended.

IIIF Print pulls from SpaceStone

With the latest changes, which are pending review but will be merged by EoD Tuesday, we are prepared to update Adventist to use the Derivative Rodeo enabled IIIF Print.

Configure Adventist’s DerivativeRodeo

We need to update Adventist’s IIIF Print gem to get the Derivative Rodeo; this will help inform how we post to SpaceStone (as there’s a configuration assumption which is marked as a TODO).