Create script to find failed split jobs

ShanaLMoore commented 10 months ago

Eventually we will need to find all of the works that failed to split PDFs, and re-run them.

We will need a script/query to accomplish this.

For each model that must split PDFs: when the work has an attached PDF file set and does not have child works, we need to resubmit it.

The query looks something like: Select all works that have one or more file_sets with mime_type application/pdf and have no child works. It may be that we can first apply a quicker filter for all works that don't have children.
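As a rough illustration of that selection, here is a minimal plain-Ruby sketch. `WorkStub`, `FileSetStub`, and `works_needing_split` are hypothetical names invented for this example; a real script would load works from Solr or the repository rather than from in-memory structs.

```ruby
# Hypothetical in-memory stand-ins for works and file sets (assumption:
# the real script would query Solr or the repository instead).
FileSetStub = Struct.new(:id, :mime_type, keyword_init: true)
WorkStub = Struct.new(:id, :file_sets, :child_work_ids, keyword_init: true)

PDF_MIME_TYPE = "application/pdf"

# Returns the works that have no child works and at least one PDF file set.
def works_needing_split(works)
  works
    .select { |work| work.child_work_ids.empty? } # the "quicker filter" goes first
    .select { |work| work.file_sets.any? { |fs| fs.mime_type == PDF_MIME_TYPE } }
end

sample_pdf = FileSetStub.new(id: "fs1", mime_type: PDF_MIME_TYPE)
sample_works = [
  WorkStub.new(id: "w1", file_sets: [sample_pdf], child_work_ids: []),
  WorkStub.new(id: "w2", file_sets: [sample_pdf], child_work_ids: ["c1"])
]
puts works_needing_split(sample_works).map(&:id).inspect # prints ["w1"]
```

Running the childless filter first mirrors the suggestion above that it may be the cheaper check.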

We won't run this script until all of the records have been ingested.

scientist-softserv/adventist_knapsack#224

jeremyf commented 10 months ago

Option 1:

Find all documents that have the mime_type_ssi field with value “application/pdf” and whose id field’s value is not found in the split_from_pdf_id_tsi field of any document in Solr.

In discussion with @laritakr, this will not work because the field is a late addition.
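For reference, the Option 1 lookup amounts to a set difference. The sketch below uses plain hashes as stand-ins for Solr documents; `unsplit_pdf_ids` is a hypothetical helper name, and only the field names come from the comment above. It also suggests why a late-added field is a problem: presumably, children indexed before the field existed would not carry it, so their source PDFs would be reported as unsplit.

```ruby
require "set"

# docs: array of hashes standing in for Solr documents (assumption).
# Returns ids of PDF documents that no other document claims as its split source.
def unsplit_pdf_ids(docs)
  split_sources = docs.map { |doc| doc["split_from_pdf_id_tsi"] }.compact.to_set

  docs
    .select { |doc| doc["mime_type_ssi"] == "application/pdf" }
    .map { |doc| doc["id"] }
    .reject { |id| split_sources.include?(id) }
end

docs = [
  { "id" => "pdf1", "mime_type_ssi" => "application/pdf" },
  { "id" => "pdf2", "mime_type_ssi" => "application/pdf" },
  { "id" => "page1", "split_from_pdf_id_tsi" => "pdf1" }
]
puts unsplit_pdf_ids(docs).inspect # prints ["pdf2"]
```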

Option 2:

Find all FileSets that have mime_type_ssi of "application/pdf" OR have a label_ssi that ends with ".pdf" (consider the archival versus reader paradigm). We then submit a job for each FileSet.
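A minimal sketch of that FileSet filter, written as a plain-Ruby predicate over hash stand-ins for Solr documents. `candidate_file_set?` and `CANDIDATE_QUERY` are hypothetical names; the query string assumes the model is indexed under `has_model_ssim` and that a leading wildcard on `label_ssi` is acceptable in this Solr setup, neither of which is confirmed here.

```ruby
# Assumed Solr query equivalent; field names other than mime_type_ssi and
# label_ssi are not taken from this thread.
CANDIDATE_QUERY =
  'has_model_ssim:FileSet AND (mime_type_ssi:"application/pdf" OR label_ssi:*.pdf)'

# True when the document looks like a PDF by mime type or by file name.
# The label check covers file sets whose mime type was never recorded.
def candidate_file_set?(doc)
  doc["mime_type_ssi"] == "application/pdf" ||
    doc["label_ssi"].to_s.end_with?(".pdf")
end

file_sets = [
  { "id" => "fs1", "mime_type_ssi" => "application/pdf", "label_ssi" => "a.pdf" },
  { "id" => "fs2", "label_ssi" => "b.pdf" },
  { "id" => "fs3", "mime_type_ssi" => "image/tiff", "label_ssi" => "c.tiff" }
]
puts file_sets.select { |doc| candidate_file_set?(doc) }.map { |doc| doc["id"] }.inspect
```

The label match is case-sensitive as written, so a file named with an uppercase `.PDF` extension would need an extra check.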

The job is then responsible for determining if it should re-split.

This job will:

- Delete all IiifPrint::PendingRelationship records for the file set's parent work.
- When the file set's parent work has child works, return early.
- When the file set does not have a mime_type, we likely failed to attach the file and need to perform a re-ingest of the files/work.
  - see scientist-softserv/adventist_knapsack#214
- When the file set does have a mime_type, we likely have a file and need to submit the IiifPrint::Jobs::RequestSplitPdfJob.
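The decision part of that job can be sketched as a small pure function. `resplit_action` and its return symbols are hypothetical names for this example; a real job would first delete the IiifPrint::PendingRelationship records as described above, then act on the result (re-ingest, or enqueue IiifPrint::Jobs::RequestSplitPdfJob), and neither side effect is shown here.

```ruby
# Decides what to do for one file set. This covers only the branching logic;
# the deletion of pending relationships and the job submission are left out.
def resplit_action(parent_has_children:, mime_type:)
  # The parent already has child works, so there is nothing to redo.
  return :skip if parent_has_children

  # No mime type suggests the file never attached; the work needs a re-ingest.
  return :reingest if mime_type.nil? || mime_type.empty?

  # A file is present but was never split; request the split again.
  :request_split
end

puts resplit_action(parent_has_children: true, mime_type: "application/pdf")  # prints skip
puts resplit_action(parent_has_children: false, mime_type: nil)               # prints reingest
puts resplit_action(parent_has_children: false, mime_type: "application/pdf") # prints request_split
```

Keeping the branching separate from the side effects would make it possible to dry-run the script and count how many file sets fall into each bucket before submitting anything.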

jeremyf commented 10 months ago

We can't really "QA" this without unleashing it. It's a bit of a one-time (but let's be honest we'll use it often) task.

scientist-softserv / adventist_knapsack

Create script to find failed split jobs #218