Closed by ShanaLMoore 10 months ago
Option 1:
Find all documents that have the mime_type_ssi field with value "application/pdf" and whose id field's value is not found in the split_from_pdf_id_tsi field of any document in Solr.
After discussion with @laritakr: this will not work, because the field is a late addition.
Option 2:
Find all FileSets that have a mime_type_ssi of "application/pdf" OR have a label_ssi that ends with ".pdf" (consider the archival versus reader paradigm). We then submit a job for each FileSet.
The job, IiifPrint::Jobs::RequestSplitPdfJob, is then responsible for determining whether it should re-split.
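A minimal sketch of the Option 2 selection, in plain Ruby so it can be read without a running Hyrax stack. The has_model_ssim clause, the method names, and the enqueue callable are assumptions made for illustration; the arguments that IiifPrint::Jobs::RequestSplitPdfJob expects are not stated here, so the sketch does not call it directly.

```ruby
# Sketch only: plain Ruby, no Solr connection or Hyrax classes required.
PDF_MIME_TYPE = "application/pdf"

# Solr query string for candidate FileSets.
# ASSUMPTION: FileSets are indexed with has_model_ssim:"FileSet".
# The leading-wildcard clause on label_ssi may be slow on a large index.
def pdf_file_set_query
  %(has_model_ssim:"FileSet" AND (mime_type_ssi:"#{PDF_MIME_TYPE}" OR label_ssi:*.pdf))
end

# The same predicate applied to a Solr document that is already in memory.
# Like the query above, the label match is case-sensitive.
def pdf_candidate?(doc)
  doc["mime_type_ssi"] == PDF_MIME_TYPE ||
    doc["label_ssi"].to_s.end_with?(".pdf")
end

# Calls the (hypothetical) enqueue callable once per candidate FileSet id
# and returns the ids that were submitted.
def submit_split_requests(docs, enqueue:)
  docs.select { |doc| pdf_candidate?(doc) }
      .each { |doc| enqueue.call(doc["id"]) }
      .map { |doc| doc["id"] }
end
```

In an application, the enqueue callable would wrap however the split job is actually enqueued; the decision about whether to re-split stays inside the job, as described above.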
We can't really "QA" this without unleashing it. It's nominally a one-time task (but let's be honest, we'll use it often).
Eventually we will need to find all of the works that failed to split PDFs and re-run them.
We will need a script or query to accomplish this.
For each model that must split PDFs: when a work has an attached PDF file set and does not have child works, we need to resubmit it.
The query looks something like: select all works that have one or more file_sets with mime_type application/pdf and have no child works. It may be that we can first apply a quicker filter for all works that don't have children.
We won't run this script until all of the records have been ingested.
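The selection rule above can be sketched in plain Ruby. The Work struct and its attribute names are hypothetical stand-ins for whatever the real work models expose; the sketch only illustrates the rule, with the cheaper "no children" check applied first.

```ruby
# Sketch only: Work is a hypothetical stand-in, not a Hyrax model.
Work = Struct.new(:id, :file_set_mime_types, :child_work_ids, keyword_init: true)

# A work needs resubmitting when it has no child works and at least one
# attached file set with a PDF mime type.
def needs_resplit?(work)
  # Quick filter first: works that already have children are skipped.
  return false unless work.child_work_ids.empty?

  work.file_set_mime_types.include?("application/pdf")
end

# Returns the ids of all works that should be resubmitted for splitting.
def works_to_resubmit(works)
  works.select { |work| needs_resplit?(work) }.map(&:id)
end
```

For example, a work with one PDF file set and no children is selected, while a work whose PDF has already produced child works is not.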