scientist-softserv / adventist_knapsack

Create script to find failed split jobs #218

ShanaLMoore closed 10 months ago

ShanaLMoore commented 10 months ago

Eventually we will need to find all of the works that failed to split their PDFs and rerun them.

We will need a script/query to accomplish this.

For each model that must split PDFs: when a work has an attached PDF file set and no child works, we need to resubmit it.

The query looks something like: select all works that have one or more file_sets with mime_type application/pdf and that have no child works. We may be able to apply a quicker filter first, narrowing to all works that don't have children.
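A minimal sketch of that selection logic, in Python over plain dictionaries rather than the application's real models. The keys `file_sets`, `mime_type`, and `child_ids` are hypothetical stand-ins for however works, file sets, and children are actually exposed; the point is only the order of the checks (the cheap "no children" filter first, then the PDF test).

```python
from typing import Iterable

PDF_MIME_TYPE = "application/pdf"


def works_needing_resplit(works: Iterable[dict]) -> list[dict]:
    """Return works that have a PDF file set but no child works.

    Each work is assumed (hypothetically) to look like:
    {"id": str, "child_ids": [str, ...], "file_sets": [{"mime_type": str}, ...]}
    """
    candidates = []
    for work in works:
        # Quick filter: a work that already has children was split successfully.
        if work.get("child_ids"):
            continue
        # Keep the work only if at least one attached file set is a PDF.
        if any(fs.get("mime_type") == PDF_MIME_TYPE for fs in work.get("file_sets", [])):
            candidates.append(work)
    return candidates


sample_works = [
    {"id": "w1", "child_ids": ["c1", "c2"], "file_sets": [{"mime_type": "application/pdf"}]},
    {"id": "w2", "child_ids": [], "file_sets": [{"mime_type": "application/pdf"}]},
    {"id": "w3", "child_ids": [], "file_sets": [{"mime_type": "image/tiff"}]},
]

resplit_ids = [work["id"] for work in works_needing_resplit(sample_works)]
print(resplit_ids)  # prints ['w2']
```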

We won't run this script until all of the records have been ingested.

jeremyf commented 10 months ago

Option 1:

Find all documents that have the mime_type_ssi field with value "application/pdf" and whose id value is not found in the split_from_pdf_id_tsi field of any document in Solr.

In discussion with @laritakr, this will not work because the split_from_pdf_id_tsi field is a late addition.
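For reference, Option 1 amounts to an anti-join over the Solr documents. The sketch below runs it in Python over already-fetched documents represented as dictionaries; fetching them from Solr is out of scope here, and the sample data is made up. It also shows the likely failure mode: a child indexed without split_from_pdf_id_tsi (presumably the case for anything indexed before the field was added) makes its source PDF look unsplit.

```python
def unsplit_pdf_ids(docs: list[dict]) -> list[str]:
    """Return ids of PDF documents that no other document claims to be split from."""
    # Every id that some document names as its split source.
    split_sources = {
        doc["split_from_pdf_id_tsi"] for doc in docs if doc.get("split_from_pdf_id_tsi")
    }
    # PDF documents whose id never appears as a split source.
    return [
        doc["id"]
        for doc in docs
        if doc.get("mime_type_ssi") == "application/pdf" and doc["id"] not in split_sources
    ]


sample_docs = [
    {"id": "pdf-a", "mime_type_ssi": "application/pdf"},
    {"id": "child-a1", "split_from_pdf_id_tsi": "pdf-a"},
    {"id": "pdf-b", "mime_type_ssi": "application/pdf"},
    # Child of pdf-b indexed without the field: pdf-b is wrongly reported.
    {"id": "child-b1"},
    {"id": "pdf-c", "mime_type_ssi": "application/pdf"},
]

print(unsplit_pdf_ids(sample_docs))  # prints ['pdf-b', 'pdf-c']
```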

Option 2:

Find all FileSets that have mime_type_ssi of "application/pdf" OR have a label_ssi that ends with ".pdf" (consider the archival versus reader paradigm). We then submit a job for each FileSet.

The job is then responsible for determining if it should re-split.
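A rough sketch of Option 2's selection and submission step, again in Python over dictionaries. The `enqueue` callable is a hypothetical stand-in for whatever submits the real job, and matching the ".pdf" suffix case-insensitively is an assumption rather than something stated above. The decision about whether to actually re-split is deliberately left to the job itself.

```python
from typing import Callable, Iterable


def is_pdf_file_set(doc: dict) -> bool:
    """True when a FileSet document is a PDF by mime type or by label suffix."""
    if doc.get("mime_type_ssi") == "application/pdf":
        return True
    # Assumption: treat ".PDF" and ".pdf" alike.
    return (doc.get("label_ssi") or "").lower().endswith(".pdf")


def submit_resplit_jobs(
    file_set_docs: Iterable[dict], enqueue: Callable[[str], None]
) -> list[str]:
    """Enqueue one job per matching FileSet and return the submitted ids."""
    submitted = []
    for doc in file_set_docs:
        if is_pdf_file_set(doc):
            enqueue(doc["id"])  # hypothetical job submission hook
            submitted.append(doc["id"])
    return submitted


sample_file_sets = [
    {"id": "fs1", "mime_type_ssi": "application/pdf", "label_ssi": "report.pdf"},
    {"id": "fs2", "mime_type_ssi": "application/octet-stream", "label_ssi": "scan.reader.pdf"},
    {"id": "fs3", "mime_type_ssi": "image/tiff", "label_ssi": "page-1.tiff"},
]

queue: list[str] = []
print(submit_resplit_jobs(sample_file_sets, queue.append))  # prints ['fs1', 'fs2']
```

Passing the submission hook in as a parameter means the selection can be dry-run against a list before any real jobs are unleashed.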

jeremyf commented 10 months ago

We can't really "QA" this without unleashing it. It's a bit of a one-time task (but let's be honest, we'll use it often).