scientist-softserv / adventist_knapsack

Apache License 2.0

Create Job that finds and re-ingests non-characterized PDF files #214

Open jeremyf opened 10 months ago

jeremyf commented 10 months ago

We're noticing two problems with PDF ingests:

  1. PDF files are not being split
  2. PDF files are not being attached to the FileSet; we have a label but don't have characterization information

The former, not being split, is addressed in scientist-softserv/adventist_knapsack#218 and scientist-softserv/adventist-dl#689. However, we also want to consider those situations where we did not characterize the file, perhaps because it wasn't attached.

We'll need to look for some of the latter situations and determine how we might remedy the non-attached and/or non-characterized files.

Consider that the parent work has an AARK_ID, which we could use to re-fetch the file. It also likely has a Bulkrax::Entry (or two or three) that we could use to run a re-ingest of the work.

A better solution came from Rob.

The goal is for these FileSets without mime_types to:

  1. Have the correct mime_type
  2. Have the original file attached
  3. Trigger split jobs

Related to:

laritakr commented 10 months ago

Rob was able to attach a missing file to a file_set using the following code. We should be able to plug this into the process to handle the cases where the PDF is missing (except using perform_later instead of perform_now). This should trigger all of the subsequent splitting jobs as well.

    operation = Hyrax::Operation.create!(user: user, operation_type: "Attach Remote File")
    ImportUrlJob.perform_now(file_set, operation)

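
One way the snippet above might be plugged into a batch process is sketched below. This is only an illustration under assumptions, not the code that was merged: the class name ReattachMissingFiles, its two callable parameters, and the "blank mime_type" check are all hypothetical. The Hyrax calls are injected as lambdas so the selection logic can be read and exercised outside the application.

```ruby
# Hypothetical sketch. ReattachMissingFiles, needs_reattach?, and the
# create_operation/enqueue parameters are illustrative names, not part of
# Hyrax or this repository.
class ReattachMissingFiles
  # create_operation: callable taking a user and returning an operation
  # enqueue:          callable taking (file_set, operation)
  def initialize(create_operation:, enqueue:)
    @create_operation = create_operation
    @enqueue = enqueue
  end

  # Assumption: a file set counts as non-characterized when its
  # mime_type is nil or blank.
  def needs_reattach?(file_set)
    file_set.mime_type.to_s.strip.empty?
  end

  # Enqueues one re-import per non-characterized file set and returns
  # the file sets that were selected.
  def call(file_sets, user:)
    file_sets.select { |fs| needs_reattach?(fs) }.each do |file_set|
      operation = @create_operation.call(user)
      @enqueue.call(file_set, operation)
    end
  end
end

# Inside the application, the callables would presumably wrap the calls
# from the comment above, with perform_later in place of perform_now:
#
#   ReattachMissingFiles.new(
#     create_operation: ->(user) {
#       Hyrax::Operation.create!(user: user, operation_type: "Attach Remote File")
#     },
#     enqueue: ->(file_set, operation) {
#       ImportUrlJob.perform_later(file_set, operation)
#     }
#   )
```

How the candidate file sets are found (for example, a search for FileSets with no mime_type) is left open here, since the thread does not specify the query.
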
laritakr commented 9 months ago

Additional cleanup unrelated to PDF splitting: