sciencehistory / scihist_digicoll

Science History Institute Digital Collections

PdfCreationFailure on book with all jpeg originals. (May not be a bug.) #2391

Closed honeybadger[bot] closed 10 months ago

honeybadger[bot] commented 11 months ago

https://digital.sciencehistory.org/admin/works/44f4e82#tab=nav-members consists of all jpegs, due to a digitization error. The error message below is helpful, in the sense that there are in fact no suitable images in the work.

[scihist_digicoll/production] WorkPdfCreator2::PdfCreationFailure: WorkPdfCreator2: No PDF files to join; are there no suitable images in work? work: 44f4e82; total_page_count: 0

Backtrace

line 109 of [PROJECT_ROOT]/app/services/work_pdf_creator2.rb: block in write_pdf_to_path
line 87 of [PROJECT_ROOT]/app/services/work_pdf_creator2.rb: write_pdf_to_path
line 43 of [PROJECT_ROOT]/app/services/work_pdf_creator2.rb: create

View full backtrace and more info at honeybadger.io

jrochkind commented 11 months ago

"The error message below is helpful, in the sense that there are in fact no suitable images in the work."

Well, there's no theoretical reason we can't make a PDF from JPEGs, if that's what's there. That would be one path, though I'm not sure how challenging it would be.

We currently do things like base calculations on the DPI recorded in the TIFF, which I guess probably isn't present in a JPEG, but we already have code that defaults to a guessed DPI in that case.
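
To illustrate the DPI fallback mentioned above, here is a minimal sketch, not the project's actual code. The method name, the keyword arguments, and especially the default of 300 DPI are assumptions made for illustration; the thread only says that a guessed DPI is used, not what its value is.

```ruby
# Hypothetical default: the thread does not state the real guessed value.
ASSUMED_DEFAULT_DPI = 300.0
# PDF page dimensions are measured in points, 72 per inch.
POINTS_PER_INCH = 72.0

# Convert pixel dimensions into PDF page dimensions (in points).
# Falls back to the assumed default when no usable DPI is available,
# as might be the case for a JPEG without resolution metadata.
def page_size_in_points(width_px:, height_px:, dpi: nil)
  effective_dpi = (dpi && dpi > 0) ? dpi.to_f : ASSUMED_DEFAULT_DPI
  [
    width_px / effective_dpi * POINTS_PER_INCH,
    height_px / effective_dpi * POINTS_PER_INCH
  ]
end

page_size_in_points(width_px: 3000, height_px: 2400, dpi: 600) # => [360.0, 288.0]
page_size_in_points(width_px: 3000, height_px: 2400)           # => [720.0, 576.0]
```

A wrong guess only changes the physical page size recorded in the PDF, not the pixel content.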

I'm not totally sure why it couldn't create a PDF from JPEGs, or how hard it would be to fix. It may just be that this was a "legacy" file that we had to create our new derivatives for, and the routine I ran to create them only looked for content-type image/tiff! If so, it would work fine for new ingests.

So if we wanted to make PDF generation work for JPEG originals, this could be anywhere from trivial to challenging to resolve.

Of course, I don't think we intend to have JPEG originals anyway?

apinkney0696 commented 11 months ago

Please see if there are any other records that don't have TIFFs. There should never be JPGs in the DC for collection items. Sometimes I process JPGs for RR requests, so likely what happened here is that I forgot to change the processing recipe when I exported these images.

jrochkind commented 11 months ago

We create a single-page PDF as a derivative, which is then used in the PDF generation process.

We (I) actually configured it to only be created for TIFFs!

https://github.com/sciencehistory/scihist_digicoll/blob/4a4cd792352dd87e8f37be1120eeb3f9cdb1b75c/app/uploaders/asset_uploader.rb#L92-L94

So that's one reason this failed. If we wanted PDFs to work for JPEG originals, it might be easy to fix, though there might be other issues once we get past this one.
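
As a rough illustration of the kind of content-type guard described above, here is a sketch. It is not the code at the linked lines of asset_uploader.rb; the constant and method names are hypothetical, and only the TIFF-only behavior comes from this thread.

```ruby
# Hypothetical allow-list: per the thread, only TIFF originals currently
# get the single-page PDF derivative.
PDF_DERIVATIVE_CONTENT_TYPES = ["image/tiff"].freeze

# Returns true when an original of this content type would get the
# single-page PDF derivative under the given allow-list.
def single_page_pdf_derivative?(content_type, allowed: PDF_DERIVATIVE_CONTENT_TYPES)
  allowed.include?(content_type.to_s.downcase)
end

single_page_pdf_derivative?("image/tiff") # => true
single_page_pdf_derivative?("image/jpeg") # => false

# Supporting JPEG originals would, under this sketch, mean widening the list:
single_page_pdf_derivative?("image/jpeg", allowed: ["image/tiff", "image/jpeg"]) # => true
```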

apinkney0696 commented 11 months ago

We do not want JPG originals.

jrochkind commented 11 months ago

@apinkney0696 asked to investigate how many other similar works/assets there might be.

We do have cases where a JPG asset is expected (Collection thumbs; Oral History portraits). So we have to do a bit of filtering.

There are 283 total JPEG assets whose parents are Works. Let's identify the works from them... that's still 146 unique works. Let's eliminate the OH Works...

OK, now just 7 works. They do all seem to have a similar issue.

https://digital.sciencehistory.org/admin/works/hdliweg#tab=nav-members
https://digital.sciencehistory.org/admin/works/xdtke7a#tab=nav-members
https://digital.sciencehistory.org/admin/works/pb9z78x#tab=nav-members
https://digital.sciencehistory.org/admin/works/qp04gli#tab=nav-members
https://digital.sciencehistory.org/admin/works/44f4e82#tab=nav-members
https://digital.sciencehistory.org/admin/works/btl9jr0#tab=nav-members
https://digital.sciencehistory.org/admin/works/oggbhcf#tab=nav-members
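
The filtering steps above can be sketched in plain Ruby. This is only an illustration of the logic, not the query that was actually run: the `Asset` and `WorkRecord` structs, their fields, and the `oral_history` flag are hypothetical stand-ins for the application's real models.

```ruby
# Hypothetical stand-ins for the real models.
Asset = Struct.new(:id, :content_type, :parent_work_id, keyword_init: true)
WorkRecord = Struct.new(:id, :oral_history, keyword_init: true)

# Returns IDs of works that contain JPEG assets where none are expected.
def works_with_unexpected_jpegs(assets, works_by_id)
  assets
    .select { |a| a.content_type == "image/jpeg" && a.parent_work_id } # JPEGs whose parent is a work
    .map(&:parent_work_id)
    .uniq                                                              # one entry per work
    .reject { |work_id| works_by_id.fetch(work_id).oral_history }      # OH portraits are expected
end

assets = [
  Asset.new(id: "a1", content_type: "image/jpeg", parent_work_id: "w1"),
  Asset.new(id: "a2", content_type: "image/jpeg", parent_work_id: "w1"),
  Asset.new(id: "a3", content_type: "image/tiff", parent_work_id: "w2"),
  Asset.new(id: "a4", content_type: "image/jpeg", parent_work_id: "oh1"),
  Asset.new(id: "a5", content_type: "image/jpeg", parent_work_id: nil) # e.g. a collection thumb
]
works_by_id = {
  "w1"  => WorkRecord.new(id: "w1", oral_history: false),
  "w2"  => WorkRecord.new(id: "w2", oral_history: false),
  "oh1" => WorkRecord.new(id: "oh1", oral_history: true)
}

works_with_unexpected_jpegs(assets, works_by_id) # => ["w1"]
```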

apinkney0696 commented 11 months ago

Thank you! All from the same shooting session. I will fix these asap.

apinkney0696 commented 11 months ago

Hi! I just processed all the raw files into TIFs and they are currently ingesting into the S3 bucket. Would it be easy for one of you to mass remove all of the current jpg assets of the above records?

apinkney0696 commented 11 months ago

Also shout out to the EXIF metadata that told me the date the images were created. Made it so easy for me to find my session file!

jrochkind commented 11 months ago

@apinkney0696 Sure! Should I mark the works "private" first, so they don't appear publicly without any images?

And then just remove all members from those works?

Glad the surfaced metadata proved useful so quickly!

apinkney0696 commented 11 months ago

I believe I've already made them all private, but if I missed one please do. And yes, all jpg members please (which should be everything I think) and leave the metadata. Thanks a bunch!

jrochkind commented 11 months ago

@apinkney0696

Done.

I also notice the new "Searchable PDF" link is showing up even when there are no images, but of course if you click on it you get an error. I'll see if I can fix that maybe.

For the record, I ran this code to do it:

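# Friendlier IDs of the seven affected works listed earlier in this thread.
["hdliweg", "xdtke7a", "pb9z78x", "qp04gli", "44f4e82", "btl9jr0", "oggbhcf"].each do |work_id|
  work = Work.find_by_friendlier_id(work_id)
  next unless work # skip any ID that does not resolve to a work

  # Mark the work unpublished so it is not public without images (persisted by save! below).
  work.published = false
  # Remove every member; for these works all members were the unwanted JPEGs.
  work.members.destroy_all
  work.save!
end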