mnylc / islandora_multi_importer

This is a flexible, twig based, all cmodel, tabular data to islandora Object importer with optional ZeroMQ processing
GNU General Public License v3.0

BookCModel and PDFs #119

Open kromabiles opened 4 years ago

kromabiles commented 4 years ago

Hello Diego,

As you know, our IR is currently exploring ways to use the IMI to ingest/create BookCModel objects from PDFs. Since the IMI is the main tool we rely on for ingesting content into Islandora, could we explore and test some options for a simpler PDF-to-TIFF conversion capability? Some of the PDF objects we'd like to ingest as books have 50+ pages, which would need to be split into pages and converted from PDF to TIFF. My brain hurts.

More than happy to bounce off ideas and do some testing with you. :)

Best, Katie

DiegoPino commented 4 years ago

@kromabiles great. Following up here. A few questions about this:

The processing would need to happen during ingest (batch), or it would be just too slow... we need to test.

What is your largest PDF around there?

Secondly, I will also enable a digital object with the same content directly on play.archipelago.nyc so we can test performance and compare.

Thanks!

kromabiles commented 4 years ago

@DiegoPino Yes, extracting PDFs into TIFFs would be great. Our book collections don't have any page-level metadata; each one is structured as a single object description. :/

Our largest PDF is about 3GB and consists of 93 pages (yearbook).

Seeing Archipelago in action sounds exciting! :)

DiegoPino commented 4 years ago

Excellent. I will start planning. I will probably borrow the book module settings, but I feel I should go TIFF first and then compress to JP2 if needed. I just tested a JP2 generated by Islandora (core) and it was 25 MB in size, while the same image as a TIFF was 10 MB, which was a little bit annoying!
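
To make the TIFF-first idea concrete, here is a minimal sketch of splitting a PDF into one TIFF per page. It is not the IMI implementation: it assumes Ghostscript (`gs`) is available on the server, the helper name `ghostscript_tiff_command` is hypothetical, and the 300 dpi resolution and `tiff24nc` (uncompressed 24-bit color) device are illustrative choices rather than settings taken from the book module.

```python
from pathlib import Path


def ghostscript_tiff_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that writes one TIFF per PDF page.

    Hypothetical helper for illustration only; it builds the argument
    list but does not run Ghostscript itself.
    """
    pdf_path = Path(pdf_path)
    # %04d is expanded by Ghostscript to the page number (0001, 0002, ...).
    out_pattern = Path(out_dir) / f"{pdf_path.stem}-%04d.tif"
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input is processed
        "-dSAFER",            # restrict file access during rendering
        "-sDEVICE=tiff24nc",  # assumption: uncompressed 24-bit RGB TIFF
        f"-r{dpi}",           # assumption: render resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]
```

With Ghostscript installed, the list could be passed to `subprocess.run(cmd, check=True)` once per PDF during a batch, and the resulting page files converted to JP2 afterwards if needed. For a very large PDF, rendering time and disk usage would have to be measured before settling on a resolution.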

DiegoPino commented 4 years ago

@kromabiles sorry for the slowness, I have a solution! But it requires some testing and planning. Give me until the end of the week to enable it in our sandbox, and I will give you credentials there. I will also copy your templates and prepare a spreadsheet test case, but it would be even better if you have a few PDFs in a zip and a demo spreadsheet around.

kromabiles commented 4 years ago

No worries! Thanks, Diego - files are too big to attach here, so I'll send them over to you via email.