BookCModel and PDFs - Githubissues

mnylc / islandora_multi_importer

This is a flexible, Twig-based, all-CModel, tabular-data-to-Islandora-object importer with optional ZeroMQ processing

GNU General Public License v3.0


BookCModel and PDFs #119

Open kromabiles opened 4 years ago

kromabiles commented 4 years ago

Hello Diego,

As you know, our IR is currently exploring ways to use the IMI to ingest/create BookCModel objects from PDFs. Since the IMI is the main tool we rely on for ingesting content into Islandora, could we explore/test some possible options/solutions for implementing a simpler PDF-to-TIFF image conversion capability? Some of the PDF objects that we'd like to ingest as books have 50+ pages, which would then need to be split up and converted from PDFs to TIFFs. My brain hurts.

More than happy to bounce off ideas and do some testing with you. :)

Best, Katie

DiegoPino commented 4 years ago

@kromabiles great. Following up here. A few questions about this:

Are you good with IMI extracting PDFs into TIFFs? Or do you want IMI to use the same config the Book reader already uses?
How do we deal with page level metadata? I see we have two options (both could be implemented)
- You actually create the rows for each page, and I find some clever UI way of letting IMI know it should only fill the OBJ from the parent column PDF (extracted as TIFF). This also means that if you add 10 rows instead of, e.g., the 100 pages the PDF contains, it would only ingest 10.
- You add nothing. In that case, IMI will create the most basic metadata for you, basically just the title and the page number.
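
To make the two options concrete, here is a minimal Python sketch of the row logic. It is not IMI code: `build_page_rows`, the `page` and `title` keys, and the generated title format are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Optional


def build_page_rows(
    parent_title: str,
    pdf_page_count: int,
    supplied_rows: Optional[List[Dict]] = None,
) -> List[Dict]:
    """Return one metadata row per page to ingest (hypothetical helper)."""
    if supplied_rows:
        # Option 1: the spreadsheet already has page rows, so only those
        # pages are ingested, even if the PDF contains more.
        rows = []
        for row in supplied_rows:
            page = int(row["page"])
            if not 1 <= page <= pdf_page_count:
                raise ValueError(f"page {page} is not in the PDF")
            rows.append({**row, "page": page})
        return rows
    # Option 2: nothing supplied, so generate the most basic metadata,
    # just a title and the page number (the title format is an assumption).
    return [
        {"title": f"{parent_title}, page {n}", "page": n}
        for n in range(1, pdf_page_count + 1)
    ]
```

With no supplied rows, a 93-page PDF yields 93 generated rows; with 10 supplied rows against a 100-page PDF, only those 10 come back.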

Processing of this would need to actually happen during ingest (batch) or it would be just too slow... we need to test.

What is your largest PDF around there?

Secondly, I will also enable a Digital object with the same directly on play.archipelago.nyc so we can test performance and compare.

Thanks!

kromabiles commented 4 years ago

@DiegoPino Yes, extracting PDFs into TIFFs would be great. Our book collections don't have any page-level metadata; they're all structured as a single object description. :/

Our largest PDF is about 3GB and consists of 93 pages (yearbook).

Seeing Archipelago in action sounds exciting! :)

DiegoPino commented 4 years ago

Excellent. I will start planning. I will probably borrow the book module settings, but I feel I should go TIFF first and then compress to JP2 if needed. I just tested a JP2 generated by Islandora (core) and it was 25 MB in size; the same TIFF was 10 MB, which was a little bit annoying!
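
As a rough illustration of the "TIFF first" step, the sketch below builds a Ghostscript command line that renders every page of a PDF to its own TIFF. Using Ghostscript at all is an assumption (the thread does not say which tool IMI or the book module uses), and the function name, the 300 dpi default, and the output file pattern are hypothetical. The optional JP2 compression is a separate step and is not shown.

```python
from pathlib import Path
from typing import List


def ghostscript_tiff_command(pdf_path: str, out_dir: str, dpi: int = 300) -> List[str]:
    """Build (but do not run) a Ghostscript call that writes one TIFF per page."""
    # %04d is expanded by Ghostscript to the page number: page-0001.tif, ...
    out_pattern = str(Path(out_dir) / "page-%04d.tif")
    return [
        "gs",
        "-dNOPAUSE",          # do not prompt between pages
        "-dBATCH",            # exit once the last page is rendered
        "-dSAFER",            # restrict file access while interpreting the PDF
        "-sDEVICE=tiff24nc",  # 24-bit RGB TIFF output device
        f"-r{dpi}",           # render resolution in dpi (assumed default)
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


# To actually run it, pass the list to subprocess.run(cmd, check=True).
```

Building the argument list separately from running it keeps the settings easy to inspect and to adjust per collection before a large batch is started.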

DiegoPino commented 4 years ago

@kromabiles sorry for the slowness, I have a solution! But it requires some testing and planning. Give me until the end of the week to enable it in our sandbox and I will give you credentials there. I will also copy your Templates and prepare a spreadsheet test case, but even better if you have a few PDFs in a zip and a demo spreadsheet around.

kromabiles commented 4 years ago

No worries! Thanks, Diego - files are too big to attach here, so I'll send them over to you via email.