uml-digitalinitiatives / multipage_to_book_batch_converter

Create a directory hierarchy suitable for ingesting via islandora_book_batch from one (or several) multi-page PDFs or Tiffs

Summary

This script takes one or more multi-page PDFs or Tiffs and generates the directory structure necessary to ingest each of them into an Islandora instance as a book object.

It assumes that your source objects contain the entirety of a single book.

Installation

This script requires Python 3.

  1. Clone this repository
  2. Install dependencies: pip (or pip3) install -r requirements.txt
  3. Run

Running this script requires the existence of:

in the working PATH.

It also needs:

If you specify the --skip-derivatives option, neither is required.

multipage2book.py

This is the main script which does the bulk of the work in generating your book object.

The script takes a file or a directory of files. For each file it creates a directory with a clean version of the file name: spaces are replaced by underscores and _dir is appended.

ie. The Heart of the Continent.tiff --> The_Heart_of_the_Continent_dir
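
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour (drop the extension, replace spaces with underscores, append _dir); the function name book_dir_name is hypothetical and the real script may clean other characters as well.

```python
from pathlib import Path


def book_dir_name(source_file: str) -> str:
    """Build a book directory name from a source file name.

    Hypothetical helper: drops the extension, replaces spaces with
    underscores and appends "_dir", as described in this README.
    """
    stem = Path(source_file).stem  # file name without its final extension
    return stem.replace(" ", "_") + "_dir"


print(book_dir_name("The Heart of the Continent.tiff"))
# prints: The_Heart_of_the_Continent_dir
```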

If you provide the --mods-dir option, it should point to a directory containing MODS files with the same name as the source file but with a .mods extension. (ie. The_Heart_of_the_Continent.mods).

Note: You can alter the MODS file extension with the --mods-extension argument.

If you don't provide a --mods-dir option but your files argument is a directory, then that same directory will be checked for MODS files.
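
As a rough sketch of the lookup described above, the following Python shows one way the MODS file for a source file could be located. The function names are hypothetical, and matching on the exact source file name (rather than a cleaned version of it) is an assumption; check the script itself for the precise rule.

```python
from pathlib import Path
from typing import Optional


def mods_search_dir(files_arg: Path, mods_dir: Optional[Path]) -> Optional[Path]:
    """Pick the directory to search for MODS files (hypothetical helper)."""
    if mods_dir is not None:
        return mods_dir        # an explicit --mods-dir wins
    if files_arg.is_dir():
        return files_arg       # otherwise fall back to the input directory
    return None                # single file and no --mods-dir: nothing to search


def find_mods_file(source_file: Path, search_dir: Optional[Path],
                   mods_extension: str = "mods") -> Optional[Path]:
    """Return the MODS file matching source_file, or None if there is none."""
    if search_dir is None:
        return None
    # Assumption: same base name as the source file, with the MODS extension.
    candidate = search_dir / f"{source_file.stem}.{mods_extension.lstrip('.')}"
    return candidate if candidate.is_file() else None
```

With an INPUT directory holding MyPDF.pdf and MyPDF.xml, calling find_mods_file(Path("INPUT/MyPDF.pdf"), Path("INPUT"), "xml") would return INPUT/MyPDF.xml, while the default extension would find nothing.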

Configuration options

Running multipage2book.py with the -h or --help argument prints a description of the available options.

usage: multipage2book.py [-h] [--password PASSWORD] [--overwrite] [--language LANGUAGE] [--resolution RESOLUTION] [--use-hocr] [--mods-dir MODS_DIR] [--mods-extension MODS_EXTENSION]
                         [--output-dir OUTPUT_DIR] [--merge] [--skip-derivatives] [--skip-hocr-ocr] [--skip-jp2] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                         files

Turn a PDF/Tiff or set of PDFs/Tiffs into properly formatted directories for Islandora Book Batch.

positional arguments:
  files                 A file or directory of files to process.

optional arguments:
  -h, --help            show this help message and exit
  --password PASSWORD   Password to use when parsing PDFs.
  --overwrite           Overwrite any existing Tiff/PDF/OCR/Hocr files with new copies.
  --language LANGUAGE   Language of the source material, used for OCRing. Defaults to eng.
  --resolution RESOLUTION
                        Resolution of the source material, used when generating Tiff. Defaults to 300.
  --use-hocr            Generate OCR by stripping HTML characters from HOCR, otherwise run tesseract a second time. Defaults to use tesseract.
  --mods-dir MODS_DIR   Directory of files with a matching name but with the extension "mods" to be added to the books.
  --mods-extension MODS_EXTENSION
                        The extension of the MODS files existing in the above directory. Files are matched based on filename but with this extension. Defaults to 'mods'
  --output-dir OUTPUT_DIR
                        Directory to build books in, defaults to current directory.
  --merge               Files that have the same name but with a numeric suffix are considered the same book and directories are merged. (ie. MyBook1.pdf and MyBook2.pdf)
  --skip-derivatives    Only split the source file into the separate pages and directories, don't generate derivatives.
  --skip-hocr-ocr       Do not generate OCR/HOCR datastreams, this cannot be used with --skip-derivatives
  --skip-jp2            Do not generate JP2 datastreams, this cannot be used with --skip-derivatives
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set logging level, defaults to ERROR.

Examples

  1. Process a PDF file into the correct directory structure with just each PDF page split out.

    ./multipage2book.py --output-dir=OUTPUT --skip-derivatives MyBook.pdf 

    This creates the following structure

    OUTPUT/
          MyBook_dir/
                     PDF.pdf
                     1/
                       PDF.pdf
                     2/
                       PDF.pdf
                     ...
  2. Process a PDF file into the correct directory structure with simple derivatives from the source.

    ./multipage2book.py --output-dir=OUTPUT --skip-hocr-ocr --skip-jp2 MyBook.pdf 

    This creates the following structure

    OUTPUT/
          MyBook_dir/
                     PDF.pdf
                     TN.jpg
                     1/
                       OBJ.tiff
                       PDF.pdf
                       JPG.jpg
                       TN.jpg
                     2/
                       OBJ.tiff
                       PDF.pdf
                       JPG.jpg
                       TN.jpg
                     ...
  3. Process a PDF file into the correct directory structure processing the MODS file down to the pages.

    Assuming a directory called "INPUT"

    INPUT/
          MyPDF.pdf
          MyPDF.xml

    Then calling:

    ./multipage2book.py INPUT --output-dir=OUTPUT --skip-hocr-ocr --skip-jp2 --mods-extension=xml 

    This creates the following structure

    OUTPUT/
          MyPDF_dir/
                     PDF.pdf
                     MODS.xml
                     TN.jpg
                     1/
                       JPG.jpg
                       MODS.xml
                       OBJ.tiff
                       PDF.pdf
                       TN.jpg
                     2/
                       JPG.jpg
                       MODS.xml
                       OBJ.tiff
                       PDF.pdf
                       TN.jpg
                     ...
  4. The --merge option is useful, but problematic. Its use case is when a single Tiff could not hold all the pages of a book. So long as the various files share a common basename with an integer appended (ie. SomeBook1.tiff, SomeBook2.tiff, SomeBook3.tiff), they will all be combined into a single set of pages.

    Normally you can process a book over top of a previous run, and the script will just fill in the missing parts. However, the --merge option requires that there NOT be a book directory in the output directory. Because pages are being added, correct order and numbering can't be guaranteed unless the run starts fresh each time.

    Also any MODS file must match the filename WITHOUT the numeric extension.

    ie. MyTitle1.tiff -> MyTitle.mods

    Assuming an "INPUT" directory containing 3 files each with 10 pages

    INPUT/
          MyBook1.tiff
          MyBook2.tiff
          MyBook3.tiff
          MyBook.mods

    we process them with

    ./multipage2book.py INPUT --output-dir=OUTPUT --merge --skip-derivatives
    Warning: merge attempts to combine multiple files that start with the same name and end with a digit before the extension. Files are sorted by the number and require an empty starting directory. If the expected directory contains files, it will halt with a warning.
    Press any key to proceed

    The output directory would look like

    OUTPUT/
          MyBook_dir/
                     MODS.xml
                     1/
                       OBJ.tiff
                       MODS.xml
                     2/
                       OBJ.tiff
                       MODS.xml
                     ...
                     29/
                       OBJ.tiff
                       MODS.xml
                     30/
                       OBJ.tiff
                       MODS.xml
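
The grouping and page numbering that --merge performs can be illustrated with a short Python sketch. It is not the script's actual code: the helper names and the regular expression are assumptions, chosen to match the behaviour described above (files sharing a basename are grouped, ordered by their numeric suffix, and pages are numbered continuously across the parts).

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: a non-empty base name followed by a trailing integer.
SUFFIX_PATTERN = re.compile(r"^(?P<base>.*?)(?P<number>\d+)$")


def group_for_merge(files):
    """Group source files by base name, ordered by numeric suffix."""
    groups = defaultdict(list)
    for file in map(Path, files):
        match = SUFFIX_PATTERN.match(file.stem)
        if match and match.group("base"):
            groups[match.group("base")].append((int(match.group("number")), file))
        else:
            # No numeric suffix (or nothing but digits): a book on its own.
            groups[file.stem].append((0, file))
    # Sort numerically so that MyBook10 comes after MyBook2.
    return {base: [f for _, f in sorted(parts, key=lambda part: part[0])]
            for base, parts in groups.items()}


def first_page_numbers(page_counts):
    """Return the first page number of each part when pages run continuously."""
    starts, next_page = [], 1
    for count in page_counts:
        starts.append(next_page)
        next_page += count
    return starts


books = group_for_merge(["INPUT/MyBook2.tiff", "INPUT/MyBook3.tiff", "INPUT/MyBook1.tiff"])
print([f.name for f in books["MyBook"]])  # ['MyBook1.tiff', 'MyBook2.tiff', 'MyBook3.tiff']
print(first_page_numbers([10, 10, 10]))   # [1, 11, 21]
```

The group key (MyBook here) is also the name the MODS file has to carry, which is why MyBook.mods is matched rather than MyBook1.mods.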

Caveat

The hocrpdf.py class is included in such a way that if you specify a --loglevel level of DEBUG, any searchable PDFs generated will have the text visibly written over the page image. Only use this setting for debugging, never for production.

Other scripts

Along with multipage2book.py there are several support classes that can be run as standalone scripts. These are:

All of these scripts have usage arguments that can be revealed by running them with the -h or --help argument.

Acknowledgements

hocrpdf.py is a modification/rewrite of hocr-pdf from tmbdev.

It has been modified to:

  1. make it a class for inclusion in other code
  2. modify the calculation of the word box base
  3. use setTextTransform() instead of setTextOrigin() to assign the rotation of the box
  4. stop using the included invisible font
  5. set the font height to match the box height to get better word highlighting
  6. switch from the lxml.etree library to the xml.etree library

Maintainer

Jared Whiklo

License