internetarchive / archive-hocr-tools

Efficient hOCR tooling

DAISY: only add Arabic and Roman numerals to navigation #14

Closed scottbarnes closed 2 months ago

scottbarnes commented 2 months ago

This commit stops adding page numbers that are neither Arabic nor Roman numerals to the DAISY navigation, and treats them like normal text instead.

In doing so, it fixes the following failure:

2024-08-17 22:28:23,333 INFO     python-derivermodule version: 1.0.25; hocr version: 1.1.61; and entrypoint version: 1.0.1.
2024-08-17 22:28:23,333 INFO     sourceFile: '/item/DTIC_ADA040218_abbyy.gz' -> targetFile: '/var/tmp/tmp/generated/DTIC_ADA040218/tmp_daisy.zip'
2024-08-17 22:28:23,357 INFO     converting /item/DTIC_ADA040218_abbyy.gz to hocr
2024-08-17 22:28:27,628 INFO     successfully converted /item/DTIC_ADA040218_abbyy.gz to hocr (/tmp/tmp.hocr.html)
2024-08-17 22:28:27,628 INFO     converting /tmp/tmp.hocr.html to daisy
2024-08-17 22:28:27,978 INFO     Failure while parsing zip iabook: Traceback (most recent call last):
  File "/usr/local/bin/hocr-to-daisy", line 467, in <module>
    dg.process_book_hocr(ebook=daisy_book)
  File "/usr/local/bin/hocr-to-daisy", line 331, in process_book_hocr
    ebook.add_pagetarget(pageno, pageno)
  File "/usr/local/lib/python3.12/site-packages/hocr/daisy/book.py", line 213, in add_pagetarget
    raise ValueError(error_text)
ValueError: Got non-Arabic, non-Roman numeral, or negative pagetarget value
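
For illustration, the kind of check described above could look roughly like the sketch below. This is not the code from `hocr/daisy/book.py` or `hocr-to-daisy`; the names `is_navigable_pageno` and `route_pageno` are hypothetical, and the strict Roman-numeral pattern (1 to 3999, either case) is an assumption about what should count as a Roman numeral.

```python
import re

# Assumed definition of a Roman numeral: strict form, 1..3999, either case.
_ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
    re.IGNORECASE,
)


def is_navigable_pageno(pageno: str) -> bool:
    """Return True for a non-negative Arabic numeral or a Roman numeral."""
    text = pageno.strip()
    if not text:
        # The Roman pattern would match the empty string, so reject it first.
        return False
    # ASCII digits only; str.isdigit() alone also accepts e.g. superscripts.
    if text.isascii() and text.isdigit():
        return True
    return _ROMAN_RE.fullmatch(text) is not None


def route_pageno(pageno: str) -> str:
    """Hypothetical helper: report where a page number would end up."""
    return "navigation" if is_navigable_pageno(pageno) else "text"
```

Under these assumptions, `41` and `xiv` would go to navigation, while `A-1` would be emitted as text instead of raising the `ValueError` shown above.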

Note: whereas a page would previously have been featured in navigation, a page with a non-Arabic, non-Roman number now just shows up as text. The theory is that this stays roughly consistent with how such pages were being presented before, insofar as they were presented, rather than dropping them. But perhaps dropping them is preferred.

Some screenshots to illustrate this.

Page 40 ending, and page 41 starting, with Arabic numerals: image

Now in the appendix, pages look like A-1, A-2, etc. Page A-2 ending, and page A-3 starting: image