internetarchive / archive-hocr-tools

Efficient hOCR tooling

DAISY: remove Internet Archive item dependency #15

Closed scottbarnes closed 2 months ago

scottbarnes commented 2 months ago

Depends on #13. The relevant changes for this particular PR are in e365988f80b8fdbb2993eb6f3e43c3243124158b.

IA item metadata is still used when it is available, but the minimum requirement is now only an hOCR file and an output file.

Minimum invocation:

❯ PYTHONPATH=. ./bin/hocr-to-daisy -f ./sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
-o test_daisy_output.zip

https://archive.org/details/sim_english-illustrated-magazine_1884-12_2_15

(Nearly) maximal invocation (without TOC):

❯ PYTHONPATH=. ./bin/hocr-to-daisy -f /home/scott/Downloads/daisy/items/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
-m /home/scott/Downloads/daisy/items/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15_meta.xml \
-s /home/scott/Downloads/daisy/items/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15_scandata.xml \
-o test_daisy_output.zip


It's not pictured, but the (nearly) maximal invocation also includes page numbers in the DAISY output, whereas the minimal invocation does not.
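
For reviewers, here is a minimal sketch of the required/optional split described above. It is not the code from this PR. Only the short flags `-f`, `-m`, `-s` and `-o` come from the invocations shown. The `dest` names, `build_parser`, and `describe_features` are hypothetical, and the sketch assumes page numbers come from the scandata file, which this description does not state.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the invocations above."""
    parser = argparse.ArgumentParser(prog="hocr-to-daisy")
    # Required: the PR's stated minimum is an hOCR file and an output file.
    parser.add_argument("-f", dest="hocr_file", required=True,
                        help="path to the hOCR input file")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="path of the DAISY zip to write")
    # Optional: IA item files, used only when supplied.
    parser.add_argument("-m", dest="meta_file", default=None,
                        help="path to the item's _meta.xml (optional)")
    parser.add_argument("-s", dest="scandata_file", default=None,
                        help="path to the item's _scandata.xml (optional)")
    return parser


def describe_features(args: argparse.Namespace) -> dict:
    """Report which optional features the supplied files would enable.

    Assumption: page numbers depend on the scandata file.
    """
    return {
        "item_metadata": args.meta_file is not None,
        "page_numbers": args.scandata_file is not None,
    }


if __name__ == "__main__":
    minimal = build_parser().parse_args(["-f", "book_hocr.html", "-o", "out.zip"])
    print(describe_features(minimal))
```

With only `-f` and `-o`, both optional features are reported as disabled. Passing `-m` and `-s` as well enables them.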