internetarchive / archive-hocr-tools

Efficient hOCR tooling

hocr-to-daisy: use internetarchive-deriver-module for scandata #13

Closed scottbarnes closed 2 weeks ago

scottbarnes commented 1 month ago

This commit changes the scandata parsing to use the internetarchive-deriver-module.

It also skips pages that have addToAccessFormats == 'false'.
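The skip rule can be illustrated with a minimal sketch. This is not the code from this PR and does not use the internetarchive-deriver-module API; it uses only Python's standard `xml.etree.ElementTree`. The sample XML layout (`book` / `pageData` / `page` with a `leafNum` attribute), the helper name `included_leaf_numbers`, and the choice to include pages that have no `addToAccessFormats` element are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down scandata sample (layout assumed for illustration).
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0">
      <pageType>Color Card</pageType>
      <addToAccessFormats>false</addToAccessFormats>
    </page>
    <page leafNum="1">
      <pageType>Title</pageType>
      <addToAccessFormats>true</addToAccessFormats>
    </page>
    <page leafNum="2">
      <pageType>Normal</pageType>
    </page>
  </pageData>
</book>"""


def included_leaf_numbers(scandata_xml):
    """Return leaf numbers of pages not marked addToAccessFormats == 'false'."""
    root = ET.fromstring(scandata_xml)
    leaves = []
    for page in root.iter('page'):
        # Assumption: a page without the element is treated as included.
        flag = page.findtext('addToAccessFormats', default='true')
        if flag.strip().lower() == 'false':
            continue  # skip pages excluded from access formats
        leaves.append(int(page.get('leafNum')))
    return leaves


print(included_leaf_numbers(SAMPLE_SCANDATA))
```

In this sample the colour card at leaf 0 is dropped while the title page at leaf 1 is kept, which matches the behaviour described above.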

With this change, The English Illustrated Magazine 1884-12: Vol 2 Iss 15 now starts on the correct page, that is, it includes the title page: [image]

One consequence is that if the text of the title page is somewhat garbled, that gibberish now shows up in the output: [image] [image]

MerlijnWajer commented 2 weeks ago

(Merged as part of PR 15.)