johtso closed 2 years ago
For future reference, this is indeed something we're working on integrating in the archive.org OCR stack.
I think archive-hocr-tools should already support the tags. Are you interested in specific library features to find photos for a given hOCR page result?
As far as I can tell, with the current setup the generated hOCR files don't include any layout information about the location of images in a document.
From the spec it looks like this should be possible? http://kba.cloud/hocr-spec/1.2/#floats-image
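To make concrete what I mean, here is a rough sketch of how image regions could be pulled out of an hOCR page if the files did carry them. It assumes the float classes from the spec (`ocr_image`, `ocr_photo`, `ocr_linedrawing`) with a `bbox` in the `title` attribute. The function and class names are made up for illustration, it only uses the Python standard library, and it is not how archive-hocr-tools does anything.

```python
import re
from html.parser import HTMLParser

# Float classes the hOCR spec lists for image-like regions.
IMAGE_CLASSES = {"ocr_image", "ocr_photo", "ocr_linedrawing"}
# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class ImageRegionFinder(HTMLParser):
    """Hypothetical helper: collects (class, bbox) for image-like elements."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        hit = classes & IMAGE_CLASSES
        match = BBOX_RE.search(attr.get("title") or "")
        # Only keep elements that have both an image class and a bbox.
        if hit and match:
            bbox = tuple(int(v) for v in match.groups())
            self.regions.append((sorted(hit)[0], bbox))


def find_image_regions(hocr_text):
    """Return a list of (class_name, (x0, y0, x1, y1)) for one hOCR page."""
    finder = ImageRegionFinder()
    finder.feed(hocr_text)
    return finder.regions
```

With something like this, a page with no matching elements just returns an empty list, so you could skip downloading the page image entirely.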
The previous OCR approach using ABBYY did produce picture features, which allowed some really exciting things like programmatically extracting and exploring images from books. That resulted in the Internet Archive Book Images project, something that wouldn't really be feasible if you had to download every book page image just to check whether it contained any illustrations.
Is there a reason why these features aren't included, or is this something that just needs to be enabled?
Edit:
Ah, I think I might be in the wrong place. I thought this repository related to the generation of the hOCR files. Does that stuff live somewhere else?
Edit2:
Found it: https://git.archive.org/www/tesseract