internetarchive / archive-hocr-tools

Efficient hOCR tooling

Add detection of image/graphics features #4

Closed by johtso 2 years ago

johtso commented 2 years ago

As far as I can tell, with the current setup the generated hOCR files don't include any layout information about where images are located in a document.

From the spec it looks like this should be possible? http://kba.cloud/hocr-spec/1.2/#floats-image

The previous OCR approach using ABBYY did produce picture features, which allowed some really exciting things like programmatically extracting and exploring images from books. That resulted in the Internet Archive Book Images project, which wouldn't really be feasible if you had to download every book page image just to check whether it contained any illustrations.

Is there a reason why these features aren't included, or is this something that just needs to be enabled?

Edit:

Ah, I think I might be in the wrong place. I thought this repository was related to the generation of the hOCR files. Does that stuff live somewhere else?

Edit2:

Found it: https://git.archive.org/www/tesseract

MerlijnWajer commented 2 years ago

For future reference, this is indeed something we're working on integrating into the archive.org OCR stack.

I think archive-hocr-tools should already support the tags. Are you interested in specific library features to find photos for a given hOCR page result?
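
For illustration, here is a minimal sketch of how image regions could be located in an hOCR page once such tags are present. It deliberately uses only Python's standard library, not the archive-hocr-tools API. It assumes the hOCR file is well-formed XHTML and that image-like regions carry one of the float classes from the hOCR spec (`ocr_image`, `ocr_photo`, `ocr_linedrawing`) with a `bbox` in the `title` attribute. The function name `find_image_regions` and the sample page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Image-like float classes from the hOCR spec. Which of these a given
# OCR engine actually emits is an assumption to verify against real output.
IMAGE_CLASSES = {"ocr_image", "ocr_photo", "ocr_linedrawing"}

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def find_image_regions(hocr_text):
    """Return a list of (class_name, (x0, y0, x1, y1)) for image-like elements."""
    root = ET.fromstring(hocr_text)
    regions = []
    for elem in root.iter():
        # An element may carry several space-separated classes.
        matched = set(elem.get("class", "").split()) & IMAGE_CLASSES
        if not matched:
            continue
        m = BBOX_RE.search(elem.get("title", ""))
        if m is None:
            continue  # skip image elements without a usable bounding box
        bbox = tuple(int(v) for v in m.groups())
        regions.append((sorted(matched)[0], bbox))
    return regions


# Hypothetical sample page, not taken from a real archive.org item.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" title="bbox 0 0 1000 1500">
<div class="ocr_photo" title="bbox 100 200 900 700"></div>
<p class="ocr_par" title="bbox 100 750 900 1400">Some text</p>
</div>
</body>
</html>"""

if __name__ == "__main__":
    print(find_image_regions(SAMPLE))
```

With something like this, checking whether a page contains illustrations only requires the hOCR file, and the returned bounding boxes could then be used to crop just the pages that matter.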