Handle image and other non-text regions in output formats

stweil commented 2 years ago

Internally Tesseract detects different kinds of regions, not only text regions.

Currently, regions for images and for horizontal or vertical lines are also written to ALTO, hOCR and text output as paragraphs, lines and (empty) words, which unnecessarily increases the output file size and hides the relevant information.

For text files such regions should be skipped.
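
A minimal sketch of the skipping logic for text output, written in Python rather than Tesseract's C++. The `RegionKind` and `Region` types are hypothetical stand-ins for the internal block types, not Tesseract's actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto


class RegionKind(Enum):
    # Hypothetical, simplified stand-in for the internal region types.
    TEXT = auto()
    IMAGE = auto()
    HORZ_LINE = auto()
    VERT_LINE = auto()


@dataclass
class Region:
    kind: RegionKind
    text: str = ""  # stays empty for non-text regions


def render_plain_text(regions):
    """Join the text of text regions; images and separator lines are skipped."""
    paragraphs = [r.text for r in regions if r.kind is RegionKind.TEXT and r.text]
    return "\n\n".join(paragraphs) + ("\n" if paragraphs else "")


page = [
    Region(RegionKind.TEXT, "First paragraph."),
    Region(RegionKind.IMAGE),
    Region(RegionKind.HORZ_LINE),
    Region(RegionKind.TEXT, "Second paragraph."),
]
print(render_plain_text(page), end="")
```

With this filter the image and the line no longer show up as empty paragraphs in the text file.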

For ALTO and hOCR, those regions are useful but need the correct representation.
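
One possible representation, sketched in Python. The hOCR classes `ocr_photo` and `ocr_separator` and the ALTO elements `Illustration` and `GraphicalElement` exist in the respective specifications, but which of them Tesseract should emit is an assumption here, and the `region_N` IDs and the `kind` strings are made up for illustration:

```python
# Assumed mapping from a simplified region kind to the output element.
HOCR_CLASS = {"image": "ocr_photo", "hline": "ocr_separator", "vline": "ocr_separator"}
ALTO_TAG = {"image": "Illustration", "hline": "GraphicalElement", "vline": "GraphicalElement"}


def hocr_region(kind, index, bbox):
    """Return an empty hOCR div carrying only the region type and bounding box."""
    left, top, right, bottom = bbox
    return (
        f"<div class='{HOCR_CLASS[kind]}' id='region_{index}' "
        f"title='bbox {left} {top} {right} {bottom}'></div>"
    )


def alto_region(kind, index, bbox):
    """Return an ALTO element; ALTO stores position plus size, not two corners."""
    left, top, right, bottom = bbox
    return (
        f'<{ALTO_TAG[kind]} ID="region_{index}" HPOS="{left}" VPOS="{top}" '
        f'WIDTH="{right - left}" HEIGHT="{bottom - top}"/>'
    )


print(hocr_region("image", 1, (10, 20, 110, 220)))
print(alto_region("hline", 2, (50, 700, 950, 704)))
```

Either way a region becomes a single element with a bounding box instead of a paragraph, a line and an empty word.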

PDF output still has to be examined. Maybe skipping the non-text regions is reasonable there, too.

gunnar-ifp commented 2 years ago

I have been dabbling with PDF for the last year for work, and I combine PDF and hOCR with some magic code to create marked content tags in the PDF (for headings and paragraphs; I even add the word confidence to each word for later filtering). It would of course be easier if the PDFRenderer did all this directly. I wanted to make a fork and do this to give back to you guys when I have some time. I haven't done C++ in a long time, so the hurdle is a bit high.

This allows for simple text extraction without layout analysis and helps with screen readers and such. You can give the language and even the text direction (Arabic text) as well as font hints. The tags alone don't hurt anybody and can be added without much work, but one needs a "tagged PDF" for this to work "officially", and I have been looking into this, too. All it would need is a structure tree root, and probably adding MCIDs to the root-level tags on each page and adding these to the structure tree. Once this step has been reached, adding image references to the structure tree root is the next step.
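
To illustrate what such tags look like in a page content stream, here is a rough Python sketch. The `BDC`/`EMC` operators and the `MCID` and `Lang` property entries come from the PDF specification; the helper names and the sample text operators are invented for this example and are not what Tesseract's PDF renderer does today:

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def marked_content(tag, content_ops, mcid=None, lang=None):
    """Wrap content stream operators in a marked-content sequence (BDC ... EMC)."""
    props = []
    if mcid is not None:
        props.append(f"/MCID {mcid}")  # links the sequence to the structure tree
    if lang is not None:
        props.append(f"/Lang ({escape_pdf_string(lang)})")
    return f"/{tag} <<{' '.join(props)}>> BDC\n{content_ops}\nEMC"


# Hypothetical text operators for one word inside one paragraph.
word = marked_content("Span", "(Hello) Tj", lang="en")
paragraph = marked_content("P", f"BT /F1 12 Tf 72 720 Td\n{word}\nET", mcid=0)
print(paragraph)
```

The tags by themselves are harmless. For a real tagged PDF each MCID would also have to be referenced from a structure element under the structure tree root, which this sketch leaves out.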

I would really like that. My plan is to selectively compress the PDF like commercial tools do, where you use oversampled b/w image masks for the text and store the image areas as pictures. This way you can reduce the file size a lot.
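
Roughly what I have in mind, as a toy Python sketch: the page is a plain list of grayscale rows, the image boxes are assumed to come from the layout analysis, and the oversampling of the mask as well as the actual compression are left out. The function name and the threshold are made up:

```python
def split_layers(page, image_boxes, threshold=128):
    """Split a grayscale page into a 1-bit text mask and crops of the image areas.

    page: list of rows with gray values 0..255
    image_boxes: list of (left, top, right, bottom), right and bottom exclusive
    """
    height, width = len(page), len(page[0])

    def in_image(x, y):
        return any(l <= x < r and t <= y < b for l, t, r, b in image_boxes)

    # 1 marks dark pixels outside every image area, everything else is 0.
    mask = [
        [1 if page[y][x] < threshold and not in_image(x, y) else 0 for x in range(width)]
        for y in range(height)
    ]
    # Image areas are kept as grayscale crops to be stored as pictures.
    crops = [[row[l:r] for row in page[t:b]] for l, t, r, b in image_boxes]
    return mask, crops


page = [
    [0, 255, 100, 200],
    [255, 0, 50, 60],
]
mask, crops = split_layers(page, [(2, 0, 4, 2)])
print(mask)
print(crops)
```

The mask can then go through a bilevel codec and the crops through a photo codec, which is where the size reduction would come from.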

For now I might simply use the Java binding for testing this. Once I get this figured out, there should be a way back to Tesseract. If it is stored in the hOCR, I could extract it from there and wouldn't need to go the Java binding route.
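
Extracting the regions from hOCR could look something like this Python sketch. It assumes the non-text regions end up as elements with one of the classes `ocr_photo`, `ocr_image` or `ocr_separator` and a `bbox` in the `title` attribute; the sample markup is invented:

```python
import re
from html.parser import HTMLParser

# Assumed set of hOCR classes that mark non-text regions.
NON_TEXT_CLASSES = {"ocr_photo", "ocr_image", "ocr_separator"}
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class RegionCollector(HTMLParser):
    """Collect (class, bbox) pairs for every non-text element in an hOCR file."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        matched = set((attributes.get("class") or "").split()) & NON_TEXT_CLASSES
        bbox = BBOX_RE.search(attributes.get("title") or "")
        if matched and bbox:
            box = tuple(int(value) for value in bbox.groups())
            self.regions.append((sorted(matched)[0], box))


sample = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <div class='ocr_photo' id='block_1_2' title='bbox 100 200 500 600'></div>
  <div class='ocr_separator' title='bbox 50 700 950 704'></div>
  <p class='ocr_par' title='bbox 50 720 950 900'>text</p>
</div>
"""
collector = RegionCollector()
collector.feed(sample)
print(collector.regions)
```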

As said, the hurdle with C++ is a bit high, like what container classes to use...

stweil commented 2 years ago

Non-text regions are now handled for text, ALTO and hOCR output (see commit 424b17f997363670d187f42c43408c472fe55053).

TSV ~and PDF~ output still has to be done.

amitdo commented 1 year ago

TSV and PDF output still has to be done.

PDF: #3959.

Handle image and other non-text regions in output formats #3715