ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Can ocropy classify different types of text lines? #317

Closed DamonsJ closed 5 years ago

DamonsJ commented 5 years ago

Suppose a document page contains text, math equations, and images. Can ocropy identify which block is text, which block is a math equation, and which block is an image?

There may also be a table in the document page. Is there any solution?

Thanks

zuphilip commented 5 years ago

There is currently no text/image classification in ocropus, although this was discussed before and implemented in a previous version; see #38. The different columns in a table might be detected, especially if there are black separators between them. There is nothing for detecting mathematical formulas.
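To illustrate why black separators help with column detection, here is a minimal sketch, not ocropus code. It assumes a binarized page given as a list of rows with 1 for black and 0 for white; the function name `find_vertical_separators` and the `min_fill` threshold are hypothetical, chosen only for this example. A column of pixels that is almost entirely black is treated as a separator rule.

```python
def find_vertical_separators(page, min_fill=0.9):
    """Return (start, end) column ranges, end exclusive, that are mostly black.

    `page` is a list of equal-length rows; 1 = black pixel, 0 = white.
    This is an illustrative heuristic, not the method ocropus uses.
    """
    if not page:
        return []
    height = len(page)
    width = len(page[0])
    # Fraction of black pixels in each pixel column.
    fill = [sum(row[x] for row in page) / height for x in range(width)]

    runs, start = [], None
    for x, f in enumerate(fill):
        if f >= min_fill and start is None:
            start = x                  # a separator run begins
        elif f < min_fill and start is not None:
            runs.append((start, x))    # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width))    # run reaches the right edge
    return runs


# Toy page: two text-like regions divided by a solid black rule at x = 4.
page = [
    [1, 0, 1, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 1, 0],
]
separators = find_vertical_separators(page)
print(separators)
```

Text columns rarely fill a whole pixel column with ink, so a high `min_fill` picks out ruled lines only. Tables without visible rules would need a different cue, such as whitespace gaps.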

kba commented 5 years ago

This is not part of ocropus. Ocropus does line detection, with a few heuristics and knobs to turn to keep lines from bleeding across columns, being inadvertently merged, and so on.

Have a look at dhSegment. For a semi-automatic solution, check out LAREX (they are working on a trainable pixel classifier as well, IIRC) or the Leptonica toolset, which tesseract uses.

kba commented 5 years ago

For completeness' sake, or for the curious: https://github.com/tmbdev/ocropy/wiki/OCRopus-File-Formats#physical-layout

amitdo commented 5 years ago

https://github.com/tesseract-ocr/tesseract/blob/b502bbf58e78/src/ccstruct/publictypes.h#L53

https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c/src/ccmain/pageiterator.h#L225
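As a rough guide to what those tesseract links offer for the original question, the sketch below mirrors tesseract's `PolyBlockType` enum in Python and folds it into the categories asked about (text, equation, table, image). The enum members and values are reproduced from memory of `publictypes.h` and should be checked against the linked header. The `coarse_category` helper and its grouping are hypothetical and not part of tesseract.

```python
from enum import IntEnum


class PolyBlockType(IntEnum):
    # Assumed to mirror tesseract's PolyBlockType; verify against publictypes.h.
    UNKNOWN = 0
    FLOWING_TEXT = 1
    HEADING_TEXT = 2
    PULLOUT_TEXT = 3
    EQUATION = 4
    INLINE_EQUATION = 5
    TABLE = 6
    VERTICAL_TEXT = 7
    CAPTION_TEXT = 8
    FLOWING_IMAGE = 9
    HEADING_IMAGE = 10
    PULLOUT_IMAGE = 11
    HORZ_LINE = 12
    VERT_LINE = 13
    NOISE = 14


# Hypothetical grouping for this thread's question; not tesseract's own helpers.
_TEXT = {
    PolyBlockType.FLOWING_TEXT, PolyBlockType.HEADING_TEXT,
    PolyBlockType.PULLOUT_TEXT, PolyBlockType.VERTICAL_TEXT,
    PolyBlockType.CAPTION_TEXT,
}
_EQUATION = {PolyBlockType.EQUATION, PolyBlockType.INLINE_EQUATION}
_IMAGE = {
    PolyBlockType.FLOWING_IMAGE, PolyBlockType.HEADING_IMAGE,
    PolyBlockType.PULLOUT_IMAGE,
}


def coarse_category(block_type):
    """Map a block type (enum member or int) to a coarse label."""
    block_type = PolyBlockType(block_type)
    if block_type in _TEXT:
        return "text"
    if block_type in _EQUATION:
        return "equation"
    if block_type is PolyBlockType.TABLE:
        return "table"
    if block_type in _IMAGE:
        return "image"
    return "other"  # rules, noise, unknown


print(coarse_category(PolyBlockType.EQUATION))
print(coarse_category(6))
```

In tesseract itself, the block type of the current layout element is what the page iterator linked above reports. Obtaining it requires running tesseract's layout analysis, which this sketch does not do.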