Cuneiform: Split text areas before OCR

openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab

https://gitlab.gnome.org/World/OpenPaperwork/pyocr


Cuneiform: Split text areas before OCR #2

Open jflesch opened 12 years ago

jflesch commented 12 years ago

Cuneiform tends to stop reading a page when it reaches a large non-readable area. Because of this, when using Cuneiform, not all of the keywords are actually extracted.

A way to work around this problem would be to split the text areas prior to OCR.

For instance, unpaper can do that (ocrfeeder uses it).
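
To make the idea concrete, here is a minimal sketch of one simple way to split a page into text bands before OCR: scan the rows and cut wherever there is a run of blank rows. This is an assumption for illustration only. It is not how unpaper or ocrfeeder work, and `split_text_bands`, `dark_threshold` and `min_gap` are hypothetical names that do not exist in PyOCR. The page is modelled as a plain list of grayscale rows so the example has no dependencies.

```python
def split_text_bands(rows, dark_threshold=128, min_gap=2):
    """Return (top, bottom) row ranges that contain dark pixels.

    rows: list of rows, each a list of grayscale values (0 = black, 255 = white).
    bottom is exclusive. Bands separated by fewer than min_gap blank rows
    are merged into a single band.
    """
    bands = []
    start = None     # first row of the band currently being built
    last_ink = None  # most recent row that contained a dark pixel
    for y, row in enumerate(rows):
        has_ink = any(px < dark_threshold for px in row)
        if has_ink:
            if start is None:
                start = y
            last_ink = y
        elif start is not None and y - last_ink >= min_gap:
            # Enough blank rows in a row: close the current band.
            bands.append((start, last_ink + 1))
            start = None
    if start is not None:
        # The page ended while a band was still open.
        bands.append((start, last_ink + 1))
    return bands


# Tiny hypothetical "page": W is a blank row, B is a row with ink.
W = [255, 255, 255, 255]
B = [255, 0, 0, 255]
page = [W, B, B, W, W, W, B, W]
bands = split_text_bands(page, min_gap=2)
```

With a real image, each band could then be cropped (for example with Pillow's `Image.crop`) and passed to the OCR tool separately, so that one unreadable area does not stop the rest of the page from being read.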

jflesch commented 7 years ago

Note: image processing algorithms should be added to https://github.com/jflesch/libpillowfight/ , not to PyOCR directly.