oliveiracwb / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

PATCH: output form feed control character between pages #1417

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
Attached is another very trivial patch that the project may find useful.

We have found that during post-processing of tesseract output text, it can be 
very helpful to have the form feed (page break) control character present at 
the end of a page.
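
To illustrate the kind of post-processing this makes possible, here is a minimal Python sketch that splits OCR text back into pages on the form feed character. The helper name `split_pages` and the sample string are made up for illustration and are not part of the patch; the sketch also assumes each page, including the last, ends with a form feed.

```python
FORM_FEED = "\f"  # ASCII 0x0C, the form feed (page break) control character


def split_pages(text: str) -> list[str]:
    """Split OCR output text into per-page chunks on form feed characters."""
    pages = text.split(FORM_FEED)
    # A form feed after the final page leaves an empty trailing chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


# Hypothetical two-page output, each page terminated by a form feed.
sample = "first page\n\fsecond page\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

Without the form feed there is no reliable marker in the plain text to recover page boundaries from.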

This patch adds a configuration parameter called "include_formfeed_pagebreaks" 
which enables this behavior. It applies to TessTextRenderer only: hOCR and box 
output already seem to contain page number metadata, and I don't know what 
UNLV text is.
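
As a rough model of the behavior described above (not the actual C++ change in the patch), the following Python sketch shows a text renderer that appends a form feed after each page only when the flag is set. The function `render_text` is hypothetical; only the parameter name "include_formfeed_pagebreaks" comes from this issue, and emitting the form feed after every page, including the last, is an assumption.

```python
def render_text(pages, include_formfeed_pagebreaks=False):
    """Concatenate per-page text, optionally ending each page with a form feed.

    Illustrative model only; the real renderer is C++ code inside Tesseract.
    """
    out = []
    for page_text in pages:
        out.append(page_text)
        if include_formfeed_pagebreaks:
            out.append("\f")  # form feed marks the end of this page
    return "".join(out)


pages = ["first page\n", "second page\n"]
print(repr(render_text(pages)))                                    # default: no page breaks
print(repr(render_text(pages, include_formfeed_pagebreaks=True)))  # form feed after each page
```

Keeping the flag off by default means existing consumers of the text output see no change unless they opt in.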

I'm also including a sample TIFF image and the text output produced with the 
parameter disabled (the default behavior) and enabled.

Discussion:
https://groups.google.com/d/msg/tesseract-dev/VsgJ9R-cTQ0/OMeDjYWoAdQJ

Original issue reported on code.google.com by zde...@gmail.com on 30 Jan 2015 at 9:37

GoogleCodeExporter commented 9 years ago
Fixed in 4c7c960bfd57.

Original comment by zde...@gmail.com on 7 Feb 2015 at 9:23