Closed JamesOwers closed 9 years ago
You can use the C-API to only retrieve the page segmentation without doing character recognition. Use TessBaseAPISetPageSegMode to set the segmentation mode, call TessBaseAPIProcessPages, and finally retrieve the page iterator using TessBaseAPIAnalyseLayout. Iterate using the TessPageIteratorNext function at the lowest level and check with TessPageIteratorIsAtBeginningOf if the current symbol is at the start of a new block. All in all it shouldn't be more than a few lines of C code and you're skipping the recognition part of tesseract completely.
For support please use tesseract-ocr user forum. See FAQ[1]
[1] https://github.com/tesseract-ocr/tesseract/wiki/FAQ#rules-and-advice
@zdenop thank's for clarifying. Here is the link to my forum post (which contains another answer): https://groups.google.com/forum/#!topic/tesseract-ocr/1Frh-5ggNxg
This issue is currently the top search result for 'ocr_float'; it lacks a simple summary: Tesseract (currently) does not support ocr_float.
@jimregan Cheers! I'm reproducing your answer on the linked forum page (the preferred help location).
That seems a bit redundant; I was merely summarising what you were told there :)
I'm trying to reproduce results achieved at the ICDAR page segmentation competitions [1,2] with tesseract. I'm struggling to get the tool to output the hOCR tags that I'm expecting for tables and figures etc [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to achieve the full extent of tesseracts segmentation and labelling ability (I'm not interested in the character recognition element as much).