tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Page segmentation output ocr_float #42

Closed by JamesOwers 9 years ago

JamesOwers commented 9 years ago

I'm trying to reproduce the results achieved at the ICDAR page segmentation competitions [1,2] with Tesseract. I'm struggling to get the tool to output the hOCR tags that I'm expecting for tables, figures, etc. [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to get the full extent of Tesseract's segmentation and labelling ability? (I'm not as interested in the character recognition element.)

  1. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Book Recognition – HBR2013
  2. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
  3. Breuel (2010) The hOCR Embedded OCR Workflow and Output Format

mittagessen commented 9 years ago

You can use the C API to retrieve only the page segmentation without doing character recognition. Use TessBaseAPISetPageSegMode to set the segmentation mode, load the image (for example with TessBaseAPISetImage2) rather than calling TessBaseAPIProcessPages, which runs full recognition, and then retrieve the page iterator using TessBaseAPIAnalyseLayout. Iterate using the TessPageIteratorNext function at the lowest level and check with TessPageIteratorIsAtBeginningOf whether the current symbol is at the start of a new block. All in all it shouldn't be more than a few lines of C code, and you skip the recognition part of Tesseract completely.

zdenop commented 9 years ago

For support, please use the tesseract-ocr user forum. See the FAQ [1].

[1] https://github.com/tesseract-ocr/tesseract/wiki/FAQ#rules-and-advice

JamesOwers commented 9 years ago

@zdenop thanks for clarifying. Here is the link to my forum post (which contains another answer): https://groups.google.com/forum/#!topic/tesseract-ocr/1Frh-5ggNxg

jimregan commented 9 years ago

This issue is currently the top search result for 'ocr_float'; it lacks a simple summary: Tesseract (currently) does not support ocr_float.
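
One quick way to check this against your own output is to tally the hOCR class names in a file that Tesseract produced. The following is only a minimal sketch using the Python standard library. `HocrClassCounter`, `count_hocr_classes`, and the `SAMPLE` fragment are hypothetical names and data made up for illustration; the fragment is hand-written, not real Tesseract output.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Tally class names starting with ocr_ or ocrx_ on any element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # An element may carry several space-separated classes.
                for cls in value.split():
                    if cls.startswith(("ocr_", "ocrx_")):
                        self.counts[cls] += 1


def count_hocr_classes(hocr_text):
    """Return a Counter mapping hOCR class name -> number of elements."""
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Hypothetical, hand-written hOCR fragment used only for illustration.
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 800 600'>
 <div class='ocr_carea' id='block_1_1' title='bbox 10 10 400 100'>
  <p class='ocr_par' id='par_1_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Hello</span>
   </span>
  </p>
 </div>
</div>
"""

if __name__ == "__main__":
    counts = count_hocr_classes(SAMPLE)
    for cls, n in sorted(counts.items()):
        print(cls, n)
    print("ocr_float present:", "ocr_float" in counts)
```

To run it on a real file, read the hOCR output into a string and pass it to `count_hocr_classes` in place of `SAMPLE`.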

JamesOwers commented 9 years ago

@jimregan Cheers! I'm reproducing your answer on the linked forum page (the preferred help location).

jimregan commented 9 years ago

That seems a bit redundant; I was merely summarising what you were told there :)