jacklicn / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Paragraph markup in hocr output not correct #536

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

While playing around with Tesseract's hOCR output, I noticed that
paragraph detection seems a bit off. I think this issue is minor, since
most programs that use hOCR output rely on the box data. In my case, however,
I want to render the resulting text on a small screen, so the box data is not
really useful, but the paragraph information in the hOCR markup is.

What steps will reproduce the problem?
1. Scan a page of text that has a paragraph starting with an indent.
2. Review the hOCR output.

What is the expected output? What do you see instead?

A paragraph that starts with an indented line is recognized correctly, but the
text line following the first line of that paragraph is also marked as the
start of a new paragraph.

What version of the product are you using? On what operating system?
svn trunk, Version 3.01

Please provide any additional information below.

example:
suppose you have three text rows:
1. AAAAAAAAAAAAAAAAAAAAAAAAA
2.     BBBBBBBBBBBBBBBBBBBBB
3. CCCCCCCCCCCCCCCCCCCCCCCCC

Now, during hOCR generation, each row and its successor are passed to
IsParagraphBreak(..). IsParagraphBreak(row1,row2) reports a paragraph break,
but IsParagraphBreak(row2,row3) also reports one, so the same paragraph is
detected twice.
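
To make the double detection concrete, here is a minimal Python sketch. It is not Tesseract's code: `is_paragraph_break` below is a hypothetical stand-in that treats any shift of the left margin as a break, and the indent values are made up to match the three example rows.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    text: str
    left: int  # left edge of the row (made-up units, illustration only)


def is_paragraph_break(prev: Row, cur: Row, tolerance: int = 2) -> bool:
    """Hypothetical stand-in for IsParagraphBreak: any margin shift is a break."""
    return abs(cur.left - prev.left) > tolerance


def breaks_pairwise(rows: List[Row]) -> List[int]:
    """Return 1-based numbers of rows flagged as starting a new paragraph
    when every row is compared with its successor."""
    return [
        i + 1
        for i in range(1, len(rows))
        if is_paragraph_break(rows[i - 1], rows[i])
    ]


rows = [
    Row("AAAAAAAAAAAAAAAAAAAAAAAAA", 0),
    Row("BBBBBBBBBBBBBBBBBBBBB", 4),  # indented first line of a paragraph
    Row("CCCCCCCCCCCCCCCCCCCCCCCCC", 0),
]

print(breaks_pairwise(rows))  # [2, 3]
```

Row 2 is flagged because the margin moves in, and row 3 is flagged because the margin moves back out, which matches the behavior described above.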

One possible solution would be to call IsParagraphBreak(..) only for odd rows
rather than for every row.
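
One literal reading of this suggestion can be sketched in Python as follows. Again, `is_paragraph_break` is a hypothetical stand-in rather than the real Tesseract function, and the rows and indents are invented for illustration.

```python
from typing import List, Tuple

# (text, left_indent) pairs; the indent values are made up for illustration.
Row = Tuple[str, int]


def is_paragraph_break(prev: Row, cur: Row, tolerance: int = 2) -> bool:
    """Hypothetical stand-in for IsParagraphBreak: any margin shift is a break."""
    return abs(cur[1] - prev[1]) > tolerance


def breaks_odd_rows_only(rows: List[Row]) -> List[int]:
    """Return 1-based numbers of rows flagged as paragraph starts when only
    odd-numbered rows (1, 3, 5, ...) are compared with their successors."""
    starts = []
    for n in range(1, len(rows), 2):  # n is a 1-based odd row number
        if is_paragraph_break(rows[n - 1], rows[n]):
            starts.append(n + 1)
    return starts


example = [("AAAA", 0), ("BBBB", 4), ("CCCC", 0)]
print(breaks_odd_rows_only(example))  # [2]

# Same layout shifted down by one row: the indented line is now row 3.
shifted = [("AAAA", 0), ("BBBB", 0), ("CCCC", 4), ("DDDD", 0)]
print(breaks_odd_rows_only(shifted))  # [4]
```

For the three-row example this yields a single break at row 2. Under this reading, though, the result depends on row parity: in the shifted layout the indented row 3 is never tested as a successor, and row 4 is flagged instead.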

Original issue reported on code.google.com by renard.w...@googlemail.com on 19 Aug 2011 at 11:14

GoogleCodeExporter commented 9 years ago
Please check the current svn revision (729); it has improved paragraph
segmentation and hOCR output.

Original comment by zde...@gmail.com on 29 May 2012 at 9:14

GoogleCodeExporter commented 9 years ago
3.02 works for me (see attachments)

Original comment by zde...@gmail.com on 30 Jul 2012 at 8:11

Attachments: