tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

too many "No block overlapping textline" #1058

Open honeytidy opened 7 years ago

honeytidy commented 7 years ago

I want to generate some training data for tesseract:

tesseract tiff/data.tif testdata/data lstm.train langdata/chi_sim/chi_sim.config

But I get a lot of the following messages; they appear while loading almost every page:

Page 54
Warning. Invalid resolution 1 dpi. Using 70 instead.
Loaded 603/603 pages (1-603) of document testdata/all.lstmf
No block overlapping textline: 2017探索
No block overlapping textline: 主持人阵容
No block overlapping textline: 序号
No block overlapping textline: 节目名称
No block overlapping textline: 节目类型
No block overlapping textline: 演出部门
......

My tesseract version:

tesseract 4.00.00alpha
leptonica-1.74.4

Any suggestions for this?

Shreeshrii commented 7 years ago

Please see this note in wiki - https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training

No block overlapping textline: occurs when layout analysis fails to correctly segment the image that was given as training data. The textline is dropped. Not much problem if there aren't many, but if there are a lot, there is probably something wrong with the training text or rendering process.
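To judge whether there are "a lot" of dropped lines, it can help to count them in a captured log. The following is a minimal sketch, not part of Tesseract: it assumes the training output was saved to a text file with one message per line, in the format quoted in this thread. The function name `summarize_dropped_lines` is hypothetical.

```python
from collections import Counter

# Prefix printed for each dropped text line, as quoted in this thread.
DROP_PREFIX = "No block overlapping textline:"


def summarize_dropped_lines(log_text):
    """Return (total, Counter) of text lines reported as dropped in a log.

    Assumes one log message per line; this is an illustrative helper,
    not something Tesseract provides.
    """
    dropped = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(DROP_PREFIX):
            # Keep only the text after the prefix, i.e. the dropped textline.
            dropped[line[len(DROP_PREFIX):].strip()] += 1
    return sum(dropped.values()), dropped


if __name__ == "__main__":
    sample = "\n".join([
        "Page 54",
        "No block overlapping textline: 序号",
        "No block overlapping textline: 节目名称",
    ])
    total, dropped = summarize_dropped_lines(sample)
    print(total, dict(dropped))
```

Comparing the total against the number of lines in the training text gives a rough idea of whether the drops are occasional or systematic.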

Shreeshrii commented 7 years ago

Still getting some of these errors for Devanagari, with tif/box pairs generated by text2image. Seems to be around ---------०--------- in training text.


No block overlapping textline: ---------०---------
No block overlapping textline: वित्त्येवहि अचूर्यामहि कृतघ्नं शत्रून्द्रुहे शुष्कीकरोति
No block overlapping textline: अर्कैः
No block overlapping textline: ह्यन्बन्त्यांञ्जगृहीतवती शक्तिपीठं ग्न्य छन्दष्ट्य झ
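If decorative separator lines such as the dashed rule above are indeed part of the problem, one option is to flag them in the training text before rendering. The sketch below is only an assumption-based illustration: the character set, the 0.8 threshold, and the names `is_separator_line` and `split_training_text` are all hypothetical, and it would not explain the ordinary text lines that are also dropped in this log.

```python
# Characters treated as "rule" characters; this set is an assumption.
RULE_CHARS = set("-_=*~")


def is_separator_line(line, threshold=0.8):
    """Return True if most non-space characters in the line are rule characters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    rule_count = sum(1 for c in chars if c in RULE_CHARS)
    return rule_count / len(chars) >= threshold


def split_training_text(lines):
    """Split lines into (kept, flagged), where flagged lines look like separators."""
    kept, flagged = [], []
    for line in lines:
        if is_separator_line(line):
            flagged.append(line)
        else:
            kept.append(line)
    return kept, flagged


if __name__ == "__main__":
    kept, flagged = split_training_text(["-" * 9 + "०" + "-" * 9, "अर्कैः"])
    print("flagged:", flagged)
    print("kept:", kept)
```

Flagged lines can then be reviewed by hand rather than deleted automatically, since whether they cause the warning is not confirmed in this thread.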