tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Two-column document with ordered lists loses numbers #2363

Open james-s-w-clark opened 5 years ago

james-s-w-clark commented 5 years ago

Environment

Current Behavior:

Ordered/unordered lists of growing length affect recognition of the other column and of the bullet points in a two-column image. This is with --psm 1 and -l eng.

Input 1: input_2_columns_ol tessDebug_ol

And a slightly different Input 2: inout_2_columns_ul_ol tessDebug_ul_ol

Expected Behavior:

Tesseract should segment the text into two columns, and: 1) identify all the bullet-point numbers (in both columns), 2) recognize the text even on lines with little text (maybe these are too sparse for recognition?). It seems that at least 4 characters are needed on a line for it to be recognized (but by that rule, the two-line bullet 1. under section 5. should be readable).

Suggested Fix:

I don't have a suggestion for this.

prince1998 commented 3 years ago

Hello :) I need to do the same thing. Were you able to find a solution for this? I would be really grateful if you could help. Kind regards

james-s-w-clark commented 3 years ago

You could let Tesseract treat this as two single-column images by splitting the original image.
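
A minimal sketch of that workaround in Python, in case it helps. It assumes the Pillow and pytesseract packages are installed, that the page is dark text on a light background, and that the two columns are separated by a clean vertical strip of white pixels. The helper names (find_gutter, column_boxes, ocr_two_columns), the darkness threshold of 128, and the choice of --psm 6 for each half are my own assumptions, not something Tesseract prescribes.

```python
def find_gutter(ink_per_column):
    """Return the x position at the centre of the widest blank vertical strip.

    ink_per_column[x] is the number of dark pixels in pixel column x.
    Blank strips touching the left or right edge are treated as page
    margins and ignored. Falls back to the middle of the image if no
    interior blank strip exists.
    """
    width = len(ink_per_column)
    best_start, best_len = None, 0
    run_start = None
    # The trailing sentinel closes a blank run that reaches the right edge.
    for x, ink in enumerate(list(ink_per_column) + [1]):
        if ink == 0:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            is_interior = run_start > 0 and x < width
            if is_interior and x - run_start > best_len:
                best_start, best_len = run_start, x - run_start
            run_start = None
    if best_start is None:
        return width // 2
    return best_start + best_len // 2


def column_boxes(width, height, gutter_x):
    """Return (left, upper, right, lower) crop boxes for the two columns."""
    return [(0, 0, gutter_x, height), (gutter_x, 0, width, height)]


def ocr_two_columns(path, lang="eng"):
    """Split the image at the gutter and OCR the left half, then the right."""
    # Imported here so the geometry helpers above work without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(path).convert("L")  # greyscale, 0 = black
    width, height = image.size
    pixels = image.load()
    # Count dark pixels per column; 128 is an assumed threshold.
    ink = [
        sum(1 for y in range(height) if pixels[x, y] < 128)
        for x in range(width)
    ]
    gutter_x = find_gutter(ink)
    texts = []
    for box in column_boxes(width, height, gutter_x):
        column = image.crop(box)
        # --psm 6 treats each half as a single uniform block of text.
        texts.append(pytesseract.image_to_string(column, lang=lang, config="--psm 6"))
    return "\n".join(texts)
```

Calling ocr_two_columns("page.png") (a hypothetical file name) would return the left column's text followed by the right column's. On noisy scans the gutter may not be perfectly blank, so the zero-ink test in find_gutter would likely need a small tolerance.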