Two-column document with ordered lists loses numbers

james-s-w-clark commented 5 years ago

Environment

Tesseract Version: 4.0.0
Platform: Ubuntu 16.04 Xenial

Current Behavior:

In a two-column image, ordered/unordered lists of growing length affect recognition of the other column and of the bullet points. This is with --psm 1 and -l eng.

Input 1: input_2_columns_ol, with debug output tessDebug_ol

And a slightly different Input 2: inout_2_columns_ul_ol, with debug output tessDebug_ul_ol

Expected Behavior:

Tesseract should segment the text into two columns and (1) identify all the bullet-point numbers in both columns, and (2) recognize the text on lines that contain very little text (maybe these are too sparse for recognition?). It seems that at least 4 characters are needed on a line for it to be recognized (but in that case, the two-line bullet 1. under section 5. should be readable).

Suggested Fix:

I don't have a suggestion for this.

prince1998 commented 3 years ago

Hello :) I need to do the same thing. Were you able to find a solution for this? I would be really grateful if you could help. Kind regards

james-s-w-clark commented 3 years ago

You could let tesseract treat this as two single-column images by splitting the original image.
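
One way to pick the split point is sketched below. It is only an illustration, not something tesseract provides: the function names `find_gutter` and `column_boxes` are made up for this example, and it assumes the page has already been loaded as a grayscale grid (a list of rows of 0-255 values, 0 = ink, 255 = paper) and is not skewed. It looks for the widest run of ink-free pixel columns that does not touch the page edge and treats its centre as the gutter between the two text columns.

```python
def find_gutter(rows, ink_threshold=128):
    """Return the x-coordinate at the centre of the widest interior run of
    ink-free pixel columns, or None if there is no such run.

    `rows` is a grayscale image as a list of equal-length rows of 0-255 ints.
    Both the representation and the threshold are assumptions of this sketch.
    """
    width = len(rows[0])
    # A pixel column is blank when none of its pixels is darker than the threshold.
    blank = [all(row[x] >= ink_threshold for row in rows) for x in range(width)]

    best_start, best_len = None, 0
    run_start = None
    for x in range(width + 1):
        if x < width and blank[x]:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            # Skip runs that touch the left or right edge: those are page margins.
            if run_start > 0 and x < width and x - run_start > best_len:
                best_start, best_len = run_start, x - run_start
            run_start = None

    if best_start is None:
        return None
    return best_start + best_len // 2


def column_boxes(rows, ink_threshold=128):
    """Return two (left, upper, right, lower) crop boxes, one per text column."""
    height, width = len(rows), len(rows[0])
    split = find_gutter(rows, ink_threshold)
    if split is None:
        split = width // 2  # fall back to the page midpoint
    return (0, 0, split, height), (split, 0, width, height)
```

The boxes use the (left, upper, right, lower) order that Pillow's Image.crop accepts, so each one could be cropped out and saved as its own image. Each half could then be run through tesseract separately, possibly with a single-column page segmentation mode such as --psm 4 or --psm 6 instead of --psm 1; whether that recovers the missing list numbers would need to be checked on the actual inputs.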

tesseract-ocr / tesseract