find out text regions - Githubissues

jbarth-ubhd / fix-perspective

MIT License

find out text regions #3

Open jbarth-ubhd opened 2 years ago

jbarth-ubhd commented 2 years ago

OpenCV has detectTextSWT (http://www.math.tau.ac.il/~turkel/imagepapers/text_detection.pdf); perhaps I could limit the "optimal angle search" to the text area.

jbarth-ubhd commented 2 years ago

detectTextSWT is a bit slow. Perhaps blurring the image horizontally and then averaging that image vertically could give a hint as to where the main text is, but in the case of a skewed crop near the text (as in https://user-images.githubusercontent.com/38561704/163437862-2691812d-611d-4824-8072-83104f5a2ef9.png on the left side) that wouldn't help.
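A minimal sketch of that blur-and-average idea, written in plain C++ without OpenCV so it stands alone. It assumes that "averaging vertically" means collapsing each column to its mean, and that text is bright on a dark background. The `Image` alias and the function names `blurHorizontally`, `columnMeans` and `textExtent` are hypothetical and not part of fix-perspective.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Grayscale image as rows of intensities (hypothetical helper type).
using Image = std::vector<std::vector<double>>;

// Box-blur every row with a window of up to 2*radius+1 pixels.
// The window is truncated at the image borders.
Image blurHorizontally(const Image& img, std::size_t radius) {
    Image out = img;
    for (std::size_t y = 0; y < img.size(); ++y) {
        const std::size_t w = img[y].size();
        for (std::size_t x = 0; x < w; ++x) {
            const std::size_t lo = x >= radius ? x - radius : 0;
            const std::size_t hi = std::min(w - 1, x + radius);
            double sum = 0.0;
            for (std::size_t i = lo; i <= hi; ++i) sum += img[y][i];
            out[y][x] = sum / static_cast<double>(hi - lo + 1);
        }
    }
    return out;
}

// Average each column over all rows: one value per x position.
std::vector<double> columnMeans(const Image& img) {
    if (img.empty()) return {};
    std::vector<double> means(img[0].size(), 0.0);
    for (const auto& row : img) {
        assert(row.size() == means.size());  // all rows must have equal width
        for (std::size_t x = 0; x < means.size(); ++x) means[x] += row[x];
    }
    for (double& m : means) m /= static_cast<double>(img.size());
    return means;
}

// First and last column whose mean exceeds `threshold`; {-1, -1} if none does.
std::pair<int, int> textExtent(const std::vector<double>& means, double threshold) {
    int first = -1, last = -1;
    for (std::size_t x = 0; x < means.size(); ++x) {
        if (means[x] > threshold) {
            if (first < 0) first = static_cast<int>(x);
            last = static_cast<int>(x);
        }
    }
    return {first, last};
}
```

Calling `textExtent(columnMeans(blurHorizontally(img, r)), t)` gives a rough left/right bound for the text block. A skewed dark or bright crop edge raises the column means along its whole slanted length, which is why this profile alone cannot separate it from nearby text.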

jbarth-ubhd commented 2 years ago

With these few lines I've gotten closer to detecting where the text is:

    Mat wim = white_on_black;  // size(im) == size(wim)
    Mat findtext;              // declared here so the snippet is self-contained

    // Smear glyphs horizontally into bars.
    // "ern" (12 pt, font Garamond No 8) = 68 px @ 300 dpi,
    // but the image is read at half size (imread "img size / 2"), hence 34.
    blur(wim, findtext, Size(34, 1));

    // Stretch vertically by 2: medianBlur has kernel size (x, x),
    // but I want size (x, x/2).
    resize(findtext, findtext, Size(im.cols, im.rows * 2));

    threshold(findtext, findtext, 0, 255, THRESH_OTSU);

    medianBlur(findtext, findtext, 5);
    // Shrink back for debug output; INTER_AREA generally looks best here.
    if (debug) resize(findtext, findtext, im.size(), 0, 0, INTER_AREA);

→ grafik

I would still have to check typical line heights by scanning columns.
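One possible way to do that column scan, as a sketch only: walk down selected columns of a binary text mask, record the length of every vertical run of text pixels, and take the median as the typical line height. It assumes a mask in which nonzero means "text"; the `Mask` alias and the names `runHeightsInColumn` and `typicalLineHeight` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binary mask, nonzero = text (hypothetical helper type).
using Mask = std::vector<std::vector<std::uint8_t>>;

// Lengths of the vertical runs of nonzero pixels in column x, top to bottom.
std::vector<std::size_t> runHeightsInColumn(const Mask& mask, std::size_t x) {
    std::vector<std::size_t> heights;
    std::size_t run = 0;
    for (const auto& row : mask) {
        if (row[x] != 0) {
            ++run;
        } else if (run > 0) {
            heights.push_back(run);
            run = 0;
        }
    }
    if (run > 0) heights.push_back(run);  // run that touches the bottom edge
    return heights;
}

// Median run height over every `step`-th column (upper median for an even
// number of runs). Returns 0 if the mask is empty or contains no runs.
std::size_t typicalLineHeight(const Mask& mask, std::size_t step) {
    if (mask.empty() || step == 0) return 0;
    std::vector<std::size_t> all;
    for (std::size_t x = 0; x < mask[0].size(); x += step) {
        const std::vector<std::size_t> h = runHeightsInColumn(mask, x);
        all.insert(all.end(), h.begin(), h.end());
    }
    if (all.empty()) return 0;
    auto mid = all.begin() + static_cast<std::ptrdiff_t>(all.size() / 2);
    std::nth_element(all.begin(), mid, all.end());
    return *mid;
}
```

The median is used instead of the mean so that a few specks or merged lines do not dominate the estimate. Runs far from the median could then be treated as non-text.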

bertsky commented 2 years ago

Good idea, but mind that this path will lead to assumptions about the typesetting / page layout of the input. I have seen the same in https://github.com/mzucker/page_dewarp. This usually does not work with pages that have lots of tables or images in them, or text scattered across the page in single lines. Also, you might have to depend on a good binarization for the OpenCV function to work properly (think heavy shine-through or bleeding or drying-out).

jbarth-ubhd commented 5 months ago

detectTextSWT has limitations. Negative examples:

left words in block not recognized (not only the bold ones): drouot1943_01_18_-_05

only very small "dust" regions recognized: franz1759b_-_g

poetting1674_-_0018

large horizontal+vertical lines not recognized: mess_marckt_helffer1738_-_0187