Strategies for getting accurate checkboxes on documents with Serif Font

majonathany commented 3 years ago

Hi,

I am actually using this as a part of a OCR pipeline I am building in a commercial product. First I want to say thank you for building something that really works and is open source. AWS Textract does checkboxes at 6 cents a page, which is too expensive to be used at the load the project is going for. So thank you for this amazing library!

I was wondering if you had any insight about how to get the right parameters for an image of grainy quality. I have written an OpenCV pipeline that takes PDFs and splits them into images, then rotates them using a deskewing library, applies a crop, and then produces a cropped correctly rotated color image and a correctly rotated BW thresholded image. I am trying to run boxdetect on both color and thresholded images, and facing a few challenges.

I was wondering if you had some general tips on how to ascertain checkboxes only - I am picking up zeros, lowercase N's (especially with Serif fonts), and other things. I rarely get 4 checkboxes which is what there is in the sample image, I sometimes get 3, or 5, or 8, etc. I also confess I don't use the True/False too often, but I love the percentage feature and the cropped matrix of the box as I personally find it very accurate (I notice checked boxes typically are 55% as opposed to 25% black)

Since I have tally up the number of boxes I find and match them to a unfilled reference document and find the checkbox based on the percentage of the region, it is most important to avoid missing true positives, but it would also be nice to not have as many false positives.

Here are some parameters I am using, and here is a reference image (3 checkboxes are found except the one that says en rampant de toitures). In other images, n's and E's are picked:

rotated = { "w": (25, 50), "h": (25, 30), "wh": (0.85, 1.15), "scale": [1.2, 1.0, .9, .7, ], "group": (2, 100), "iterations": 5, } thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 129, 27)

The post-rotated images come in a variety of sizes of 11 by 8.5 sheets with about +/-200 pixels of white padding, so it is difficult for me to have a width or height range, but wh ratio is generally easy, and I could stretch them to a defined width and height. I have actually found the px_threshold (I sometimes use 0.1, or 0.3) to be very helpful. How is that used exactly? Also, what do you think the recommended kernel is for a 3600 by 2600 image? Any help would be appreciated!

majonathany commented 3 years ago

I realize this could be the adaptive threshold that is the problem, not the parameters in this boxdetect library. I didn't understand what the block size was or the value of C, but I think I will experiment with those as well, because the original image is not grainy like this at all.

karolzak commented 3 years ago

Hi @majonathany As for the varying input size of documents I would try to normalize them somehow. If you have white empty borders you can try to use some preprocessing to crop out the actual document from the image and resize it to some uniform size that you will decide to use.

I've spent couple minutes and tried to run BoxDetect over the image you attached. This is the result: And here's the code I used:

import cv2
import matplotlib.pyplot as plt
%matplotlib inline

file_path = '104635593-1a26ac00-5670-11eb-8ac4-52912968b355.png'

input_image = cv2.imread(file_path)

from boxdetect import config

cfg = config.PipelinesConfig()

# important to adjust these values to match the size of boxes on your image
cfg.width_range = (26, 40)
cfg.height_range = (26, 40)

# the more scaling factors the more accurate the results but also it takes more time to processing
# too small scaling factor may cause false positives
# too big scaling factor will take a lot of processing time
cfg.scaling_factors = [0.5, 0.8]

# w/h ratio range for boxes/rectangles filtering
cfg.wh_ratio_range = (0.90, 1.1)

# range of groups sizes to be returned
cfg.group_size_range = (1, 1)

# for this image we will use rectangles as a kernel for morphological transformations
cfg.morph_kernels_type = 'lines'  # 'lines'

# num of iterations when running dilation tranformation (to engance the image)
cfg.dilation_iterations = 1
cfg.dilation_kernel = (2,2)

from boxdetect.pipelines import get_checkboxes

checkboxes = get_checkboxes(
    input_image, cfg=cfg, px_threshold=0.1, plot=False, verbose=True)

from boxdetect.img_proc import draw_rects, get_image
import matplotlib.pyplot as plt
%matplotlib inline

out_img = draw_rects(get_image(file_path), checkboxes[::,0], thickness=3)

plt.figure(figsize=(15,20))
plt.imshow(out_img)
plt.show()

As you may have noticed I was only able to detect 3 out of 4 boxes. After taking a closer look at the image I realized it's because that one ticked checkbox is missing some pixels and thus it's not recognized (even after boxdetect does it's enhancements).

As for detecting multiple boxes with varying sizes you can check out autoconfig functionality: https://github.com/karolzak/boxdetect/blob/master/notebooks/get-started-autoconfig.ipynb

majonathany commented 3 years ago

Ah I see, thank you for the responsiveness and the help, I am going to try out these parameters. I realized that the part that seems to have a large effect is both the blurring stage and the adaptive threshold. I notice with Serif fonts they tend to carry more thickness horizontally than vertically, so I am going to try messing with that in OpenCV and see if I can make the text clearer, and I might also in this project (not in this repo), alter the functions slightly somehow to account for these discrepancies. Checkboxes are really the main thing that I seem to be stuck on, in terms of reading and analyzing documents and forms. I didn't state this but I do have access to the original images which are significantly higher quality in color, but I might not be able to achieve with the thresholding unless I implement a special algorithm. By the way, thank you for your help, I will close this issue.

karolzak / boxdetect

Strategies for getting accurate checkboxes on documents with Serif Font #13