jlsutherland / doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.
MIT License

Get a homogeneous background for better thresholding results #13

Open remi-pr opened 8 years ago

remi-pr commented 8 years ago

The idea is to get a flat, white background before thresholding. To do this, a matrix of offsets is added to the image. This matrix is optimized in order to

The quality of the optimization is not critical: with the default parameters, it does not need to converge to give good results. This can certainly be tuned further.

This is done on a down-scaled version of the image, for speed. The matrix is then up-scaled and applied to the original image. It has only been tested on a few images, where it showed promising results.
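To make the down-scale / up-scale flow above concrete, here is a minimal NumPy sketch. It is not the code from this PR: the helper name `flatten_background`, the `block` and `percentile` parameters, and the per-tile percentile estimate are all assumptions made for illustration, and the percentile estimate stands in for the optimization step described above. Only the overall shape is the same: compute a coarse offset matrix, up-scale it, and add it to the full-resolution image.

```python
import numpy as np


def flatten_background(gray, block=32, percentile=90):
    """Lift the page background of a grayscale image towards white.

    Hypothetical helper for illustration only. The offsets come from a
    simple per-tile percentile, NOT from the optimizer used in the PR.
    Returns the corrected uint8 image and the coarse offset matrix.
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape

    # Pad with edge values so both dimensions are multiples of `block`.
    pad_h = (-h) % block
    pad_w = (-w) % block
    padded = np.pad(gray, ((0, pad_h), (0, pad_w)), mode="edge")
    ph, pw = padded.shape

    # Down-scaled view: one cell per block x block tile.
    tiles = padded.reshape(ph // block, block, pw // block, block)

    # A high percentile per tile approximates the paper brightness
    # while ignoring the (darker) text pixels.
    background = np.percentile(tiles, percentile, axis=(1, 3))

    # Coarse matrix of offsets that lifts each tile's background to 255.
    offsets_small = 255.0 - background

    # Up-scale the offsets by repeating each cell, then crop the padding.
    offsets = np.kron(offsets_small, np.ones((block, block)))[:h, :w]

    flat = np.clip(gray + offsets, 0, 255).astype(np.uint8)
    return flat, offsets_small
```

Calling `flatten_background(image)` on a 2-D grayscale array returns the corrected image, which could then be passed to whatever thresholding step follows. Repeating cells gives blocky offsets; a smoother interpolation would probably work better on real scans.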

The added code is almost completely separate from the original code: three new functions in page.py and one function call before thresholding in process_skewed_crop.

jlsutherland commented 8 years ago

Thanks for this @remi-pr! Very interesting. I can't wait to try it out and merge this in.

An initial thought off the top of my head is version compatibility. We've seen a few issues with OpenCV versioning, e.g. #6. We want the package to work with both OpenCV v2 and v3.

What version of OpenCV are you working with?

remi-pr commented 8 years ago

That is a very good point @jlsutherland.

I am indeed working with a fairly old version of OpenCV (2.4.6.1), which is why I added the named argument to your cv2.resize call; this makes it compatible with both the 2.x and 3.x versions.

The only cv2 call I use in the added code is cv2.Laplacian, which has not changed in quite a while. I checked against OpenCV 3.0.0: the signature is the same and cv2.CV_64F also exists.

jlsutherland commented 8 years ago

Awesome. Thank you for looking into this. Testing in a few minutes.