ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Noise removal for grayscale images #289

Open zuphilip opened 6 years ago

zuphilip commented 6 years ago

Page segmentation includes a noise-removal step, but it is skipped for the grayscale line images (option --gray):

https://github.com/tmbdev/ocropy/blob/8cfce574dd0d3a3ad653494f604ed57d1c775241/ocropus-gpageseg#L444-L451

I think that the function remove_noise can also handle grayscale images, but it always outputs a cleaned binary image. How can we use that to apply the same cleaning to the grayscale image?
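One possible way to carry a binary cleaning result over to the grayscale image is sketched below. This is not code from ocropy: the helper name apply_cleaning_to_gray, the fixed threshold of 0.5, the background value of 0.0, and the convention that ink is bright on a dark background are all assumptions made for illustration. The idea is to find the pixels that counted as ink before cleaning but not after, and to paint only those with the background value.

```python
import numpy as np

def apply_cleaning_to_gray(gray, cleaned_binary, threshold=0.5, background=0.0):
    """Hypothetical helper: blank out, in the grayscale image, the pixels
    that a binary noise-removal step discarded.

    Assumes `gray` is scaled to [0, 1] with bright ink on a dark background,
    and that `cleaned_binary` was derived from `gray > threshold`.
    """
    gray = np.asarray(gray, dtype=float)
    raw_binary = gray > threshold
    # Noise = ink before cleaning that is no longer ink afterwards.
    noise = raw_binary & ~np.asarray(cleaned_binary, dtype=bool)
    result = gray.copy()
    result[noise] = background
    return result

# Tiny example: one real blob, one speck, one faint stroke.
gray = np.zeros((10, 10))
gray[1:5, 1:5] = 0.9   # blob that the cleaning step kept
gray[7, 7] = 0.8       # speck that the cleaning step removed
gray[8, 1:3] = 0.4     # faint ink below the threshold
cleaned_binary = np.zeros((10, 10), dtype=bool)
cleaned_binary[1:5, 1:5] = True
result = apply_cleaning_to_gray(gray, cleaned_binary)
```

Pixels below the threshold are left untouched, so faint ink survives in the grayscale output even though it is absent from the binary image.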

mittagessen commented 6 years ago

remove_noise handles grayscale images by binarizing them at 0.5 and then removing every connected component smaller than 8 pixels. An unevenly lit image, or even just lightly colored print, will be unusable after that process.
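A minimal sketch of the procedure described above, written with scipy.ndimage rather than copied from ocropy. The function name remove_small_components is hypothetical, and it assumes an image scaled to [0, 1] with bright ink; the actual ocropy code may differ in details such as how the threshold is scaled.

```python
import numpy as np
from scipy import ndimage

def remove_small_components(gray, threshold=0.5, minsize=8):
    """Binarize at `threshold`, then drop every connected component
    with fewer than `minsize` pixels. Returns a boolean image."""
    binary = np.asarray(gray) > threshold
    labels, n = ndimage.label(binary)
    # sizes[i] is the pixel count of component i; index 0 is the background.
    sizes = np.bincount(labels.ravel(), minlength=n + 1)
    keep = sizes >= minsize
    keep[0] = False
    return keep[labels]

img = np.zeros((10, 10))
img[1:5, 1:5] = 0.9   # 16-pixel blob: kept
img[7, 7] = 0.8       # single-pixel speck: removed
img[8, 1:3] = 0.4     # faint ink: already lost at the binarization step
cleaned = remove_small_components(img)
```

The faint stroke in the example never reaches the component filter, which illustrates why lightly colored print does not survive a fixed threshold.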

A short literature review turns up a large number of grayscale despeckling algorithms (mainly for ultrasound and SAR imagery) that might be more useful, albeit probably computationally expensive. Also, speckling seems to be mostly a binarization artifact, so I'm unsure whether cleaning grayscale images will improve accuracy.