Improve mask and background generation

internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.

https://archive-pdf-tools.readthedocs.io/en/latest/

GNU Affero General Public License v3.0


Improve mask and background generation #8

Open MerlijnWajer opened 2 years ago

MerlijnWajer commented 2 years ago

There are a few things to improve in the mask generation:

The Sauvola binarisation currently uses fixed parameters, which is not ideal. We probably want to make some of those parameters dependent on the image DPI, and change the k value to 0.34 as default.
We could look into better binarisation algorithms like multi-scale sauvola as mentioned here: https://github.com/tesseract-ocr/tesseract/issues/3083#issuecomment-916280410

The same applies to the hocr-specific mask generation.
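A minimal, dependency-free sketch of what DPI-dependent Sauvola parameters could look like. This is not the project's implementation: the helper names, the base window of 51 pixels at 300 DPI and the linear scaling rule are assumptions for illustration, and only k = 0.34 comes from the proposal above. The threshold uses the standard Sauvola formula T = m * (1 + k * (s / R - 1)) with R = 128 for 8-bit images.

```python
import math

def window_for_dpi(dpi, base_window=51, base_dpi=300):
    # Hypothetical heuristic: scale the window linearly with DPI, keep it odd.
    w = max(3, round(base_window * dpi / base_dpi))
    return w if w % 2 == 1 else w + 1

def integral(img):
    # Summed-area table with an extra zero row and column.
    h, w = len(img), len(img[0])
    table = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table

def sauvola_mask(gray, window=51, k=0.34, r_max=128.0):
    # gray: 2D list of 8-bit intensities; returns True where a pixel is text.
    h, w = len(gray), len(gray[0])
    half = window // 2
    sums = integral(gray)
    squares = integral([[v * v for v in row] for row in gray])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            # Windows are clipped at the image border rather than padded.
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            n = (y1 - y0) * (x1 - x0)
            s = sums[y1][x1] - sums[y0][x1] - sums[y1][x0] + sums[y0][x0]
            sq = (squares[y1][x1] - squares[y0][x1]
                  - squares[y1][x0] + squares[y0][x0])
            mean = s / n
            std = math.sqrt(max(sq / n - mean * mean, 0.0))
            threshold = mean * (1.0 + k * (std / r_max - 1.0))
            mask[y][x] = gray[y][x] < threshold
    return mask
```

Usage would be something like `sauvola_mask(page, window=window_for_dpi(dpi))`. A production version would use vectorised or compiled code; the pure-Python loops here are only meant to make the arithmetic explicit.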

MerlijnWajer commented 2 years ago

This would probably also help with the text backdrop/shade on the background generation - if we improve the mask generation, that should probably start working better as well.

MerlijnWajer commented 2 years ago

In particular, the scribo implementation(s) could be helpful: https://github.com/OCR-D/olena/tree/master/scribo/scribo

ocrd/olena:latest contains scribo-cli but also its OCR-D wrapper ocrd-olena-binarize (which uses bash and xmlstarlet for all the METS/PAGE-XML interfacing) and is ~300MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/Dockerfile)
ocrd/olena:build-olena contains scribo-cli only and is ~100MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/build-olena.dockerfile)

MerlijnWajer commented 2 years ago

I've been working on implementing support for the scribo (https://github.com/OCR-D/olena/tree/master/scribo/scribo) binarisation methods, and in particular looked at the intersection of singh and wolf/sauvola_ms, since singh doesn't seem to produce letters as fat as most other algorithms do. With that in place, the foreground text is definitely better colour-wise, and also sharper, but the background has more artifacts, since the borders of the text are not removed from the background.

This makes me wonder if it makes sense to introduce some third layer (not in the final result), which contains the text borders and other pixels that are (for example) not in singh but are in wolf. We would then ultimately place those pixels in the foreground image, but not use them when creating the 'smoothed' foreground image. But we would use them (the 'second' mask) when smoothing out the background.
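A rough sketch of the layer idea, under one possible reading of the description above. The function name `split_layers` and the exact set operations are assumptions, not the code that was eventually pushed. Masks are 2D lists of booleans, one from a strict binarisation (e.g. singh) and one from a looser one (e.g. wolf).

```python
def split_layers(strict, loose):
    # core: pixels both binarisations agree on; used for the smoothed foreground.
    core = [[s and l for s, l in zip(srow, lrow)]
            for srow, lrow in zip(strict, loose)]
    # border: pixels only the looser binarisation marks (text edges, backdrop).
    border = [[l and not s for s, l in zip(srow, lrow)]
              for srow, lrow in zip(strict, loose)]
    # exclude: the 'second' mask, everything to ignore when smoothing the background.
    exclude = [[c or b for c, b in zip(crow, brow)]
               for crow, brow in zip(core, border)]
    return core, border, exclude
```

With this split, the border pixels still end up in the foreground image, but only `core` drives the foreground smoothing, while `exclude` keeps text edges out of the background.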

MerlijnWajer commented 2 years ago

[image: diff]

After messing around a bit and adding the 'extra' temporary layer, I think the background generation has gotten quite a bit better. The mask is created using Singh's algorithm, with any other binarisation mixed in to filter out some noise, and then the hOCR layer is mixed in with Singh to find the borders/backdrop of text. Will push code in the next few days.

(E: Left is new, right is old)

MerlijnWajer commented 2 years ago

[image: diff2]

Another example with a large newspaper, left is old, right is new.

MerlijnWajer commented 2 years ago

We could/should also consider swapping out the sauvola algorithm for the sauvola_ms algorithm with the right window size -- that might further improve quality and compress masks.

MerlijnWajer commented 2 years ago

I have improved the background generation significantly, in a simpler manner, in this commit: 3cbcc90fb27eba5b8acd13338089a79bc5a835bd

It doesn't also make the images sharper, but arguably that shouldn't happen if they weren't sharp to begin with. Leaving this open for now, but most of the "shade backdrop" problems are gone.

rmast commented 2 years ago

Have you seen the OCRD-project that contains lots of binarizations? Ocr4all tries to use it in an upcoming edition.

Gamera 4 also provides some binarization algorithms, for example an incomplete DjVu binarization that leaves out the patented part (the software patent expires in two months) that helps invert white-on-black parts in the binarized image.

MerlijnWajer commented 2 years ago

Yes, I'm in pretty good contact with the OCR-D people, and the branch linked here uses various algorithms from the OCR-D folks. I ended up going with just Sauvola recently because my implementation is that much faster than basically anything in the scribo toolbox, and performance matters for archive.org since we deal with millions of pages a day.

I'm happy to add support for alternative binarisation methods, btw. What I have found, however, is that it's really hard to find a 'one size fits all' approach, and the current code (1.4.9) is on par with or often even better than the commercial Foxit compression.