internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Use JBIG2 compression to determine if we want to blur or denoise before thresholding #13

Status: Closed (MerlijnWajer closed 2 years ago)

MerlijnWajer commented 3 years ago

We can threshold the original image and optimistically run the JBIG2 conversion. Only when the JBIG2 output doesn't compress well do we blur the image and re-threshold, and/or denoise the thresholded result (the mask).

JBIG2 compression is fast, and our current noise estimation is not. Since our JBIG2 is lossless, good compression suggests that the image is not noisy.

This will help speed up PDF generation, since the Gaussian noise estimation is currently the most CPU-intensive part, which is kind of silly.
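A rough sketch of the control flow proposed above, to make the idea concrete. Everything in it is an assumption for illustration and not archive-pdf-tools code: `zlib` stands in for the lossless JBIG2 encoder, a fixed-level `threshold` stands in for the real thresholding, a 3x3 `box_blur` stands in for the Gaussian blur, and the `max_ratio=0.25` cutoff is arbitrary and would need tuning on real scans.

```python
import zlib
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale rows, values 0..255
Mask = List[List[int]]   # binary rows, values 0 or 1


def threshold(img: Image, level: int = 128) -> Mask:
    """Fixed-level threshold (hypothetical); 1 marks a dark pixel."""
    return [[1 if px < level else 0 for px in row] for row in img]


def box_blur(img: Image) -> Image:
    """3x3 mean filter with edge clamping, a cheap stand-in for a Gaussian blur."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total // 9)
        out.append(row)
    return out


def pack_bits(mask: Mask) -> bytes:
    """Pack each mask row into bytes, 8 pixels per byte, MSB first."""
    out = bytearray()
    for row in mask:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            out.append(byte << (8 - len(chunk)))  # pad a short final chunk
    return bytes(out)


def encode_lossless(mask: Mask) -> bytes:
    """Stand-in for the lossless JBIG2 encoder (assumption: zlib, not JBIG2)."""
    return zlib.compress(pack_bits(mask), 6)


def make_mask(
    img: Image,
    encode: Callable[[Mask], bytes] = encode_lossless,
    max_ratio: float = 0.25,  # arbitrary cutoff, would need tuning
) -> Tuple[Mask, bytes, bool]:
    """Threshold optimistically; blur and re-threshold only on poor compression.

    Returns (mask, encoded bytes, whether the fallback path was taken).
    """
    mask = threshold(img)
    encoded = encode(mask)
    if len(encoded) <= max_ratio * len(pack_bits(mask)):
        return mask, encoded, False  # compressed well: assume not noisy
    mask = threshold(box_blur(img))  # fallback: blur, then re-threshold
    return mask, encode(mask), True
```

The point of the design is that the expensive step (blur or denoise) only runs for pages whose first encode came out large, and for clean pages the first encode is the one that gets used, so nothing is wasted.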

MerlijnWajer commented 3 years ago

For the hOCR words binarisation, we could perhaps also attempt some other (fast) compression (maybe zlib) instead of the noise estimation (we'll have to trim the headers).
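A minimal sketch of what a zlib-based check for a single word mask might look like. It assumes that "trim the headers" refers to the fixed framing zlib adds (a 2-byte header and a 4-byte Adler-32 trailer), which matters for tiny word crops; that reading is a guess. The function names and the `max_ratio=0.5` cutoff are hypothetical. Passing `wbits=-15` to `zlib.compressobj` produces a raw deflate stream without that framing.

```python
import zlib


def raw_deflate_size(data: bytes, level: int = 6) -> int:
    """Size of `data` as a raw deflate stream (no zlib header or trailer)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # negative wbits: raw deflate
    return len(comp.compress(data)) + len(comp.flush())


def looks_noisy(packed_mask: bytes, max_ratio: float = 0.5) -> bool:
    """Treat a packed 1-bit word mask as noisy if it compresses poorly.

    `max_ratio` is an arbitrary cutoff chosen for illustration.
    """
    if not packed_mask:
        return False
    return raw_deflate_size(packed_mask) > max_ratio * len(packed_mask)
```

For a word crop of a few dozen bytes, the six bytes of zlib framing can be a large share of the compressed size, so measuring the raw deflate stream keeps the ratio comparable across small and large words.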

MerlijnWajer commented 2 years ago

Not needed, we have https://github.com/internetarchive/archive-pdf-tools/issues/24