Closed MerlijnWajer closed 2 years ago
For binarising the hOCR words, we could perhaps also try some other fast compression (maybe zlib) instead of the noise estimation (we'd have to trim the headers).
Not needed, we have https://github.com/internetarchive/archive-pdf-tools/issues/24
We can threshold the original image and optimistically do the JBIG2 conversion. Only when the JBIG2 output doesn't compress well do we blur the image and re-threshold, and/or denoise the thresholded result (the mask).
JBIG2 compression is fast, and our current noise estimation is not. Since our JBIG2 is lossless, good compression suggests that the image is not noisy.
This will help us speed up PDF generation, since the Gaussian noise estimation is currently the most CPU-intensive part, which is kind of silly.
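A rough sketch of the control flow described above, to make the idea concrete. It is not the archive-pdf-tools code: `zlib` stands in for the JBIG2 encoder (only because it ships with Python), the function names are hypothetical, the `max_bytes_per_pixel` cutoff is an arbitrary illustrative value, and the fallback shown is just one simple denoise option (dropping isolated pixels) rather than the blur and re-threshold path.

```python
import zlib


def threshold(gray, cutoff=128):
    """Binarise a grayscale image (rows of 0..255 ints); 1 marks dark pixels."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def compressed_size(mask):
    """Bytes needed to store the mask losslessly (zlib is a stand-in for JBIG2)."""
    flat = bytes(px for row in mask for px in row)
    return len(zlib.compress(flat, 1))


def remove_isolated(mask):
    """Cheap denoise: clear foreground pixels with no 4-connected foreground neighbour."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if not any(0 <= ny < h and 0 <= nx < w and mask[ny][nx]
                       for ny, nx in neighbours):
                out[y][x] = 0
    return out


def make_mask(gray, max_bytes_per_pixel=0.05):
    """Return (mask, used_fallback).

    Optimistic path: threshold and compress. The denoise step only runs
    when the lossless output is larger than the (assumed) cutoff.
    """
    mask = threshold(gray)
    pixels = len(mask) * len(mask[0])
    if compressed_size(mask) / pixels <= max_bytes_per_pixel:
        return mask, False  # compresses well, so assume it is not noisy
    return remove_isolated(mask), True
```

With something like this, clean pages only pay for the threshold and one compression pass, and the more expensive cleanup is reserved for pages whose mask compresses badly. The cutoff would need tuning against real scans and the real JBIG2 encoder.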