ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Introduce a way to radically reduce the output file size (sacrificing image quality) #541

Closed heinrich-ulbricht closed 11 months ago

heinrich-ulbricht commented 4 years ago

Is your feature request related to a problem? Please describe. My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR should be done beforehand on the high-quality images.

I describe this in more detail here: https://github.com/jbarlow83/OCRmyPDF/issues/443#issuecomment-618589203

Furthermore, I see a discussion covering a similar topic in #293.

Describe the solution you'd like I want greater control over the quality of the images embedded in the PDF (after doing OCR). I can imagine several possible solutions, each of which would be a complete solution on its own.

Additional context I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:

  1. let OCRmyPDF do its thing on high-quality images/PDFs; post-process manually with a Python script that uses pikepdf to replace the high-quality images with low-quality ones in the PDF (I have a working PoC, but it's not pretty; a rough sketch follows this list)
  2. modify OCRmyPDF
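
For reference, a minimal sketch of what the first approach could look like with pikepdf. This is not the actual PoC; the file names, the quality setting, and the grayscale conversion are illustrative assumptions:

from io import BytesIO
import pikepdf
from pikepdf import Name, PdfImage

# Recompress every embedded image as a low-quality grayscale JPEG.
with pikepdf.open("ocr_output.pdf") as pdf:
    for page in pdf.pages:
        for name in page.images:
            raw_image = page.images[name]
            pil = PdfImage(raw_image).as_pil_image()  # decode embedded image
            buf = BytesIO()
            pil.convert("L").save(buf, format="JPEG", quality=20)
            raw_image.write(buf.getvalue(), filter=Name.DCTDecode)
            raw_image.ColorSpace = Name.DeviceGray
            raw_image.BitsPerComponent = 8
    pdf.save("ocr_output_small.pdf")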

I'm not sure about the second approach - where would be a good point to start? One approach could be:

  1. using PNG images in the input PDF file, then
  2. forcing pngquant to convert them to 1 bpp (here?)
  3. this could trigger PNG rewriting as G4 (here)

@jbarlow83 Does this sound right?

jbarlow83 commented 4 years ago

I would go with modifying ocrmypdf, and:

  1. Always input JPG
  2. Replace pngquant.quantize with code that always converts the image to 1 bpp (e.g. just use Pillow; see the sketch after this list).
  3. You will actually want to install jbig2enc. JBIG2 outperforms G4 in size and is still widely supported. 1bpp PNGs will always be converted to JBIG2 when a jbig2 encoder is available. You might even want JBIG2 in lossy mode, provided the dangers of lossy mode are acceptable to you (see documentation and the "6-8" problem).
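
A minimal sketch of step 2, assuming a fixed threshold and plain Pillow (file names are placeholders; the real quantize code in ocrmypdf is more involved):

from PIL import Image

im = Image.open("page_image.png").convert("L")    # 8-bit grayscale
bw = im.point(lambda px: 255 if px > 128 else 0)  # fixed-threshold binarize
bw = bw.convert("1")                              # 1 bit per pixel
bw.save("page_image_1bpp.png", optimize=True)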

Instead of forcing PNG input, you could also uncomment optimize.py:523, "try pngifying the jpegs", which, as the name suggests, speculatively converts JPEGs to PNGs. I believe this had a few corner cases to be worked out and is too costly in performance in the typical case, but you could try it, especially if you are forcing everything to JBIG2 anyway.

heinrich-ulbricht commented 4 years ago

I'm giving it a try and am having some success.

@jbarlow83 A question: This return doesn't look right since it leaves the function after handling only one image. Is this ok?

https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L426

For me this leads to only one of multiple images being handled in a multi-page PDF where each page contains one image (since the loop cannot finish).

And one (related?) curiosity: I managed to modify the conversion pipeline such that I now have multiple 1 bpp PNGs waiting in the temp folder to be handled. If there is only one such PNG, the resulting PDF looks fine; if there are multiple such images, the resulting PDF is distorted. Looking at the images in the temp folder, I see the intermediate TIFs.

Then the code converts those TIFs to JBIG2 file(s) by invoking the jbig2 tool. This seems to be erroneous if there are multiple TIFs (leading to distortions in the final PDF); it works for one TIF, though. So the question is: do you have a test in place checking that PDFs with multiple 1 bpp images can be correctly converted to the JBIG2 format? Or could this be a bug?

Note: I suspect that the above-mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF, since the loop always terminates after generating one TIF.

(But this might also be me not understanding how the final JBIG2 handling works. I might have broken something with my modifications.)

Edit: the debug output shows me this command line that is being used by OCRmyPDF:

  DEBUG - Running: ['jbig2', '-b', 'group00000000', '-s', '-p', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000032.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000028.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000030.tif']

The TIF files look good.

heinrich-ulbricht commented 4 years ago

I found the reason why my PDF containing the 1 bpp JBIG2 images was distorted: the color space of the embedded images was not correct. It was still /DeviceRGB, but it should be /DeviceGray.

I was able to quick-fix this by inserting im_obj.ColorSpace = Name("/DeviceGray") right before this line: https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L340. The PDF now looks good.

Hypothesis: it was never intended to change the color space during image optimization?

Edit: Suggested fix:

if Name.BitsPerComponent in im_obj and im_obj.BitsPerComponent == 1:
    log.debug("Setting ColorSpace to /DeviceGray")
    im_obj.ColorSpace = Name("/DeviceGray")

Edit2: Better fix? Add im_obj.ColorSpace = Name("/DeviceGray") here: https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L430

heinrich-ulbricht commented 4 years ago

I implemented and pushed a solution that works for me and is basically a shortcut to TIF generation (see the commit linked above). I added a new user-script option that can be used to run arbitrary shell commands on images. This user script takes the source and destination file paths as input parameters and must convert the source image to a 1 bpp TIF.

The shell script that works for me looks like this:

#!/bin/sh
convert -colorspace gray -fill white -sharpen 0x2 "$1" - | jpegtopnm | pamthreshold | pamtotiff -g4 > "$2"

This requires ImageMagick and netpbm-progs to be installed, but one could use other conversion tools here as well. pamthreshold implements a nice dynamic threshold.

The command that I used to test looks like this:

ocrmypdf --user-script-jpg-to-1bpp-tif shell.sh --jbig2-lossy -v 1 -O3 in.pdf out.pdf

I'm not opening a pull request since the solution is very specific to my use case, and right now it only handles JPEG images. But maybe somebody finds this useful as a starting point.

jbarlow83 commented 4 years ago

I suspect that the above-mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF, since the loop always terminates after generating one TIF.

You are correct, those returns are wrong and will suppress multiple images per file. That's a great catch.

Hypothesis: it was never intended to change the color space during image optimization?

Also correct. /DeviceGray is not correct in general, but it is probably suitable for your use case. Some files specify a complex colorspace instead of /DeviceRGB, and changing it to /DeviceGray may not be correct, so optimize tries to avoid changing colorspaces. It is also possible to specify a 1-bit color colorspace, e.g. one where 0 is blue and 1 is red.
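
To illustrate that last point, here is a hypothetical pikepdf snippet (not code from optimize.py) that builds a 1-bit /Indexed colorspace whose two entries are blue and red; an image using it renders in color even at 1 bpp, so rewriting it as /DeviceGray would be wrong:

import pikepdf
from pikepdf import Array, Name

palette = bytes([0x00, 0x00, 0xFF, 0xFF, 0x00, 0x00])  # index 0 = blue, 1 = red
indexed = Array([Name.Indexed, Name.DeviceRGB, 1, pikepdf.String(palette)])
# Assigning `indexed` as an image's /ColorSpace with /BitsPerComponent 1
# yields a blue/red image, not a black/white one.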

I'm not opening a pull request since the solution

Agreed - that's a lot of new dependencies to add.

andersjohansson commented 4 years ago

I also needed exactly this!

I tried to rebase onto master, missed some things in the required manual merges and added them afterwards, so my branch doesn't look so clean right now. But here it is: https://github.com/andersjohansson/OCRmyPDF/tree/feature/github-541-reduce-output-file-size-v10

It works fine now though! Thanks!

jbarlow83 commented 4 years ago

userscript.py could be structured as a plugin instead (a new feature in 10.x). You'd need to create a new hook as well by adding it to pluginspec.py, and then we could have a generic, pluggable interface for people who want to optimize images more aggressively.
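
A hedged sketch of that plugin idea. The hook name filter_optimize_image is invented for illustration; a real hook would first have to be declared in pluginspec.py:

from PIL import Image
from ocrmypdf import hookimpl

@hookimpl
def filter_optimize_image(input_file, output_file, options):
    # Hypothetical hook: aggressively downgrade an extracted image to a
    # 1 bpp Group 4 TIFF before it is re-embedded into the PDF.
    im = Image.open(input_file).convert("1")
    im.save(output_file, format="TIFF", compression="group4")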

andersjohansson commented 4 years ago

If @heinrich-ulbricht or anyone else is interested in looking more into this in the future, see also the comments that @jbarlow83 added here: https://github.com/andersjohansson/OCRmyPDF/commit/4e5b68f1b966312edeba8ef3b6e12037bac8aef6

rmast commented 3 years ago

What about using MRC compression to keep the file visually close to the original while drastically reducing its size, as @jbarlow83 mentioned here:

https://github.com/jbarlow83/OCRmyPDF/issues/836#issuecomment-922560147

(We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and will probably need a corporate sponsor so that I can work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.)

You could look at how the closed-source DjVuSolo 3.1 reaches astonishingly small sizes with really legible results, even keeping color in its JBIG2-like JB2 format. With DjVuToy you can convert those DjVus into PDFs that are only about twice as big.

With https://github.com/jwilk/didjvu there has been an attempt to open-source this MRC mechanism, but with some inconveniences that keep the files too big for it to be a serious candidate to replace the old DjVuSolo 3.1 in the Russian user group.

However, many DjVu patents have expired, so there might be some valuable MRC knowledge in those patents, as @jsbien suggested.

https://github.com/jwilk/didjvu/issues/18

jbarlow83 commented 3 years ago

@rmast This is interesting information and could be helpful if I ever get the opportunity to implement this. Thanks.

MerlijnWajer commented 2 years ago

(Found this through @rmast.) If you're looking for an MRC implementation, https://github.com/internetarchive/archive-pdf-tools does this when it creates PDFs with text layers (it's mostly like OCRmyPDF, but it doesn't attempt to do OCR and requires that to be done externally). The MRC code can also be used as a library, although I probably need to make the API a bit more ... accessible. @jbarlow83 - If you're interested in this I could try to make a more accessible API. Alternatively, I could look at improving the "PDF recoding" method, where the software compresses an existing PDF by replacing the images with MRC-compressed images; then one could just run recode_pdf after OCRmyPDF has done its thing.

jbarlow83 commented 2 years ago

@MerlijnWajer Thanks for the suggestion - that is impressive work. Unfortunately it's license-incompatible (AGPL) and also uses PyMuPDF as its PDF generator. I like PyMuPDF and have used it previously, but it relies on libmupdf, which is only released as a static library and doesn't promise a stable interface, meaning that Linux distributions won't include it.

But setting it up through a plugin interface, calling recode_pdf by command line, would certainly be doable.

MerlijnWajer commented 2 years ago

I'll try to implement this mode (modifying the images of a PDF without touching most other parts) in the next week or so and report back; then we could maybe look at the plugin path. (Actually, give me more like two weeks - I'll have to do some refactoring to support this recoding mode.)

jbarlow83 commented 2 years ago

It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license, we could also work it in that way.

MerlijnWajer commented 2 years ago

It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license, we could also work it in that way.

Right - I'll have to think about that (and also ask). For now I will try to get a tool for recoding an existing PDF working first, since I've been wanting to implement that for a long time anyway, and this is great motivation to do it. I'll also make the MRC API more usable (the current code is heavily optimised for performance, not for API usability), so we could revisit the potential license situation once that is done.

rmast commented 2 years ago

@blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: https://github.com/ocrmypdf/OCRmyPDF/issues/9 https://github.com/fritz-hh/OCRmyPDF/issues/88

I understand license-(in)compatibility is inhibiting progress.

I was also looking into didjvu to understand the MRC compression over there. MRC is achieved in that tool by a Gamera-based didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and background, so the license of didjvu is probably less important than the licenses of Gamera and C44.

Do you have experience keeping products with such incompatible licenses alive? Would the question be any different when trying to get GScan2PDF (GPLv3) to use MRC?

blaueente commented 2 years ago

@blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: #9 fritz-hh/OCRmyPDF#88

I understand license-(in)compatibility is inhibiting progress.

I was also looking into didjvu to understand the MRC compression over there. MRC is achieved in that tool by a Gamera-based didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and background, so the license of didjvu is probably less important than the licenses of Gamera and C44.

Didjvu itself mainly deals with organizing everything, so I guess one couldn't use code from it directly anyway. C44/IW44 is the wavelet codec used by didjvu, and it is therefore unusable for PDF MRC, since PDF has no IW44 support. The ideas of archive-pdf-tools seem pretty good to me; maybe they could learn from Gamera's separation algorithms and the ROI-style coding of IW44, although I see good discussions on their GitHub page.

Do you have experience keeping products with such incompatible licenses alive? Would the question be any different when trying to get GScan2PDF (GPLv3) to use MRC?

Regarding licenses, I can't really help you. The approach of @MerlijnWajer sounds great, though: talk about what can be shared, and what can just be reused as separate interfacing binaries.

MerlijnWajer commented 2 years ago

I was experimenting with a script a while ago but couldn't get it to fully work on oddball PDFs, and then gave up for a bit. But I think I just realised that, at least for PDFs generated by OCRmyPDF, this is a non-issue. Does anyone have some sample/test PDFs created by OCRmyPDF that I could run my script on?

MerlijnWajer commented 2 years ago

OK, I installed it on a Debian machine and ran a few tests. It seems to work, at least in my basic testing (see the attached files: input image, ocrmypdf output given the input image, and the MRC-compressed PDF).

example.tar.gz

The text layer and document metadata seem untouched, and the pdfimages output seems sensible:

$ pdfimages -list /tmp/ocrmypdf.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2472  3484  rgb     3   8  jpeg   no        12  0   762   762  635K 2.5%

$ pdfimages -list /tmp/out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2472  3484  rgb     3   8  jpx    no        16  0   762   762 12.8K 0.1%
   1     1 image    2472  3484  rgb     3   8  jpx    no        17  0   762   762 62.6K 0.2%
   1     2 smask    2472  3484  gray    1   1  jbig2  no        17  0   762   762 41.9K 4.0%

Sorry for the delay, but it looks like this is workable, so I could clean up the code and we can do some more testing?

MerlijnWajer commented 2 years ago

VeraPDF also doesn't seem to complain:

~/verapdf/verapdf --format text --flavour 2b /tmp/out.pdf
PASS /tmp/out.pdf

MerlijnWajer commented 2 years ago

Here is my compression script from a few months back. It's very much a work in progress, so please don't use it for any production purposes (but of course, please test and report back):

https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py (apologies for the mess, it is a -test- script)

The only argument is the input PDF; the script then saves the compressed PDF to /tmp/out.pdf. You will need archive-pdf-tools==1.4.13 installed (available via pip). Depending on which code is commented out, it can compress JPEG2000 using Pillow, JPEG using jpegoptim, or JPEG2000 using kakadu.
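
For reference, the Pillow JPEG2000 path looks roughly like this (assuming Pillow was built with OpenJPEG support; the rate value is illustrative, not the script's actual setting):

from PIL import Image

im = Image.open("page.png")
# quality_mode="rates" targets a compression ratio; 50 means roughly 50:1.
im.save("page.jp2", quality_mode="rates", quality_layers=[50])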

If this test code/script seems to do the job, I can extend it to also support conversion to bitonal ccitt/jbig2 (as mentioned in #906) given a flag or something and tidy it up.

As stated earlier, complex PDFs with many images and transparency don't work well yet; for that I'd have to look at the transformations of the pages, the images, transparency, etc. - which I don't think is an issue for OCRmyPDF compression use cases?

MerlijnWajer commented 2 years ago

One thing that I'd like to add is extracting the text layer from a PDF to hOCR, so that it can be used as input for the script and the script knows where the text areas are. This is actually not far off at all - I already have some local code for it - so depending on the feedback here I could try to integrate that.

rmast commented 2 years ago

I tried your script on a newly arrived two-page ABN AMRO letter. The resulting out.pdf is 129 kB, and the "ABN AMRO" letters at the top are quite blurry. DjVuSolo 3.1/DjVuToy reach 46 kB with sharper ABN AMRO letters and less fuzz around the pricing table.

I had to compile Leptonica 1.72, as the Leptonica 1.68 suggested by jbig2enc didn't compile correctly with libpng-dev. I used an Ubuntu 20 image on Azure:

sudo apt-get update
sudo apt-get install automake git libtool libpng-dev build-essential make ocrmypdf pip
pip install archive-pdf-tools==1.4.13
vi ~/.bashrc
export PATH=$PATH:/home/rmast/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git

git clone https://github.com/agl/jbig2enc.git
wget https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py

cd leptonica/
git checkout v1.72
chmod +x configure
./configure
make
sudo make install

cd ../jbig2enc/
./autogen.sh
./configure
make
sudo make install

MerlijnWajer commented 2 years ago

Right - the current code is also inferior to what the normal tooling does, since that uses the text layer info as well; but once I add that (I will try to do so soon), it could be better.

DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.

rmast commented 2 years ago

DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.

That's where DjVuToy comes in: it converts the DjVu result of DjVuSolo 3.1 to a JBIG2/JPEG2000 PDF of 46 kB. The DjVu itself is only 31 kB.

MerlijnWajer commented 2 years ago

I can't find the source for that program. Is it free software? (If not: maybe another issue/place would be better to discuss that?)

rmast commented 2 years ago

No, both are closed source. DjVuSolo 3.1 is a very old pre-commercial demo of the capabilities of DjVu. When they commercialized DjVu they set such high prices that DjVu priced itself out of the market. I guess the Internet Archive once used DjVu. DjVuToy is actively maintained by a Chinese enthusiast, but he's not planning on opening the source.

rmast commented 2 years ago

Here is the result via DjVuSolo 3.1/DjVuToy 3.06 Unicode edition, half the size of your result from the Covid health form:

in.pdf

rmast@Ubuntu20:~$ pdfimages -list in.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     824  1162  rgb     3   8  jpx    yes        1  0   100   101 3080B 0.1%
   1     1 stencil  2472  3484  -       1   1  jbig2  no         3  0   300   300 17.6K 1.7%
   1     2 stencil  2472  3484  -       1   1  jbig2  no         4  0   300   300  347B 0.0%
   1     3 stencil  2472  3484  -       1   1  jbig2  no         5  0   300   300   68B 0.0%
   1     4 stencil  2472  3484  -       1   1  jbig2  no         6  0   300   300 2137B 0.2%
   1     5 stencil  2472  3484  -       1   1  jbig2  no         7  0   300   300  618B 0.1%
   1     6 stencil  2472  3484  -       1   1  jbig2  no         8  0   300   300  984B 0.1%
   1     7 stencil  2472  3484  -       1   1  jbig2  no         9  0   300   300  357B 0.0%
   1     8 stencil  2472  3484  -       1   1  jbig2  no        10  0   300   300 6063B 0.6%
   1     9 stencil  2472  3484  -       1   1  jbig2  no        11  0   300   300  324B 0.0%
   1    10 stencil  2472  3484  -       1   1  jbig2  no        12  0   300   300 11.2K 1.1%
   1    11 stencil  2472  3484  -       1   1  jbig2  no        13  0   300   300  125B 0.0%
   1    12 stencil  2472  3484  -       1   1  jbig2  no        14  0   300   300  114B 0.0%
   1    13 stencil  2472  3484  -       1   1  jbig2  no        15  0   300   300  322B 0.0%
   1    14 stencil  2472  3484  -       1   1  jbig2  no        16  0   300   300  129B 0.0%
   1    15 stencil  2472  3484  -       1   1  jbig2  no        17  0   300   300  246B 0.0%
   1    16 stencil  2472  3484  -       1   1  jbig2  no        18  0   300   300  210B 0.0%
   1    17 stencil  2472  3484  -       1   1  jbig2  no        19  0   300   300  335B 0.0%
   1    18 stencil  2472  3484  -       1   1  jbig2  no        20  0   300   300  194B 0.0%
   1    19 stencil  2472  3484  -       1   1  jbig2  no        21  0   300   300   74B 0.0%
   1    20 stencil  2472  3484  -       1   1  jbig2  no        22  0   300   300  170B 0.0%
   1    21 stencil  2472  3484  -       1   1  jbig2  no        23  0   300   300  349B 0.0%
   1    22 stencil  2472  3484  -       1   1  jbig2  no        24  0   300   300  325B 0.0%
   1    23 stencil  2472  3484  -       1   1  jbig2  no        25  0   300   300  109B 0.0%
   1    24 stencil  2472  3484  -       1   1  jbig2  no        26  0   300   300  139B 0.0%
   1    25 stencil  2472  3484  -       1   1  jbig2  no        27  0   300   300  271B 0.0%
   1    26 stencil  2472  3484  -       1   1  jbig2  no        28  0   300   300  913B 0.1%
   1    27 stencil  2472  3484  -       1   1  jbig2  no        29  0   300   300  138B 0.0%
   1    28 stencil  2472  3484  -       1   1  jbig2  no        30  0   300   300  113B 0.0%
   1    29 stencil  2472  3484  -       1   1  jbig2  no        31  0   300   300  116B 0.0%
   1    30 stencil  2472  3484  -       1   1  jbig2  no        32  0   300   300  117B 0.0%
   1    31 stencil  2472  3484  -       1   1  jbig2  no        33  0   300   300  401B 0.0%
   1    32 stencil  2472  3484  -       1   1  jbig2  no        34  0   300   300  202B 0.0%
rmast@Ubuntu20:~$ ls -al in.pdf
-rw-rw-r-- 1 rmast rmast 58988 May  5 18:00 in.pdf

The many JBIG2 pictures stem from all the colors in the JB2 picture: DjVuToy translates each color into a separate image.

rmast commented 2 years ago

Especially take a look at the clarity of the background picture...

MerlijnWajer commented 2 years ago

So I've cleaned up the code a bit and am looking for some people to try running it on their OCRmyPDF results. (Let's not focus on the DjVu stuff here, please, as I'm trying to make a tool that people can use based on existing/working code.)

You'll need this build of archive-pdf-tools: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2477636215 (just click the artifact download link and pick the release for your OS/Python interpreter from the artifact.zip).

And then download this script: https://archive.org/~merlijn/pdfcomp.py

Use like so:

$ python pdfcomp.py /tmp/ocrmypdf.pdf /tmp/ocrmypdf_comp.pdf
Compression factor: 5.193651663405088

Some random notes...

$ grep -a Tess /tmp/ocrmypdf_comp.pdf
  /Creator (ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3)
<xmp:CreatorTool>ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3</xmp:CreatorTool></rdf:Description>

MerlijnWajer commented 2 years ago

(Sorry for the noise, another build for folks who don't have kdu_expand and kdu_compress)

I added another build here that doesn't rely on kakadu for JPEG2000, but rather on Pillow having JPEG2000 support: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2477739599

Usage is the same. The only external requirement now, that I know of, is https://github.com/agl/jbig2enc - I can also build a version that doesn't need that either, but that will come at a compression cost. (I can turn these into flags for compress-pdf-images in the near future.)

rmast commented 2 years ago

Again on a fresh Ubuntu 20 LTS image on Azure; I took the second build, as the first crashed:

sudo apt-get update
sudo apt-get install automake git libtool libpng-dev build-essential make ocrmypdf pip tesseract-ocr-nld
vi ~/.bashrc
export PATH=$PATH:/home/rmast/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git
git clone https://github.com/agl/jbig2enc.git
cd leptonica/
git checkout v1.72
chmod +x configure
./configure
make
sudo make install
cd ../jbig2enc/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
cd ..
ls -al
pip install archive_pdf_tools-1.4.16-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
pip install lxml==4.6.5
wget https://archive.org/~merlijn/pdfcomp.py
ocrmypdf -l nld 'outputbase2-000-raar-effect-onderste-regel-didjvu zonder tekst.pdf' output_pdf
python3 pdfcomp.py output_pdf ocrmypdf_comp.pdf
Compression factor: 3.92140211978036

Input:

outputbase2-000-raar-effect-onderste-regel-didjvu zonder tekst.pdf

Output: ocrmypdf_comp.pdf

Result looks quite decent.

MerlijnWajer commented 2 years ago

If you're on Ubuntu, you can run apt-get install jbig2enc to get the package - that should save you from building leptonica and jbig2enc. Glad to hear it works; I think some knobs should probably be added soon to control the compression ratio and the compression tools being used.

jknockaert commented 2 years ago

Output: ocrmypdf_comp.pdf

Result looks quite decent.

This is still different from other optimisation techniques I've seen, where the same document would be cut up into a number of monochrome areas, each encoded at 1 bpp at a high resolution (300 dpi). For this specific document you would only need three colours (black, grey and orange). With each area encoded at 300 dpi you would get an excellent result (also allowing for crisp prints) and a very small PDF file size.

rmast commented 2 years ago

Another run, now on an existing Ubuntu 20 installation. I had to explicitly install archive-hocr-tools. Source: Lymevereniging Online community.pdf

ocrmypdf -l nld '/home/robert/Afbeeldingen/Lymevereniging Online community.pdf' output_pdf2
python3 pdfcomp.py output_pdf2 ocrmypdf_comp.pdf
Compression factor: 4.368691391278808

Result: ocrmypdf_comp.pdf

apt-cache search jbig2
libjbig2dec0 - JBIG2 decoder library - shared libraries
libjbig2dec0-dev - JBIG2 decoder library - development files
jbig2dec - JBIG2 decoder library - tools
leptonica-progs - sample programs for Leptonica image processing library
libjpedal-jbig2-java - library for accession of large images
liblept5 - image processing library
libleptonica-dev - image processing library

I see leptonica, but no jbig2enc, in Ubuntu 20. Do you have a special apt source for it?

rmast commented 2 years ago

This is still different from other optimisation techniques I've seen, where the same document would be cut up into a number of monochrome areas, each encoded at 1 bpp at a high resolution (300 dpi). For this specific document you would only need three colours (black, grey and orange). With each area encoded at 300 dpi you would get an excellent result (also allowing for crisp prints) and a very small PDF file size.

I agree. The result of my second try is of too high quality at 139.6 kB; even the artifacts of an inkjet printer are clearly visible. I would expect the result to search for the most compact JBIG2 representation, cleaning up these dried-out print-head artifacts and making use of the OCR result to optimize the JBIG2 choices.

jknockaert commented 2 years ago

I think the optimisation technique applied should reflect the print technology that was used to produce the original document. A lot of documents use a very limited number of colours in a bitonal way (either foreground colour or background). The optimisation technique should identify these colours, produce a 300 dpi bitonal layer for each colour (plus a uniform background layer), and stack the layers on top of each other (a toy sketch of this follows). And perhaps identify picture areas and use an appropriate format for those (JPEG or other).
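
A toy sketch of this per-colour layering, assuming naive palette quantization (file names and the colour count are made up; real segmentation would need to be much smarter):

import numpy as np
from PIL import Image

scan = Image.open("scan.png").convert("RGB")
quant = scan.quantize(colors=4)        # background + ~3 ink colours
indices = np.asarray(quant)            # palette index per pixel
for colour in range(4):
    # One full-resolution 1 bpp Group 4 layer per colour.
    mask = Image.fromarray(np.uint8(indices == colour) * 255, "L")
    mask.convert("1").save(f"layer_{colour}.tiff", compression="group4")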

MerlijnWajer commented 2 years ago

Output: ocrmypdf_comp.pdf Result looks quite decent.

This is still different from other optimisation techniques I've seen, where the same document would be cut up into a number of monochrome areas, each encoded at 1 bpp at a high resolution (300 dpi). For this specific document you would only need three colours (black, grey and orange). With each area encoded at 300 dpi you would get an excellent result (also allowing for crisp prints) and a very small PDF file size.

Yes, there are other compression techniques out there, such as special-casing bitonal images or images with only a few colours, but that is not currently implemented in what I linked above. What I linked above implements MRC (https://en.wikipedia.org/wiki/Mixed_raster_content) - much like the commercial Luratech/Foxit offerings - which works great for all kinds of scanned documents. The OpenJPEG version is worse, quality- and compression-wise, than the kakadu version, but it only uses free software, so I figured it was better for testing things out. The technique is much like what is described here: https://www.youtube.com/watch?v=RmAPYpizl3M
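
In outline, MRC decomposes a scan roughly like this (a toy sketch, not archive-pdf-tools' actual pipeline; the threshold and resolutions are illustrative):

from PIL import Image

# MRC in miniature: a full-resolution bitonal mask selects text pixels;
# the background and foreground-colour images can then be stored at much
# lower resolution/quality.
scan = Image.open("scan.png").convert("RGB")
mask = scan.convert("L").point(lambda p: 255 if p < 128 else 0).convert("1")
w, h = scan.size
background = scan.resize((w // 3, h // 3))   # page background: low res is fine
foreground = scan.resize((w // 6, h // 6))   # ink colours: even lower res
mask.save("mask.tiff", compression="group4") # becomes JBIG2/G4 in the PDF
background.save("bg.jp2")                    # JPEG2000 (needs OpenJPEG)
foreground.save("fg.jp2")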

I'm happy to work with you and others on other encoding techniques to better encode certain input documents, but the solution I linked above works on any input document (as long as the images are not transparent). I can easily add a flag for 1-bit input (or detect it) and just encode the whole thing with JBIG2. But anything more DjVu-like will take quite some work (and is also limited in some ways).

My hope was that if there is demand for compression (as in this issue) we can look at integrating a tool like pdfcomp in the OCRmyPDF workflow, and then look at extending the features that pdfcomp offers.

rmast commented 2 years ago

I agree that much effort must have been put into DjVu, especially the efficiency of the expired patented routines. With the progress of AI and processing power I would expect new approaches to come within reach. I'm curious whether the complexity of selecting the best mix of technologies could be organized in an open-source community. A commercial product like Acrobat Pro does recognize and match characters, but not with the purpose of optimal and fully automatic compression.

rmast commented 2 years ago

But I believe the compressed quality of this picture is still better than when I compress it with PDF24 to 150 dpi at 75% quality, which is still twice as big: output_pdf215075pct.pdf

So I believe that, even at the rather large size, it's quite competitive with current alternatives.

rmast commented 2 years ago

@jknockaert Looking at the expected quality, were you thinking more of the quality you can achieve via the non-open-source DjVu route? base220611-000.pdf (32.9 kB, so just a quarter of what we have now)

MerlijnWajer commented 2 years ago

@jknockaert Looking at the expected quality, were you thinking more of the quality you can achieve via the non-open-source DjVu route? base220611-000.pdf (32.9 kB, so just a quarter of what we have now)

This file doesn't contain the text layer, for what it is worth. If you run this through recode_pdf with default params you will get a 40 kB PDF file with the text layer. Part of the reason I think it looks a bit poor is that the scan itself is actually not of very high quality or resolution - at least the image I took from output_pdf215075pct.pdf. But again, I think that if we want to discuss the various ways of doing PDF compression, this particular issue might not be the best place? Not sure; I'm just trying to move this feature for OCRmyPDF forward.

In any case, if there's interest in integrating this, I'd appreciate some guidance from @jbarlow83 or others on where to do such an integration.

Should I look at making a plugin that uses the default OCRmyPDF license and then shells out to pdfcomp? And how would we let users pass parameters if they want to? We can pass them along on the command line, but I am wondering more specifically about the OCRmyPDF part. So: if there were some command to shell out to that aggressively compresses PDFs, how do you think it ought to be integrated for the users?

My idea is to have some "presets" - high-quality generic compression, lower-quality generic compression, bitonal compression, etc. And potentially we could let the user specify whether they want JPEG or JPEG2000, and which encoders they'd want to use. (These things matter somewhat for the required dependencies as well as the quality.)
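
A sketch of the shell-out idea; the hook name optimize_pdf matches the optimizer-replacement hook that jbarlow83 announces at the end of this thread, but the exact signature here is an assumption:

import subprocess
from ocrmypdf import hookimpl

@hookimpl
def optimize_pdf(input_pdf, output_pdf, context):
    # Delegate optimization to the external pdfcomp tool.
    subprocess.run(["pdfcomp", str(input_pdf), str(output_pdf)], check=True)
    return output_pdf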

rmast commented 2 years ago

There must be a snap version of jbig2enc for Ubuntu 20:

https://snapcraft.io/install/jbig2enc/ubuntu

However, snap applications aren't allowed to access /tmp.

MerlijnWajer commented 2 years ago

Should I look at making a plugin that uses the default OCRmyPDF license and then shells out to pdfcomp? And how would we let users pass parameters if they want to? We can pass them along on the command line, but I am wondering more specifically about the OCRmyPDF part. So: if there were some command to shell out to that aggressively compresses PDFs, how do you think it ought to be integrated for the users?

My idea is to have some "presets" - high-quality generic compression, lower-quality generic compression, bitonal compression, etc. And potentially we could let the user specify whether they want JPEG or JPEG2000, and which encoders they'd want to use. (These things matter somewhat for the required dependencies as well as the quality.)

Just to add to this: it would be best to use lossless images in OCRmyPDF when attempting to compress the PDF later with pdfcomp. For example, if the input images are PNG or TIFF, it would be best not to make a PDF with JPEGs and then have that be compressed with pdfcomp. It'd be better to just insert the PNGs losslessly and let the compression tool sort it out - this prevents additional compression artifacts from sneaking in.

MerlijnWajer commented 2 years ago

Something like this ought to result in compressed but decent-quality PDFs (of course, insert your own dpi):

$ ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 0000.tiff out.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 156.90page/s]
OCR: 100%|█████████████████████████████████████████████████████████| 1.0/1.0 [00:59<00:00, 59.38s/page]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.80s/page]
Output file is a PDF/A-2B (as expected)

$ pdfcomp out.pdf out_c.pdf
Compression factor: 253.28204508856683

$ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2552  3508  rgb     3   8  image  no        10  0   600   600 12.0M  47%

$ pdfimages -list out_c.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     850  1168  rgb     3   8  jpx    no        15  0   200   200 1450B 0.0%
   1     1 image    2552  3508  rgb     3   8  jpx    no        16  0   600   600 18.2K 0.1%
   1     2 smask    2552  3508  gray    1   1  jbig2  no        16  0   600   600 18.0K 1.6%

$ ls -lsh 0000.tiff out.pdf out_c.pdf
26M -rw-r--r-- 1 merlijn merlijn 26M Jun 12 16:03 0000.tiff
52K -rw-r--r-- 1 merlijn merlijn 49K Jun 12 16:18 out_c.pdf
13M -rw-r--r-- 1 merlijn merlijn 13M Jun 12 16:22 out.pdf

(the initial version of pdfcomp is here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2479534970)

rmast commented 2 years ago

What source picture leads to such small sizes? Looking at the figures: with MRC I am used to lower-resolution pictures for both the background and the foreground-colour images; only the JB2 image should be at the original resolution. However, in your overview image 1 is still at 600x600 dpi. Are you using ROI coding to diminish the real resolution in that jpx? Also, when I follow the link I find a seemingly different set of wheels for archive-pdf-tools; there is no pdfcomp when I search that Git repo.

MerlijnWajer commented 2 years ago

The picture is a scan of a bank statement, so I can't share that photo, but it's just a white page with a logo and some text on it. The foreground is typically not downsampled at all, only the background is, so that is 'normal'. The latest version I linked above should contain pdfcomp, but it's not in the master branch because it is still experimental. You can see it here: https://github.com/internetarchive/archive-pdf-tools/tree/pdf-metadata-tooling

rmast commented 2 years ago

I downloaded the latest integrated pdfcomp and repeated the steps with an old, non-smudgy A4 ING bank statement, scanned at 600 dpi straight to TIFF. The paper structure is visible.

robert@robert-virtual-machine:~$ ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatement.tiff out.pdf
WARNING - --pdfa-image-compression argument has no effect when --output-type is not 'pdfa', 'pdfa-1', or 'pdfa-2'
   INFO - Input file is not a PDF, checking if it is an image...
   INFO - Input file is an image
   INFO - Input image has no ICC profile, assuming sRGB
   INFO - Image seems valid. Try converting to PDF...
   INFO - Successfully converted to PDF, processing...
Scan: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 112.61page/s]
   INFO - Using Tesseract OpenMP thread limit 3
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:40<00:00, 40.10s/page]
   INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 8.75× larger than the input file.
Possible reasons for this include:
Optimization was disabled.
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_c.pdf
Compression factor: 47.39642946807007
robert@robert-virtual-machine:~$ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    5196  7001  rgb     3   8  image  no        10  0   600   600 21.4M  21%
robert@robert-virtual-machine:~$ pdfimages -list out_c.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1732  2333  rgb     3   8  jpx    no        16  0   200   200 47.3K 0.4%
   1     1 image    5196  7001  rgb     3   8  jpx    no        17  0   600   600  355K 0.3%
   1     2 smask    5196  7001  gray    1   1  jbig2  no        17  0   600   600 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh bankstatement.tiff out.pdf out_c.pdf
2,5M -rw-r----- 1 robert robert 2,5M jun 12 22:48 bankstatement.tiff
464K -rw-rw-r-- 1 robert robert 463K jun 12 22:54 out_c.pdf
 22M -rw-rw-r-- 1 robert robert  22M jun 12 22:52 out.pdf

I looked at the huge picture (image 1) of 355 kB; it should only contain the colorization, but it is very detailed and huge.

The machine has kdu_compress installed.

MerlijnWajer commented 2 years ago

I downloaded the latest integrated pdfcomp and repeated the steps with an old, non-smudgy A4 ING bank statement, scanned at 600 dpi straight to TIFF. The paper structure is visible.

I have made an issue here (https://github.com/internetarchive/archive-pdf-tools/issues/51) so that we don't need to bother others with some implementation details wrt compression. Let's figure out if we can make this work for you in the way that it works for me, and we can report back here.

jbarlow83 commented 2 years ago

v13.5.0 (currently in testing; when released) will add support for a plugin hook to replace ocrmypdf's default optimizer, as previously promised.

Hopefully this will make it easier to test these changes and to integrate them better with ocrmypdf.

Of course I'd prefer, where technically and legally possible, to incorporate improvements directly into ocrmypdf.