I would go with modifying ocrmypdf, replacing the call to pngquant.quantize with code that always converts the image to 1bpp (e.g. just use Pillow). Instead of forcing PNG input, you could also uncomment the optimize.py:523 "try pngifying the jpegs" code which, as the name suggests, speculatively converts JPEGs to PNGs. I believe this had a few corner cases to be worked out and is too costly in performance in the typical case, but you could try it, especially if you are forcing everything to JBIG2 anyway.
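For reference, a minimal Pillow sketch of the "always convert to 1 bpp" idea (the helper name and fixed threshold are assumptions for illustration, not ocrmypdf code):

from PIL import Image

# Sketch: force an extracted image down to 1 bpp with a hard threshold.
# Image.convert("1") alone would apply Pillow's default dithering, which
# tends to hurt JBIG2 compression, so threshold explicitly first.
def to_1bpp(src_path: str, dst_path: str, threshold: int = 128) -> None:
    im = Image.open(src_path).convert("L")                        # grayscale first
    im = im.point(lambda v: 255 if v >= threshold else 0).convert("1")
    im.save(dst_path)                                             # e.g. PNG or TIFF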
I'm giving it a try and am having some success.
@jbarlow83 A question: this return doesn't look right since it leaves the function after handling only one image. Is this OK?
For me this leads to only one of multiple images being handled in a multi-page PDF, where each page contains one image. (Since the loop cannot finish.)
And one (related?) curiosity: I managed to modify the conversion pipeline such that I now have multiple 1 bpp PNGs waiting in the temp folder to be handled. If there is only one such PNG the resulting PDF looks fine. If there are multiple such images the resulting PDF is distorted. Looking at the images in the temp folder I got:
Then the code converts those TIFs to JBIG2 file(s) by invoking the jbig2 tool. This seems to be erroneous if there are multiple TIFs (leading to distortions in the final PDF). It works for one TIF though. So the question is: do you have a test in place checking that PDFs with multiple 1 bpp images can correctly be converted to the JBIG2 format? Or could this be a bug?
Note: I suspect that the above-mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF - since the loop always terminates after generating one TIF.
(But this might also be me not understanding how the final JBIG2 handling works. I might have broken something with my modifications.)
Edit: the debug output shows me this command line that is being used by OCRmyPDF:
DEBUG - Running: ['jbig2', '-b', 'group00000000', '-s', '-p', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000032.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000028.tif', '/tmp/com.github.ocrmypdf.ylclub9u/images/00000030.tif']
The TIF files look good.
I found the reason why my PDF containing the 1 bpp JBIG2 images was distorted. The color space of the embedded images was not correct: it was still /DeviceRGB, but it should have been /DeviceGray.
I was able to quick-fix this by inserting im_obj.ColorSpace = Name("/DeviceGray") right before this line:
https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L340
The PDF now looks good.
Hypothesis: it was never intended to change the color space during image optimization?
Edit: Suggested fix:
if (Name.BitsPerComponent in im_obj and im_obj.BitsPerComponent == 1):
    log.debug("Setting ColorSpace to /DeviceGray")
    im_obj.ColorSpace = Name("/DeviceGray")
Edit 2: A better fix? Add im_obj.ColorSpace = Name("/DeviceGray") here:
https://github.com/jbarlow83/OCRmyPDF/blob/58abb5785cf55d0cfddeee017e81ca4a8250a94c/src/ocrmypdf/optimize.py#L430
I implemented and pushed a solution that works for me and is basically a shortcut to TIF generation (see the commit linked above). I added a new user-script option that can be used to run arbitrary shell commands on images. This user script takes the source and destination file paths as input parameters and must convert the source image to a 1 bpp TIF.
The shell script that works for me looks like this:
#!/bin/sh
convert -colorspace gray -fill white -sharpen 0x2 "$1" - | jpegtopnm | pamthreshold | pamtotiff -g4 > "$2"
This requires ImageMagick and netpbm-progs to be installed, but one could use other conversion tools here as well. pamthreshold implements a nice dynamic threshold.
The command that I used to test looks like this:
ocrmypdf --user-script-jpg-to-1bpp-tif shell.sh --jbig2-lossy -v 1 -O3 in.pdf out.pdf
I'm not opening a pull request since the solution is very narrow to my use case. And right now it only handles JPEG images. But maybe somebody finds this useful as a starting point.
I suspect that the above-mentioned return prevented multiple JBIG2 files from ever being inserted into the final PDF - since the loop always terminates after generating one TIF.
You are correct, those returns are wrong and will suppress multiple images per file. That's a great catch.
Hypothesis: it was never intended to change the color space during image optimization?
Also correct. /DeviceGray is not correct in general, but probably suitable for your use case. Some files will specify a complex colorspace instead of /DeviceRGB, and changing to /DeviceGray may not be correct, so optimize tries to avoid changing the colorspace. It is also possible to specify a 1-bit color colorspace in which, for example, 0 is blue and 1 is red.
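To illustrate why a blanket /DeviceGray rewrite is unsafe, here is a hedged sketch of the kind of guard one might add (the helper name is hypothetical; only pikepdf's Name is assumed):

from pikepdf import Name

# Hypothetical guard: only rewrite the colorspace when the image already uses a
# plain device colorspace. An entry like [/Indexed /DeviceRGB 1 <0000ffff0000>]
# maps bit 0 to blue and bit 1 to red, so forcing /DeviceGray would recolor it.
def safe_to_force_gray(im_obj) -> bool:
    cs = im_obj.get(Name.ColorSpace)
    return cs in (Name.DeviceGray, Name.DeviceRGB)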
I'm not opening a pull request since the solution
Agreed - that's a lot of new dependencies to add.
I also needed exactly this!
I tried to rebase onto master, missed some things in the manual merges required and added them afterwards, so my branch doesn't look so clean right now. But here it is: https://github.com/andersjohansson/OCRmyPDF/tree/feature/github-541-reduce-output-file-size-v10
It works fine now though! Thanks!
userscript.py could be structured as a plugin instead (a new feature in 10.x). You'd need to create a new hook as well by adding it to pluginspec.py, and then we could have a generic, pluggable interface for people who want to optimize images more aggressively.
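A rough sketch of what that could look like (the hook name filter_optimize_image is hypothetical, not an existing ocrmypdf hook; @hookimpl and the --plugin option are the real plugin mechanism):

# In src/ocrmypdf/pluginspec.py one would declare the new hook, e.g.:
#
#     @hookspec
#     def filter_optimize_image(image_path, page_context): ...
#
# A user plugin (loaded with `ocrmypdf --plugin my_optimizer ...`) could then
# implement it and shell out to an arbitrary user script:
import subprocess
from ocrmypdf import hookimpl

@hookimpl
def filter_optimize_image(image_path, page_context):
    """Return a path to a replacement image (e.g. a 1 bpp TIFF), or None."""
    out_path = image_path.with_suffix(".1bpp.tif")
    subprocess.run(["./shell.sh", str(image_path), str(out_path)], check=True)
    return out_path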
If @heinrich-ulbricht or anyone else is interested in looking more into this in the future, see also the comments that @jbarlow83 added here: https://github.com/andersjohansson/OCRmyPDF/commit/4e5b68f1b966312edeba8ef3b6e12037bac8aef6
What about using MRC compression to keep the file visually as close to the original as possible while losing a lot of size, as @jbarlow83 mentioned here:
https://github.com/jbarlow83/OCRmyPDF/issues/836#issuecomment-922560147
(We do not do page color segmentation at this time, i.e., finding regions of a page or image that can be represented with a reduced colorspace. It's not an easy feature to implement and will probably need a corporate sponsor so that I can work on it full time for a few weeks. You do get better compression if you're able to work with the original PDFs.)
You could just look at how the closed-source DjVuSolo 3.1 reaches astonishingly small sizes with really legible results, even keeping color in its JBIG2-like JB2 format. With DjVuToy you can transform those DjVus into PDFs that are only about twice as big.
With https://github.com/jwilk/didjvu there has been an attempt to open-source this MRC mechanism, however with some inconveniences that keep files too big to be a serious candidate to replace the old DjVuSolo 3.1 in the Russian user group.
However, many DjVu patents have expired, so there might be some valuable MRC knowledge in those patents, as @jsbien suggested.
@rmast This is interesting information and could be helpful if I ever get the opportunity to implement this. Thanks.
(Found this through @rmast) -- If you're looking for a MRC implementation, https://github.com/internetarchive/archive-pdf-tools does this when it creates PDFs with text layers (it's mostly like OCRMyPDF but doesn't attempt to do OCR and requires that be done externally) - the MRC code can also be used as a library, although I probably need to make the API a bit more ... accessible. @jbarlow83 - If you're interested in this I could try to make a more accessible API. Alternatively, I could look at improving the "pdf recoding" method some where the software compresses an existing PDF by replacing the images with MRC compression images, so then one could just run recode_pdf after OCRmyPDF has done its thing.
@MerlijnWajer Thanks for the suggestion - that is impressive work. Unfortunately it's license-incompatible (AGPL) and also uses PyMuPDF as its PDF generator. I like PyMuPDF and used it previously, but it relies on libmupdf which is only released as a static library and doesn't promise a stable interface, meaning that Linux distributions won't include it.
But setting it up through a plugin interface, calling recode_pdf by command line, would certainly be doable.
I'll try to implement this mode (modifying the images of a PDF without touching most other parts) in the next week or so and report back, then we could maybe look at the plugin path. (Actually, give me more like two weeks, I'll have to do some refactoring to support this recoding mode)
It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license we could also work in it that way.
It looks like you/archive.org may be the sole copyright holder. If you're willing to contribute portions of your existing code to ocrmypdf under its MPL2 license we could also work in it that way.
Right - I'll have to think about that (and also ask). For now I will try to get a tool to recode an existing PDF working first, since I've been wanting to add/implement that for a long time anyway, and this is a great motivation to do it. I'll also make the MRC API more usable (current code is heavily optimised for performance, not for API usability), though, so we could revisit the potential license situation once that is done.
@blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: https://github.com/ocrmypdf/OCRmyPDF/issues/9 https://github.com/fritz-hh/OCRmyPDF/issues/88
I understand license (in)compatibility is inhibiting progress.
I was also looking into didjvu to understand the MRC compression over there. MRC is achieved in that tool by a Gamera-based didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and the background, so the license of didjvu is probably less important than the licenses of Gamera and C44.
Do you have experience keeping products with such incompatible licenses alive? Would the same question be different when trying to get GScan2PDF (GPLv3) to use MRC?
@blaueente @v217 I saw your input in these issues concerning introducing MRC into OCRMyPDF: #9 fritz-hh/OCRmyPDF#88
I understand license (in)compatibility is inhibiting progress.
I was also looking into didjvu to understand the MRC compression over there. MRC is achieved in that tool by a Gamera-based didjvu binarizer, followed by C44 from the djvulibre tooling for both the foreground and the background, so the license of didjvu is probably less important than the licenses of Gamera and C44.
Didjvu itself mainly deals with organizing everything, so I guess one couldn't use code from it directly anyway. C44/iw44 is the wavelet codec used by didjvu, and is therefore unusable for PDF MRC. The ideas of archive-pdf-tools seem pretty good to me; maybe it could learn from Gamera's separation algorithms and the ROI-style coding of iw44, although I see good discussions on its GitHub page already.
Do you have experience keeping products with such incompatible licenses alive? Would the same question be different when trying to get GScan2PDF (GPLv3) to use MRC?
Regarding licenses, I can't really help you. The approach of @MerlijnWajer sounds great though. Talk about what can be shared, and what can be just re-used as separate interfacing binaries.
I was experimenting with a script a while ago but couldn't get it to fully work on oddball PDFs and then gave up for a bit. But I think I just realised that at least for PDFs generated by OCRmyPDF, this is a non-issue. Does anyone have some sample/test PDFs created by OCRMyPDF that I could run my script on?
OK, I installed it on a debian machine and ran a few tests. It seems to work at least for my basic testing (see attached files, input image, ocrmypdf output given input image, MRC compressed pdf)
The text layer and document metadata seem untouched, and the pdfimages output seems sensible:
$ pdfimages -list /tmp/ocrmypdf.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2472 3484 rgb 3 8 jpeg no 12 0 762 762 635K 2.5%
$ pdfimages -list /tmp/out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2472 3484 rgb 3 8 jpx no 16 0 762 762 12.8K 0.1%
1 1 image 2472 3484 rgb 3 8 jpx no 17 0 762 762 62.6K 0.2%
1 2 smask 2472 3484 gray 1 1 jbig2 no 17 0 762 762 41.9K 4.0%
Sorry for the delay, but it looks like this is workable, so I could clean up the code and we can do some more testing?
VeraPDF also doesn't seem to complain:
~/verapdf/verapdf --format text --flavour 2b /tmp/out.pdf
PASS /tmp/out.pdf
Here is my compression script from a few months back, it's very much work in progress so please don't use it for any production purposes (but of course, please test and report back):
https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py (apologies for the mess, it is a -test- script)
The only argument is the input PDF, and it will save the compressed PDF to /tmp/out.pdf. You will need archive-pdf-tools==1.4.13 installed (available via pip). Depending on which code is commented, it can compress JPEG2000 using Pillow, JPEG using jpegoptim, or JPEG2000 using kakadu.
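For instance, the Pillow JPEG2000 path is roughly this (a sketch, assuming a Pillow build with OpenJPEG support; in "rates" mode quality_layers=[50] asks for roughly 50:1 compression):

from PIL import Image

im = Image.open("page.png").convert("RGB")
# Lossy JPEG 2000: "rates" mode interprets quality_layers as compression ratios.
im.save("page.jp2", "JPEG2000", quality_mode="rates", quality_layers=[50], irreversible=True)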
If this test code/script seems to do the job, I can extend it to also support conversion to bitonal ccitt/jbig2 (as mentioned in #906) given a flag or something and tidy it up.
As stated earlier, complex PDFs with many images and transparency don't work well yet, but for that I'd have to look at the transformations of the pages, the images, transparency, etc... which I don't think is an issue for OCRmyPDF compression use cases?
One thing that I'd like to add is to extract the text layer from a PDF to hOCR, so that it can be used as input for the script and it knows where the text areas are. This is actually not far off at all; I already have some local code for it, so depending on the feedback here I could try to integrate that.
I tried your script on a newly arrived ABN AMRO-letter of two pages. The resulting out.pdf is 129 kb, and the letters ABN AMRO on top are quite vague. DjvuSolo 3.1/DjVuToy reach 46 kb with sharper ABN AMRO letters and less fuzz around the pricing table.
I had to compile Leptonica 1.72, as the suggested leptonica 1.68 in jbig2enc didn't compile right with libpng-dev. I used an Ubuntu 20 image on Azure
sudo apt-get update
sudo apt-get install automake git libtool libpng-dev build-essential make ocrmypdf pip
pip install archive-pdf-tools==1.4.13
vi ~/.bashrc
export PATH=$PATH:/home/rmast/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git
git clone https://github.com/agl/jbig2enc.git
wget https://archive.org/~merlijn/recode-existing-pdf-WORKINPROGRESS.py
cd leptonica/
git checkout v1.72
chmod +x configure
./configure
make
sudo make install
cd ../jbig2enc/
./autogen.sh
./configure
make
sudo make install
Right, the current code is also inferior to what the normal tooling does since that uses the text layer info as well, but once I add that (I will try to do that soon), it could be better.
DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.
DjVu is a fun comparison but it has the advantage of being able to use image formats that are not supported in PDF.
That's where DjVuToy comes in, that converts the DjVu-result of DjVuSolo3.1 to a JBIG2/JPEG2000 PDF of 46kb. The DjVu itself is only 31kb.
I can't find the source for that program. Is it free software? (If not: maybe another issue/place would be better to discuss that?)
No, both are closed source. DjVuSolo 3.1 is a very old pre-commercial demo of the capabilities of DjVu. When they commercialized DjVu they set such high prices that DjVu priced itself out of the market. I guess the Internet Archive once used DjVu. DjVuToy is actively maintained by a Chinese enthusiast, but he's not planning on opening the source.
Here is the result via DjVuSolo 3.1/DjVuToy 3.06 Unicode edition, half the size of your result for the Covid health form:
rmast@Ubuntu20:~$ pdfimages -list in.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 824 1162 rgb 3 8 jpx yes 1 0 100 101 3080B 0.1%
1 1 stencil 2472 3484 - 1 1 jbig2 no 3 0 300 300 17.6K 1.7%
1 2 stencil 2472 3484 - 1 1 jbig2 no 4 0 300 300 347B 0.0%
1 3 stencil 2472 3484 - 1 1 jbig2 no 5 0 300 300 68B 0.0%
1 4 stencil 2472 3484 - 1 1 jbig2 no 6 0 300 300 2137B 0.2%
1 5 stencil 2472 3484 - 1 1 jbig2 no 7 0 300 300 618B 0.1%
1 6 stencil 2472 3484 - 1 1 jbig2 no 8 0 300 300 984B 0.1%
1 7 stencil 2472 3484 - 1 1 jbig2 no 9 0 300 300 357B 0.0%
1 8 stencil 2472 3484 - 1 1 jbig2 no 10 0 300 300 6063B 0.6%
1 9 stencil 2472 3484 - 1 1 jbig2 no 11 0 300 300 324B 0.0%
1 10 stencil 2472 3484 - 1 1 jbig2 no 12 0 300 300 11.2K 1.1%
1 11 stencil 2472 3484 - 1 1 jbig2 no 13 0 300 300 125B 0.0%
1 12 stencil 2472 3484 - 1 1 jbig2 no 14 0 300 300 114B 0.0%
1 13 stencil 2472 3484 - 1 1 jbig2 no 15 0 300 300 322B 0.0%
1 14 stencil 2472 3484 - 1 1 jbig2 no 16 0 300 300 129B 0.0%
1 15 stencil 2472 3484 - 1 1 jbig2 no 17 0 300 300 246B 0.0%
1 16 stencil 2472 3484 - 1 1 jbig2 no 18 0 300 300 210B 0.0%
1 17 stencil 2472 3484 - 1 1 jbig2 no 19 0 300 300 335B 0.0%
1 18 stencil 2472 3484 - 1 1 jbig2 no 20 0 300 300 194B 0.0%
1 19 stencil 2472 3484 - 1 1 jbig2 no 21 0 300 300 74B 0.0%
1 20 stencil 2472 3484 - 1 1 jbig2 no 22 0 300 300 170B 0.0%
1 21 stencil 2472 3484 - 1 1 jbig2 no 23 0 300 300 349B 0.0%
1 22 stencil 2472 3484 - 1 1 jbig2 no 24 0 300 300 325B 0.0%
1 23 stencil 2472 3484 - 1 1 jbig2 no 25 0 300 300 109B 0.0%
1 24 stencil 2472 3484 - 1 1 jbig2 no 26 0 300 300 139B 0.0%
1 25 stencil 2472 3484 - 1 1 jbig2 no 27 0 300 300 271B 0.0%
1 26 stencil 2472 3484 - 1 1 jbig2 no 28 0 300 300 913B 0.1%
1 27 stencil 2472 3484 - 1 1 jbig2 no 29 0 300 300 138B 0.0%
1 28 stencil 2472 3484 - 1 1 jbig2 no 30 0 300 300 113B 0.0%
1 29 stencil 2472 3484 - 1 1 jbig2 no 31 0 300 300 116B 0.0%
1 30 stencil 2472 3484 - 1 1 jbig2 no 32 0 300 300 117B 0.0%
1 31 stencil 2472 3484 - 1 1 jbig2 no 33 0 300 300 401B 0.0%
1 32 stencil 2472 3484 - 1 1 jbig2 no 34 0 300 300 202B 0.0%
rmast@Ubuntu20:~$ ls -al in.pdf
-rw-rw-r-- 1 rmast rmast 58988 May 5 18:00 in.pdf
The many JBIG2 pictures stem from all the colors in the JB2 picture; DjVuToy translates those into separate images, each with its own color.
Especially take a look at the clearness of the background picture...
So I've cleaned up the code a bit and am looking for some people to try and run it on their OCRMyPDF results. (Let's not focus on DjVu stuff here please, as I'm trying to make a tool that people can use based on existing/working code)
You'll need this build of archive-pdf-tools: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2477636215 (just click on the artifact download link and pick the release for your OS/Python interpreter from the artifact.zip)
And then download this script: https://archive.org/~merlijn/pdfcomp.py
Use like so:
$ python pdfcomp.py /tmp/ocrmypdf.pdf /tmp/ocrmypdf_comp.pdf
Compression factor: 5.193651663405088
Some random notes...
$ grep -a Tess /tmp/ocrmypdf_comp.pdf
/Creator (ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3)
<xmp:CreatorTool>ocrmypdf 6.1.2 / Tesseract OCR-PDF 4.1.3</xmp:CreatorTool></rdf:Description>
(Sorry for the noise, another build for folks who don't have kdu_expand and kdu_compress)
I added another build here that doesn't rely on kakadu for JPEG2000, but rather on Pillow having JPEG2000 support: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2477739599
Usage is the same. The only external requirement now that I know of is https://github.com/agl/jbig2enc - I can also build a version that doesn't need that either, but that will come at a compression cost. (I can turn these into flags for compress-pdf-images in the near future.)
Again on a fresh Ubuntu 20 LTS image on Azure; I took the second download, as the first crashed:
sudo apt-get update
sudo apt-get install automake git libtool libpng-dev build-essential make ocrmypdf pip tesseract-ocr-nld
vi ~/.bashrc
export PATH=$PATH:/home/rmast/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git
git clone https://github.com/agl/jbig2enc.git
cd leptonica/
git checkout v1.72
chmod +x configure
./configure
make
sudo make install
cd ../jbig2enc/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
cd ..
ls -al
pip install archive_pdf_tools-1.4.16-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
pip install lxml==4.6.5
wget https://archive.org/~merlijn/pdfcomp.py
ocrmypdf -l nld 'outputbase2-000-raar-effect-onderste-regel-didjvu zonder tekst.pdf' output_pdf
python3 pdfcomp.py output_pdf ocrmypdf_comp.pdf
Compression factor: 3.92140211978036
Input:
outputbase2-000-raar-effect-onderste-regel-didjvu zonder tekst.pdf
Output: ocrmypdf_comp.pdf
Result looks quite decent.
If you're on Ubuntu, you can run apt-get install jbig2enc to get the package - that should save you from building leptonica and jbig2enc. Glad to hear it works; I think some knobs should probably be added soon to control the compression ratio and the compression tools being used.
Output: ocrmypdf_comp.pdf
Result looks quite decent.
This is still different from other optimisation techniques I've seen, where the same document would be cut up in a number of monochrome areas each encoded at 1bpp at a high resolution (300dpi). For this specific document you would only need three colours (black, grey and orange). With each area encoded at 300dpi you would get an excellent result (also allowing for crisp prints) and a very small pdf file size.
Another run, now on an existing Ubuntu 20 installation. I had to explicitly install archive-hocr-tools. Source: Lymevereniging Online community.pdf
ocrmypdf -l nld '/home/robert/Afbeeldingen/Lymevereniging Online community.pdf' output_pdf2
python3 pdfcomp.py output_pdf2 ocrmypdf_comp.pdf
Compression factor: 4.368691391278808
Result: ocrmypdf_comp.pdf
apt-cache search jbig2
libjbig2dec0 - JBIG2 decoder library - shared libraries
libjbig2dec0-dev - JBIG2 decoder library - development files
jbig2dec - JBIG2 decoder library - tools
leptonica-progs - sample programs for Leptonica image processing library
libjpedal-jbig2-java - library for accession of large images
liblept5 - image processing library
libleptonica-dev - image processing library
I see leptonica, but no jbig2enc in Ubuntu20. Do you have a special apt-sources-package for it?
This is still different from other optimisation techniques I've seen, where the same document would be cut up in a number of monochrome areas each encoded at 1bpp at a high resolution (300dpi). For this specific document you would only need three colours (black, grey and orange). With each area encoded at 300dpi you would get an excellent result (also allowing for crisp prints) and a very small pdf file size.
I agree. The result of my second try, at 139.6 kilobytes, is of needlessly high quality: even the artifacts of an inkjet printer are clearly visible. I would expect a result that looks for the most optimal JBIG2 representation, clearing out these dried-out print head artifacts and making use of the OCR result to optimize the JBIG2 choices.
I think the optimisation technique applied should reflect the print technology that was used to produce the original document. A lot of documents use a very limited number of colours in a bitonal way (either foreground colour or background). The optimisation technique should identify these colours and produce a 300 dpi bitonal layer for each colour (plus a uniform background layer) and put the layers on top of each other. And perhaps identify picture areas and use an appropriate format for these (JPEG or other).
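A rough illustration of that idea (my own sketch, not pdfcomp or archive-pdf-tools code): quantize the page to a handful of colors and emit one full-resolution 1 bpp mask per color, each of which could then be CCITT/JBIG2-encoded and filled with its palette color.

import numpy as np
from PIL import Image

page = Image.open("scan.png").convert("RGB")
quant = page.quantize(colors=3)              # e.g. black, grey, orange
indices = np.array(quant)                    # per-pixel palette index
for i in range(3):
    mask = Image.fromarray(((indices == i) * 255).astype(np.uint8), mode="L").convert("1")
    mask.save(f"layer_{i}.tif", compression="group4")   # one bitonal layer per color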
Output: ocrmypdf_comp.pdf Result looks quite decent.
This is still different from other optimisation techniques I've seen, where the same document would be cut up in a number of monochrome areas each encoded at 1bpp at a high resolution (300dpi). For this specific document you would only need three colours (black, grey and orange). With each area encoded at 300dpi you would get an excellent result (also allowing for crisp prints) and a very small pdf file size.
Yes, there are other compression techniques out there, such as special-casing bitonal images or otherwise images with a few colours, but that is not -currently- implemented in what I linked above. What I linked above implements MRC (https://en.wikipedia.org/wiki/Mixed_raster_content) - much like the commercial luratech/foxit offerings - which works great for all kinds of scanned documents. The OpenJPEG version is worse quality- and compression-wise than the kakadu version, but it only uses free software, so I figured it was better for testing it out. The technique is much like what is described here: https://www.youtube.com/watch?v=RmAPYpizl3M
I'm happy to work with you and others on other encoding techniques to better encode certain input documents, but the solution I linked above works on any input document (as long as the images are not transparent). I can easily add a flag for 1-bit input (or detect it) and just encode the whole thing with JBIG2. But anything more DjVu-like will take quite some work (and is also limited in some ways).
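For context, a very rough sketch of the MRC split itself (not archive-pdf-tools' actual algorithm; a crude global threshold stands in for real text/background segmentation):

import numpy as np
from PIL import Image

page = np.array(Image.open("scan.png").convert("RGB"))
gray = np.array(Image.open("scan.png").convert("L"))
mask = gray < 128                                    # text/foreground selector

foreground = page.copy()
foreground[~mask] = 0                                # keeps only the text colors
background = page.copy()
background[mask] = 255                               # text crudely "inpainted" away

# In the PDF: the background is drawn first (downsampled JPEG2000), then the
# foreground is drawn over it at low bitrate with the binary mask attached as
# its /SMask (JBIG2), matching the image/image/smask triplets listed above.
Image.fromarray((mask * 255).astype(np.uint8)).convert("1").save("mask.png")
Image.fromarray(foreground).save("fg.png")
Image.fromarray(background).save("bg.png")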
My hope was that if there is demand for compression (as in this issue) we can look at integrating a tool like pdfcomp into the OCRmyPDF workflow, and then look at extending the features that pdfcomp offers.
I agree that a lot of effort must have gone into DjVu, especially the efficiency of the expired patented routines. With the progress of AI and processing power I would expect new approaches to come within reach. I'm curious whether the complexity of selecting the best mix of technologies could be organized in an open-source community. A commercial product like Acrobat Pro does recognize and match characters, but not with the purpose of optimal and fully automatic compression.
But I believe the compressed quality of this picture is still better than when I compress it with PDF24 to 150dpi 75% quality, which is still twice as big: output_pdf215075pct.pdf
So I believe that even with the rather big size it's quite competitive with current alternatives.
@jknockaert looking at the expected quality, would you think more of the quality that you can achieve via the non-open-source DjVu route: base220611-000.pdf (32,9kb, so just a quarter of what we have now)
@jknockaert looking at the expected quality, would you think more of the quality that you can achieve via the non-open-source DjVu route: base220611-000.pdf (32,9kb, so just a quarter of what we have now)
This file doesn't contain the text layer, for what it is worth. If you run this through recode_pdf with default params you will get a 40 kB PDF file with the text layer. Part of the reason I think it looks a bit poor is that the scan itself is actually not of very high quality or resolution - at least the image I took from output_pdf215075pct.pdf. But again, I think that if we want to discuss the various ways of doing PDF compression, this particular issue might not be the best place? Not sure, I'm just trying to move this feature for OCRmyPDF forward.
In any case, if there's interest in integrating this, I'd appreciate some guidances from @jbarlow83 or others on where to do such an integration.
Should I look at making a plugin that uses the default OCRmyPDF license, and then shells out to pdfcomp? And how would we let users potentially pass parameters if they want to? We can pass them along on the command line, but I am wondering more specifically about the OCRmyPDF part. So: if there were some command to shell out to that would aggressively compress PDFs, how would you think it ought to be integrated for the users?
My idea is to have some "presets" - high quality generic compression, lower quality generic compression, bitonal compression, etc. And potentially we could have the user specify if they want JPEG or JPEG2000, and what encoders they'd want to use. (These things matter somewhat to the required dependencies as well as the quality)
There must be a snap version of jbig2enc for Ubuntu 20:
https://snapcraft.io/install/jbig2enc/ubuntu
However, snap applications aren't allowed to access /tmp.
Should I look at making a plugin that uses the default OCRmyPDF license, and then shells out to pdfcomp? And how would we let users potentially pass parameters if they want to? We can pass them along on the command line, but I am wondering more specifically about the OCRmyPDF part. So: if there were some command to shell out to that would aggressively compress PDFs, how would you think it ought to be integrated for the users?
My idea is to have some "presets" - high quality generic compression, lower quality generic compression, bitonal compression, etc. And potentially we could have the user specify if they want JPEG or JPEG2000, and what encoders they'd want to use. (These things matter somewhat to the required dependencies as well as the quality)
Just to add to this, it would be best to use lossless images in OCRmyPDF when attempting to compress the PDF later with pdfcomp. For example, if the input images are PNG or TIFF, it would be best not to make a PDF with JPEGs and then have that be compressed with pdfcomp. It'd be better to just insert the PNGs losslessly and let the compression tool sort it out - this prevents additional compression artifacts from sneaking in.
Something like this ought to result in compressed, but decent quality PDFs (of course, insert your own dpi):
$ ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 0000.tiff out.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 156.90page/s]
OCR: 100%|█████████████████████████████████████████████████████████| 1.0/1.0 [00:59<00:00, 59.38s/page]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████| 1/1 [00:09<00:00, 9.80s/page]
Output file is a PDF/A-2B (as expected)
$ pdfcomp out.pdf out_c.pdf
Compression factor: 253.28204508856683
$ pdfimages -list out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2552 3508 rgb 3 8 image no 10 0 600 600 12.0M 47%
$ pdfimages -list out_c.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 850 1168 rgb 3 8 jpx no 15 0 200 200 1450B 0.0%
1 1 image 2552 3508 rgb 3 8 jpx no 16 0 600 600 18.2K 0.1%
1 2 smask 2552 3508 gray 1 1 jbig2 no 16 0 600 600 18.0K 1.6%
$ ls -lsh 0000.tiff out.pdf out_c.pdf
26M -rw-r--r-- 1 merlijn merlijn 26M Jun 12 16:03 0000.tiff
52K -rw-r--r-- 1 merlijn merlijn 49K Jun 12 16:18 out_c.pdf
13M -rw-r--r-- 1 merlijn merlijn 13M Jun 12 16:22 out.pdf
(the initial version of pdfcomp is here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2479534970)
What source picture leads to such low sizes? Looking at the figures, with MRC I am used to lower-resolution pictures for both the background and the foreground-color images; only the jb2 image should be at the original resolution. However, in your overview I see image 1 is still 600x600 dpi. Are you using ROI coding to reduce the real resolution in that jpx? When I follow the link I find a seemingly different set of wheels for archive-pdf-tools; there is no pdfcomp when I search that git repo.
The picture is a scan of a bank statement, so I can't share that photo, but it's just a white page with a logo and some text on it. The foreground is typically not downsampled at all, only the background is, so that is 'normal'. The latest version I linked above should contain pdfcomp, but it's not in the master branch, because it is still experimental. You can see it here: https://github.com/internetarchive/archive-pdf-tools/tree/pdf-metadata-tooling
I downloaded the latest integrated pdfcomp and repeated the steps, with an old, non-smudgy A4 ING bank statement, scanned at 600 dpi straight to TIFF. The paper structure is visible.
robert@robert-virtual-machine:~$ ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatement.tiff out.pdf
WARNING - --pdfa-image-compression argument has no effect when --output-type is not 'pdfa', 'pdfa-1', or 'pdfa-2'
INFO - Input file is not a PDF, checking if it is an image...
INFO - Input file is an image
INFO - Input image has no ICC profile, assuming sRGB
INFO - Image seems valid. Try converting to PDF...
INFO - Successfully converted to PDF, processing...
Scan: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 112.61page/s]
INFO - Using Tesseract OpenMP thread limit 3
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:40<00:00, 40.10s/page]
INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 8.75× larger than the input file.
Possible reasons for this include:
Optimization was disabled.
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_c.pdf
Compression factor: 47.39642946807007
robert@robert-virtual-machine:~$ pdfimages -list out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 5196 7001 rgb 3 8 image no 10 0 600 600 21.4M 21%
robert@robert-virtual-machine:~$ pdfimages -list out_c.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1732 2333 rgb 3 8 jpx no 16 0 200 200 47.3K 0.4%
1 1 image 5196 7001 rgb 3 8 jpx no 17 0 600 600 355K 0.3%
1 2 smask 5196 7001 gray 1 1 jbig2 no 17 0 600 600 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh bankstatement.tiff out.pdf out_c.pdf
2,5M -rw-r----- 1 robert robert 2,5M jun 12 22:48 bankstatement.tiff
464K -rw-rw-r-- 1 robert robert 463K jun 12 22:54 out_c.pdf
22M -rw-rw-r-- 1 robert robert 22M jun 12 22:52 out.pdf
I looked at the huge picture (image 1) of 355K; it should only contain the colorization, but it is very detailed and huge.
The machine has kdu_compress installed.
I have made an issue here (https://github.com/internetarchive/archive-pdf-tools/issues/51) so that we don't need to bother others with some implementation details wrt compression. Let's figure out if we can make this work for you in the way that it works for me, and we can report back here.
v13.5.0 (when released, currently testing) will add support for a plugin hook to replace ocrmypdf's default optimizer. As previously promised.
Hopefully this will make it easier to test and better integrate these changes with ocrmypdf.
Of course I'd prefer, where technically and legally possible, to incorporate improvements directly into ocrmypdf.
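Based on that description, a hedged sketch of what such a plugin might look like (the hook name optimize_pdf and its signature are assumed from the announcement above, not verified against the released pluginspec):

import subprocess
from ocrmypdf import hookimpl

@hookimpl
def optimize_pdf(input_pdf, output_pdf, context, executor, linearize):
    # Replace the built-in optimizer: hand the already-OCRed PDF to an
    # external recoder such as pdfcomp and return the path it produced.
    subprocess.run(["pdfcomp", str(input_pdf), str(output_pdf)], check=True)
    return output_pdf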
Is your feature request related to a problem? Please describe. My use case is "scanning" documents with a smartphone camera, then archiving those "scans" as low-quality monochrome images. But OCR should be done beforehand on the high-quality images.
I describe this in more detail here: https://github.com/jbarlow83/OCRmyPDF/issues/443#issuecomment-618589203
Furthermore I see a discussion covering a similar topic here: #293
Describe the solution you'd like I want greater control over the quality of the images embedded into the PDF (after doing OCR). I can imagine these possible solutions (each point is a complete solution):
Additional context I'm currently evaluating how to achieve my goal with the least effort. I see two approaches:
I'm not sure about the second approach - where would be a good point to start? One approach could be:
@jbarlow83 Does this sound right?