Open MerlijnWajer opened 2 years ago
Moving from https://github.com/ocrmypdf/OCRmyPDF/issues/541:
I downloaded the latest integrated pdfcomp and repeated the steps with an old, non-smudgy A4 ING bank statement, scanned at 600 dpi straight to TIFF. The paper structure is visible.
robert@robert-virtual-machine:~$ ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatement.tiff out.pdf
WARNING - --pdfa-image-compression argument has no effect when --output-type is not 'pdfa', 'pdfa-1', or 'pdfa-2'
INFO - Input file is not a PDF, checking if it is an image...
INFO - Input file is an image
INFO - Input image has no ICC profile, assuming sRGB
INFO - Image seems valid. Try converting to PDF...
INFO - Successfully converted to PDF, processing...
Scan: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 112.61page/s]
INFO - Using Tesseract OpenMP thread limit 3
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:40<00:00, 40.10s/page]
INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 8.75× larger than the input file. Possible reasons for this include: Optimization was disabled.
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_c.pdf
Compression factor: 47.39642946807007
robert@robert-virtual-machine:~$ pdfimages -list out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 5196 7001 rgb 3 8 image no 10 0 600 600 21.4M 21%
robert@robert-virtual-machine:~$ pdfimages -list out_c.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1732 2333 rgb 3 8 jpx no 16 0 200 200 47.3K 0.4%
1 1 image 5196 7001 rgb 3 8 jpx no 17 0 600 600 355K 0.3%
1 2 smask 5196 7001 gray 1 1 jbig2 no 17 0 600 600 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh bankstatement.tiff out.pdf out_c.pdf
2,5M -rw-r----- 1 robert robert 2,5M jun 12 22:48 bankstatement.tiff
464K -rw-rw-r-- 1 robert robert 463K jun 12 22:54 out_c.pdf
22M -rw-rw-r-- 1 robert robert 22M jun 12 22:52 out.pdf
I looked at the huge image (image 1) of 355K; it should only contain the colorization, but it is very detailed and huge.
I have kdu_compress installed.
I think the pdfcomp that you are using probably does not use kakadu, but rather openjpeg, for which the right defaults are still a bit up in the air. If you have kakadu installed, I can give you a build that uses kakadu, or you can wait just a bit for me to implement the same flags for pdfcomp that recode_pdf also has (regarding re-compression of images).
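For readers new to the thread: the MRC (mixed raster content) model these tools implement splits a page into a full-resolution binary mask plus low-resolution foreground and background layers. Below is a toy sketch of that idea in plain Python; it is not the actual archive-pdf-tools code, which uses far smarter binarisation and layer inpainting.

```python
def mrc_split(page, downsample=3):
    """Toy MRC decomposition of a grayscale page.

    `page` is a list of rows of 0-255 ints. Real encoders (recode_pdf,
    didjvu) do much more; this only illustrates the three-layer model.
    """
    # Dark pixels become the full-resolution bitonal mask (JBIG2/CCITT).
    mask = [[px < 128 for px in row] for row in page]
    # Foreground keeps the colours under the mask; background is the
    # page with the text "erased". Both continuous-tone layers are
    # stored at reduced resolution (JPEG 2000 in the PDFs above).
    fg = [[px if px < 128 else 0 for px in row] for row in page]
    bg = [[255 if px < 128 else px for px in row] for row in page]
    shrink = lambda img: [row[::downsample] for row in img[::downsample]]
    return mask, shrink(fg), shrink(bg)
```

The mask compresses extremely well as bilevel data, and the two colour layers tolerate heavy downsampling, which is where the large compression factors in this thread come from.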
I have started a new github action build for the version that uses kakadu here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2484980161
Unfortunately no change whatsoever in compression factor:
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_ckakadu.pdf
Compression factor: 47.39642946807007
robert@robert-virtual-machine:~$ which pdfcomp
/usr/local/bin/pdfcomp
robert@robert-virtual-machine:~$ ls -al /usr/local/bin/pdfcomp
-rwxr-xr-x 1 root root 1003 jun 12 22:50 /usr/local/bin/pdfcomp
however
robert@robert-virtual-machine:~$ which compress-pdf-images
/home/robert/.local/bin/compress-pdf-images
robert@robert-virtual-machine:~$ ls -al /home/robert/.local/bin/compress-pdf-images
-rwxrwxr-x 1 robert robert 4922 jun 11 13:23 /home/robert/.local/bin/compress-pdf-images
I'll raw copy the latest version.
An improvement with the new raw-copied version of compress-pdf-images:
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_ckakadu3.pdf
Compression factor: 105.51087492544183
robert@robert-virtual-machine:~$ pdfimages -list out_ckakadu3.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1732 2333 rgb 3 8 jpx no 16 0 200 200 10.6K 0.1%
1 1 image 5196 7001 rgb 3 8 jpx no 17 0 600 600 137K 0.1%
1 2 smask 5196 7001 gray 1 1 jbig2 no 17 0 600 600 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh out_ckakadu3.pdf
208K -rw-rw-r-- 1 robert robert 208K jun 13 22:01 out_ckakadu3.pdf
DjVu uses about 25 dpi for the foreground picture:
https://www.cs.tufts.edu/~nr/cs257/archive/leon-bottou/jei-1998.ps.gz
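As a quick sanity check on those numbers, a downsample factor maps the scan resolution to the stored layer resolution by simple integer division (fg_downsample is the setting mentioned in this thread; the helper name here is mine):

```python
def layer_dpi(scan_dpi, downsample):
    # Stored resolution of an MRC layer after integer downsampling.
    return scan_dpi // downsample

# A 600 dpi scan with the default fg_downsample=3 keeps the foreground
# at 200 dpi; fg_downsample=12 brings it down to 50 dpi, close to the
# ~25 dpi DjVu uses for its foreground picture.
print(layer_dpi(600, 3))   # 200
print(layer_dpi(600, 12))  # 50
```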
With fg_downsample=12 inside compress-pdf-images:
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_ckakadu4.pdf
Compression factor: 300.4531241641256
However, one unacceptable artifact appears; the rest of the page is fine (screenshots comparing the original and the output are not reproduced here).
robert@robert-virtual-machine:~$ pdfimages -list out_ckakadu4.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1732 2333 rgb 3 8 jpx no 16 0 200 200 10.6K 0.1%
1 1 image 433 583 rgb 3 8 jpx no 17 0 50 50 2183B 0.3%
1 2 smask 5196 7001 gray 1 1 jbig2 no 17 0 600 601 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh out_ckakadu4.pdf
76K -rw-rw-r-- 1 robert robert 74K jun 13 22:35 out_ckakadu4.pdf
However, it also appears the foreground is fainter, probably because its colors are blurred together with the surrounding background.
I've run the bank statement through the fully open source didjvu. It shows neither the faint text nor the strange artifact caused by bad foreground/background choices.
robert@robert-virtual-machine:~/didjvu$ ./didjvu encode ../bankstatement.tiff -d 600 --lossy -o bankstatementdi.djvu ../bankstatement.tiff:
The resulting numbers:
FORM:DJVU [90592]
INFO [10] DjVu 5196x7001, v24, 600 dpi, gamma=2.2
Sjbz [16853] JB2 bilevel data
FG44 [41040] IW4 data #1, 100 slices, v1.2 (color), 866x1167
BG44 [6126] IW4 data #1, 74 slices, v1.2 (color), 1732x2334
BG44 [6580] IW4 data #2, 10 slices
BG44 [2044] IW4 data #3, 6 slices
BG44 [17878] IW4 data #4, 7 slices
All BG44 slices together form the background picture.
By default, the foreground picture has half the resolution of the background.
When I reduce the foreground further, to only 1/4 of the background resolution, I see no visual difference in the result:
robert@robert-virtual-machine:~/didjvu$ ./didjvu encode ../bankstatement.tiff -d 600 --lossy --fg-subsample 12 -o bankstatementdifg12.djvu ../bankstatement.tiff:
0.010 bits/pixel; 55.995:1, 98.21% saved, 2566076 bytes in, 45827 bytes out
FORM:DJVU [45815]
INFO [10] DjVu 5196x7001, v24, 600 dpi, gamma=2.2
Sjbz [16844] JB2 bilevel data
FG44 [13042] IW4 data #1, 100 slices, v1.2 (color), 433x584
BG44 [4111] IW4 data #1, 74 slices, v1.2 (color), 1732x2334
BG44 [2982] IW4 data #2, 10 slices
BG44 [442] IW4 data #3, 6 slices
BG44 [8323] IW4 data #4, 7 slices
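The ratios didjvu reports follow directly from the byte counts and page geometry, and the layer dimensions match the subsample factors:

```python
width, height = 5196, 7001            # page size in pixels at 600 dpi
bytes_in, bytes_out = 2566076, 45827  # as reported by didjvu

print(round(bytes_out * 8 / (width * height), 3))  # ~0.010 bits/pixel
print(round(bytes_in / bytes_out, 3))              # ~55.995 : 1

# Layer widths are the page width divided by the subsample factor,
# rounded up: 5196/3 for the background, 5196/12 for the foreground
# at --fg-subsample 12 (matching the 1732 and 433 widths above).
print(-(-width // 3), -(-width // 12))
```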
These numbers translated back to jbig2 and jpeg2000 (via djvutoy, but probably straightforward to rewrite):
robert@robert-virtual-machine:~/didjvu$ pdfimages -list bankstatementdifg12.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1732 2334 rgb 3 8 jpx no 1 0 200 201 15.6K 0.1%
1 1 image 433 584 rgb 3 8 jpx no 2 0 50 51 63.8K 8.6%
1 2 smask 5196 7001 gray 1 1 jbig2 no 2 0 600 600 17.0K 0.4%
ls -lsh bankstatementdifg12.pdf
100K -rw------- 1 robert robert 98K jun 13 23:43 bankstatementdifg12.pdf
For some reason the foreground picture, 13K in DjVu, is still quite big (64K) in the resulting PDF for such a small picture. There must be room for improvement.
Yeah, something seems wrong there. Can you share the changes you made to get to this point?
Alternatively, you can just use recode_pdf, as it currently allows toying with the parameters more.
For my understanding, what is your end goal / use case specifically? To get DjVu-like compression in PDFs?
I only added fg_downsample=12 inside compress-pdf-images, next to the existing fg_downsample=3.
I think the most convenient goal for me would be to be able to scan paperwork that is handed to me or sent to me by mail, so I can distribute it again and store it electronically without clogging a lot of mailboxes.
Usually it concerns letters with some message, some logo, and sometimes even a signature.
DjVu shows how small you can go at what quality, so it shows the room for improvement, as far as PDF permits. There are cases where some AI would probably perform better, but DjVu has had quite a lot of development and fine-tuning in the past, so it could serve as an example as far as patents and copyright permit.
I just tried some bg_slope values, and 43000 results in this: ocrmypdf_compkakadufullfgbgslope43000.pdf
robert@robert-virtual-machine:~/Downloads$ pdfimages -list ocrmypdf_compkakadufullfgbgslope43000.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 826 1166 rgb 3 8 jpx no 16 0 100 100 19.2K 0.7%
1 1 image 2480 3500 rgb 3 8 jpx no 17 0 300 300 22.5K 0.1%
1 2 smask 2480 3500 gray 1 1 jbig2 no 17 0 300 300 29.4K 2.8%
With this setting the background image has a size proportional to the other two images, and the logo on the left improves a lot.
Right, there's room to toy with some of this. It depends on how well the binarisation algorithm works, whether it knows about the DPI, and in general on the specific content being compressed. The default settings for all archive.org books are:
fg_slope 45000
bg_slope 44250
with 3x downsampling. This should correspond mostly to the current defaults of the tool.
In some cases where we might want higher accuracy at the cost of compression, we use:
fg_slope 44500
bg_slope 43500
with no bg downsampling.
I saw didjvu generate the layers; that's definitely interesting for certain content. The hard part (for me) is automatically knowing what content I am dealing with -- which is why the parameters are generalised for all cases, which means they're not optimal for any particular use case, but don't need tweaking.
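To summarize the settings quoted in this exchange (the preset names below are invented for this sketch, not actual archive-pdf-tools options; the slope values are Kakadu rate-control slopes, where a higher slope discards more data and compresses harder):

```python
# Illustrative summary of the slope presets discussed above.
MRC_PRESETS = {
    # Default for all archive.org books; roughly the tool's defaults.
    "archive_default": {"fg_slope": 45000, "bg_slope": 44250, "bg_downsample": 3},
    # Higher accuracy at the cost of compression.
    "higher_accuracy": {"fg_slope": 44500, "bg_slope": 43500, "bg_downsample": 1},
}

# Lower slope -> less data discarded, so the accuracy preset keeps more detail.
assert MRC_PRESETS["higher_accuracy"]["bg_slope"] < MRC_PRESETS["archive_default"]["bg_slope"]
```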
Hi, I'm interested in the pdfcomp tool to get the MRC compression de-coupled from the hOCR rendering. I'm unclear on the current status, or how I'd try it out. Installing from pip does not appear to give me a pdfcomp executable (version 1.5.2). (I am fairly new to Python.) Is it only available in master (or another non-master branch)? Are there instructions for trying it out? How finished is it considered, and is there any ongoing work on it?
My interest was sparked by realizing that recode_pdf's hOCR rendering (internetarchivepdf 1.5.2) is not as good as what Tesseract 5.3.0 renders itself: recode_pdf's line heights are not as well fitted, sometimes extremely so, and are sometimes not as well placed. This is indeed curious, and I could file a separate issue with reproducible details if you are interested.
But, so, it would be nice to take PDFs with hOCR rendered/positioned by Tesseract (perhaps using Tesseract's textonly_pdf feature), and then apply MRC encoding with a tool like pdfcomp, to de-couple hOCR rendering from MRC compression. Tesseract itself of course does not do MRC compression.
As a side note, I think in some discussion of extracting the MRC functionality, perhaps over in the OCRmyPDF repo, there was some consideration of supporting MRC with an alternate compression algorithm so jbig2enc is not required? I have been having trouble installing jbig2enc in some environments, and am also somewhat confused by its mostly-unmaintained status, with several different forks carrying different possible bugfixes/improvements. I'd be interested if archive-pdf-tools wanted to support an alternative to this dependency (or if you wanted to package a "blessed" and maintained version with archive-pdf-tools!).
Hi @jrochkind - I'll try to get back to you with some instructions on how to try it out. A few brief answers right now:
I'm surprised that the hOCR -> PDF rendering is different in your experience. I'd definitely like to have that fixed, since I wrote the code based on the Tesseract code. Maybe I'm behind on some fixes. Please do file a separate issue for that.
Technically the compression can be decoupled from hOCR, but I am not sure about the results we would get without hOCR.
You can use ccitt instead of JBIG2 via --mask-compression ccitt. It won't compress quite as well as JBIG2, but it should just work (and it's still pretty good).
jbig2enc has to be compiled manually against the right version of Leptonica, as there is no packaged version. As far as I can see, jbig2enc is updated for Leptonica 1.83 as of January 9th:
https://github.com/agl/jbig2enc/commit/ea050190466f5336c69c6a11baa1cb686677fcab
Thanks for the quick answer!
I will file a separate ticket about my HOCR rendering findings.
Yes, by that I only meant exactly what I understand the pdfcomp tool is doing -- extract the text from the PDF, for instance. I meant using a different tool to render the hOCR than to apply the MRC compression, so you can mix and match "best of breed" -- exactly what I understand pdfcomp already supports.
Thanks for the ccitt info; I will make a note of that and try it out.
I was also curious -- if the PDF we are giving to pdfcomp already has a JPG in it for the raster image, are we worried about lossy => lossy further image losses from the pdfcomp process?
(And as an aside not relevant for this ticket, but I wasn't sure where to ask it -- I'm curious if anyone has managed to get archive-pdf-tools installed on MacOS, or has any idea of whether that might even be feasible. I have had no luck, and was guessing that it's not intended for that and not feasible without a lot of work).
As macOS is a kind of Unix, I would expect all components to be compilable -- all sources are available -- but I don't know whether anyone has spent the effort to make it a smooth process, and as I possess no Mac or Hackintosh I can't try. As there are efforts to run Linux on M1, there might be shorter virtualization routes. If you don't fear the size of the result, there is a way OCRmyPDF can keep a JPG and add OCR'ed text. Then you won't need MRC.
MacOS supports these via Homebrew: https://ocrmypdf.readthedocs.io/en/latest/jbig2.html
Yep, jbig2 wasn't actually the problem on MacOS. I'll open a separate issue about that, just to keep track of it for any other interested parties, since it's really a separate thing, sorry for bringing it up here.
I was also curious -- if the pdf we are giving to pdfcomp already has a JPG in it for the raster image, are we worried about lossy=>lossy further image losses from the pdfcomp process?
I suppose somewhat, but the whole process is lossy anyway. The better the input image quality, the better the output will be. It feels a bit like garbage in, garbage out. I am not sure if there is a way to fix this.
(And as an aside not relevant for this ticket, but I wasn't sure where to ask it -- I'm curious if anyone has managed to get archive-pdf-tools installed on MacOS, or has any idea of whether that might even be feasible. I have had no luck, and was guessing that it's not intended for that and not feasible without a lot of work).
I'm happy to try to help you get set up with this. We do build MacOS wheels, but I've personally never tested them. (I only use Linux). Maybe in a separate issue? EDIT: Just saw you already made an issue for it. :)
The tool needs command line arguments much like recode_pdf (which we might want to rename), and probably those flags ought to be mostly shared. Let's also use this issue to discuss the experiences of people testing pdfcomp now.