internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

pdfcomp: new tool, discussion, compression questions #51

Open MerlijnWajer opened 2 years ago

MerlijnWajer commented 2 years ago

The tool needs command line arguments much like recode_pdf (which we might want to rename), and those flags ought to be mostly shared between the two.

Let's also use this to discuss issues of people testing pdfcomp now.
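A minimal sketch of how the shared flags could be factored out into one helper that both tools call. The helper name is hypothetical; --mask-compression appears later in this thread, and the downsample flags mirror the fg_downsample parameter discussed below, so take the exact names as illustrative only.

```python
# Sketch only: one helper attaching the compression flags both
# recode_pdf and pdfcomp would share. Flag names are illustrative.
import argparse

def add_compression_args(parser):
    """Attach the shared image-compression flags to an existing parser."""
    parser.add_argument('--fg-downsample', type=int, default=3,
                        help='Foreground downsample factor')
    parser.add_argument('--bg-downsample', type=int, default=3,
                        help='Background downsample factor')
    parser.add_argument('--mask-compression', choices=('jbig2', 'ccitt'),
                        default='jbig2', help='Bilevel mask codec')
    return parser

recode_parser = add_compression_args(argparse.ArgumentParser(prog='recode_pdf'))
pdfcomp_parser = add_compression_args(argparse.ArgumentParser(prog='pdfcomp'))
```

Each tool would then keep only its own tool-specific flags in its own parser.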

MerlijnWajer commented 2 years ago

Moving from https://github.com/ocrmypdf/OCRmyPDF/issues/541:

I downloaded the latest integrated pdfcomp and repeated the steps, with an old A4, not smudgy ING-bankstatement, scanned at 600 dpi straight to TIFF. The paper-structure is visible.

robert@robert-virtual-machine:~$ ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatement.tiff out.pdf
WARNING - --pdfa-image-compression argument has no effect when --output-type is not 'pdfa', 'pdfa-1', or 'pdfa-2'
   INFO - Input file is not a PDF, checking if it is an image...
   INFO - Input file is an image
   INFO - Input image has no ICC profile, assuming sRGB
   INFO - Image seems valid. Try converting to PDF...
   INFO - Successfully converted to PDF, processing...
Scan: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 112.61page/s]
   INFO - Using Tesseract OpenMP thread limit 3
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:40<00:00, 40.10s/page]
   INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 8.75× larger than the input file.
Possible reasons for this include:
Optimization was disabled.
robert@robert-virtual-machine:~$ pdfcomp out.pdf out_c.pdf
Compression factor: 47.39642946807007
robert@robert-virtual-machine:~$ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    5196  7001  rgb     3   8  image  no        10  0   600   600 21.4M  21%
robert@robert-virtual-machine:~$ pdfimages -list out_c.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1732  2333  rgb     3   8  jpx    no        16  0   200   200 47.3K 0.4%
   1     1 image    5196  7001  rgb     3   8  jpx    no        17  0   600   600  355K 0.3%
   1     2 smask    5196  7001  gray    1   1  jbig2  no        17  0   600   600 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh bankstatement.tiff out.pdf out_c.pdf
2,5M -rw-r----- 1 robert robert 2,5M jun 12 22:48 bankstatement.tiff
464K -rw-rw-r-- 1 robert robert 463K jun 12 22:54 out_c.pdf
 22M -rw-rw-r-- 1 robert robert  22M jun 12 22:52 out.pdf
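The "Compression factor" pdfcomp prints lines up with the file sizes in the listing above; presumably (an assumption, not verified against the source) it is simply the input size divided by the output size:

```python
# Assumption: pdfcomp's "Compression factor" is input size / output size.
import os

def compression_factor(in_path, out_path):
    return os.path.getsize(in_path) / os.path.getsize(out_path)

# Checking against the listing above (out.pdf ~22 MiB, out_c.pdf ~463 KiB):
factor = (22 * 1024 * 1024) / (463 * 1024)
print(round(factor, 1))  # ~48.7, in the same ballpark as the reported 47.4
```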

I looked at the huge picture (image 1) of 355K; it should only contain the colorization, but it is very detailed and large.

I have kdu_compress installed.

I think the pdfcomp that you are using probably does not use Kakadu, but rather OpenJPEG, for which the right defaults are still a bit up in the air. If you have Kakadu installed, I can give you a build that uses it, or you can wait just a bit for me to implement the same flags for pdfcomp that recode_pdf already has (regarding re-compression of images).
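A sketch of how a tool might pick the JPEG 2000 encoder backend: prefer Kakadu's kdu_compress when it is on PATH and fall back to OpenJPEG's opj_compress otherwise. This is illustrative only; archive-pdf-tools' actual selection logic may differ.

```python
# Sketch: prefer Kakadu, fall back to OpenJPEG (illustrative only).
import shutil

def pick_jpeg2000_encoder():
    for exe in ('kdu_compress', 'opj_compress'):
        path = shutil.which(exe)
        if path:
            return exe, path
    raise RuntimeError('no JPEG 2000 encoder found on PATH')
```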

I have started a new github action build for the version that uses kakadu here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2484980161

rmast commented 2 years ago

Unfortunately no change whatsoever in compression factor:

robert@robert-virtual-machine:~$ pdfcomp out.pdf out_ckakadu.pdf
Compression factor: 47.39642946807007

robert@robert-virtual-machine:~$ which pdfcomp
/usr/local/bin/pdfcomp
robert@robert-virtual-machine:~$ ls -al /usr/local/bin/pdfcomp
-rwxr-xr-x 1 root root 1003 jun 12 22:50 /usr/local/bin/pdfcomp

however

robert@robert-virtual-machine:~$ which compress-pdf-images
/home/robert/.local/bin/compress-pdf-images
robert@robert-virtual-machine:~$ ls -al /home/robert/.local/bin/compress-pdf-images
-rwxrwxr-x 1 robert robert 4922 jun 11 13:23 /home/robert/.local/bin/compress-pdf-images

I'll raw copy the latest version.

rmast commented 2 years ago

An improvement with the new raw copied version of compress-pdf-images

pdfcomp out.pdf out_ckakadu3.pdf
Compression factor: 105.51087492544183

robert@robert-virtual-machine:~$ pdfcomp out.pdf out_ckakadu3.pdf
Compression factor: 105.51087492544183
robert@robert-virtual-machine:~$ pdfimages -list out_ckakadu3.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1732  2333  rgb     3   8  jpx    no        16  0   200   200 10.6K 0.1%
   1     1 image    5196  7001  rgb     3   8  jpx    no        17  0   600   600  137K 0.1%
   1     2 smask    5196  7001  gray    1   1  jbig2  no        17  0   600   600 48.7K 1.1%
robert@robert-virtual-machine:~$ ls -lsh out_ckakadu3.pdf
208K -rw-rw-r-- 1 robert robert 208K jun 13 22:01 out_ckakadu3.pdf
rmast commented 2 years ago

DjVu uses about 25 dpi for the foreground-picture:

https://www.cs.tufts.edu/~nr/cs257/archive/leon-bottou/jei-1998.ps.gz


rmast commented 2 years ago

With fg_downsample=12 inside compress-pdf-images:

robert@robert-virtual-machine:~$ pdfcomp out.pdf out_ckakadu4.pdf
Compression factor: 300.4531241641256

However, one unacceptable artifact appears; the rest of the page is fine. (Screenshots of the artifact and of the corresponding area in the original were attached.)

robert@robert-virtual-machine:~$ pdfimages -list out_ckakadu4.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1732  2333  rgb     3   8  jpx    no        16  0   200   200 10.6K 0.1%
   1     1 image     433   583  rgb     3   8  jpx    no        17  0    50    50 2183B 0.3%
   1     2 smask    5196  7001  gray    1   1  jbig2  no        17  0   600   601 48.7K 1.1%

robert@robert-virtual-machine:~$ ls -lsh out_ckakadu4.pdf
76K -rw-rw-r-- 1 robert robert 74K jun 13 22:35 out_ckakadu4.pdf

However, the foreground also appears fainter, probably because its colors are blurred together with the surrounding background.
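The dimensions in the pdfimages listing above can be checked arithmetically: with fg_downsample=12, the 600 dpi, 5196x7001 page yields the 433x583 / 50 dpi foreground image (assuming the factor divides both pixel dimensions and DPI, with floor rounding).

```python
# Checking the fg_downsample=12 foreground dimensions against the page size.
width, height, dpi = 5196, 7001, 600
fg_downsample = 12
print(width // fg_downsample, height // fg_downsample, dpi // fg_downsample)
# → 433 583 50, matching the pdfimages -list output above
```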

rmast commented 2 years ago

I've run the bank statement through the fully open source didjvu. It shows neither the faint text nor the strange artifact caused by bad foreground/background choices.

robert@robert-virtual-machine:~/didjvu$ ./didjvu encode ../bankstatement.tiff -d 600 --lossy -o bankstatementdi.djvu ../bankstatement.tiff:

The resulting numbers:

FORM:DJVU [90592] 
    INFO [10]         DjVu 5196x7001, v24, 600 dpi, gamma=2.2
    Sjbz [16853]      JB2 bilevel data
    FG44 [41040]      IW4 data #1, 100 slices, v1.2 (color), 866x1167
    BG44 [6126]       IW4 data #1, 74 slices, v1.2 (color), 1732x2334
    BG44 [6580]       IW4 data #2, 10 slices
    BG44 [2044]       IW4 data #3, 6 slices
    BG44 [17878]      IW4 data #4, 7 slices

All BG44 slices together form the background picture.

By default, the foreground picture has half the resolution of the background.
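The chunk dimensions above can be checked arithmetically, assuming a background subsample factor of 3 and a foreground factor of 6 (half the background resolution), with ceiling rounding; these factors are inferred from the listed dimensions, not from didjvu's documentation.

```python
# Checking the BG44/FG44 chunk dimensions against the 5196x7001 page,
# assuming subsample factors of 3 (background) and 6 (foreground).
import math

page_w, page_h = 5196, 7001
bg = (math.ceil(page_w / 3), math.ceil(page_h / 3))
fg = (math.ceil(page_w / 6), math.ceil(page_h / 6))
print(bg, fg)  # → (1732, 2334) (866, 1167), matching the dump above
```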

When I diminish the foreground further, to only 1/4 of the background resolution, I see no visual differences in the result:

robert@robert-virtual-machine:~/didjvu$ ./didjvu encode ../bankstatement.tiff -d 600 --lossy --fg-subsample 12 -o bankstatementdifg12.djvu ../bankstatement.tiff:

These numbers translated back to JBIG2 and JPEG 2000 (via djvutoy, but probably straightforward to rewrite):

robert@robert-virtual-machine:~/didjvu$ pdfimages -list bankstatementdifg12.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1732  2334  rgb     3   8  jpx    no         1  0   200   201 15.6K 0.1%
   1     1 image     433   584  rgb     3   8  jpx    no         2  0    50    51 63.8K 8.6%
   1     2 smask    5196  7001  gray    1   1  jbig2  no         2  0   600   600 17.0K 0.4%

ls -lsh bankstatementdifg12.pdf
100K -rw------- 1 robert robert 98K jun 13 23:43 bankstatementdifg12.pdf

For some reason the 13K foreground picture ends up quite big (64K) in the resulting PDF, for such a small picture. There must be room for improvement.

MerlijnWajer commented 2 years ago

For some reason the 13K foreground picture ends up quite big (64K) in the resulting PDF, for such a small picture. There must be room for improvement.

Yeah, something seems wrong there. Can you share the changes you made to get to this point?

Alternatively, you can just use recode_pdf, as it currently allows toying with the parameters more.

For my understanding, what is your end goal / use case specifically? To get DjVu-like compression in PDFs?

rmast commented 2 years ago

I only added fg_downsample=12 inside compress-pdf-images, next to the existing fg_downsample=3.

I think the most convenient goal for me would be to be able to scan paperwork that is handed over or sent to me by mail, so that I can distribute it again and store it electronically without clogging up a lot of mailboxes.

Usually it concerns letters with some message and some logo and sometimes even an autograph.

DjVu shows how small you can go at what quality, so it shows room for improvement as far as PDF permits. There are cases where some AI would probably perform better, but DjVu has had quite a lot of development and fine-tuning in the past, so it could serve as an example as far as patents and copyright permit.

rmast commented 2 years ago

I just tried some bg_slope values, and 43000 results in this: ocrmypdf_compkakadufullfgbgslope43000.pdf

robert@robert-virtual-machine:~/Downloads$ pdfimages -list ocrmypdf_compkakadufullfgbgslope43000.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     826  1166  rgb     3   8  jpx    no        16  0   100   100 19.2K 0.7%
   1     1 image    2480  3500  rgb     3   8  jpx    no        17  0   300   300 22.5K 0.1%
   1     2 smask    2480  3500  gray    1   1  jbig2  no        17  0   300   300 29.4K 2.8%

With these settings the background image has a size proportional to the other two images, and the logo on the left improves a lot.

MerlijnWajer commented 2 years ago

Right, there's room to toy with some of this. It depends on how well the binarisation algorithm works, whether it knows about the DPI, and in general on the specific content being compressed. The default settings for all archive.org books are:

This should correspond mostly to the current defaults of the tool.

In some cases where we might want higher accuracy at the cost of compression, we use:

MerlijnWajer commented 2 years ago

I saw that didjvu generates layers; that's definitely interesting for certain content. The hard part (for me) is automatically knowing what content I am dealing with, which is why the parameters are generalised for all cases. That means they're not optimal for any particular use case, but they don't need tweaking.

jrochkind commented 1 year ago

Hi, I'm interested in the pdfcomp tool to get the MRC compression decoupled from the hOCR rendering. I'm unclear on the current status, or how I'd try it out. Installing from pip does not appear to give me a pdfcomp executable (version 1.5.2). (I am fairly new to Python.) Is it only available in master (or another non-master branch)? Are there instructions for trying it out? How finished is it considered, and is there any ongoing work on it?

My interest is sparked by realizing that recode_pdf's hOCR rendering (internetarchivepdf 1.5.2) is not as good as what Tesseract 5.3.0 renders itself. recode_pdf's line heights are not as well fitted (sometimes extremely so), and the lines are sometimes not as well placed. This is indeed curious, and I could file a separate issue with reproducible details if you are interested.

But it would be nice to take PDFs with the text rendered/positioned by Tesseract (perhaps using Tesseract's textonly_pdf feature), and then apply MRC encoding with a tool like pdfcomp, to decouple the hOCR rendering from the MRC compression. Tesseract itself, of course, does not do MRC compression.
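The decoupled pipeline described here can be sketched as two subprocess calls: let Tesseract render the searchable PDF with its own text placement, then hand the result to pdfcomp for MRC compression, invoked as shown earlier in this thread. Paths and the temporary base name are illustrative.

```python
# Sketch of the decoupled pipeline: Tesseract renders the searchable PDF,
# pdfcomp then applies MRC compression. File names are illustrative.
import subprocess

def ocr_then_compress(image_path, out_pdf):
    # Tesseract writes <base>.pdf; 'pdf' selects its PDF output config.
    subprocess.run(['tesseract', image_path, 'ocr_tmp', 'pdf'], check=True)
    subprocess.run(['pdfcomp', 'ocr_tmp.pdf', out_pdf], check=True)
```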

As a side note, I think in some discussion of extracting the MRC functionality, perhaps over in the OCRmyPDF repo, there was some consideration of supporting MRC with an alternate compression algorithm so that jbig2enc is not required. I have been having trouble installing jbig2enc in some environments, and am also somewhat confused by its mostly-unmaintained status, with several different forks carrying different possible bugfixes/improvements. I'd be interested if archive-pdf-tools wanted to support an alternative to this dependency (or if you wanted to package a "blessed" and maintained version with archive-pdf-tools!).

MerlijnWajer commented 1 year ago

Hi @jrochkind - I'll try to get back to you with some instructions on how to try it out. A few brief answers right now:

  1. I'm surprised that the hOCR -> PDF rendering is different in your experience. I'd definitely like to have that fixed, since I wrote the code based on the Tesseract code. Maybe I'm behind on some fixes. Please do file a separate issue for that.

  2. Technically the compression can be decoupled from hOCR, but I am not sure about the results we would get without hOCR.

  3. You can use CCITT instead of JBIG2, which is not quite as good as JBIG2 but should just work. Use --mask-compression ccitt for this; it won't compress quite as well (it's still pretty good, though).
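As a standalone illustration of what the CCITT fallback stores (not archive-pdf-tools' own code): CCITT Group 4 is a bilevel fax codec that is universally supported in PDF and needs no external encoder, unlike JBIG2. Pillow can write it, assuming a Pillow build with libtiff support:

```python
# Illustration only: CCITT Group 4 encoding of a bilevel mask via Pillow.
# G4 needs no external encoder, but compresses less well than JBIG2.
from PIL import Image

mask = Image.new('1', (5196, 7001), 1)      # blank bilevel page
mask.save('mask_g4.tiff', compression='group4')
```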

rmast commented 1 year ago

jbig2enc has to be compiled manually against the right version of Leptonica, as there is no packaged version. As far as I can see, jbig2enc was updated for Leptonica 1.83 on January 9th:

https://github.com/agl/jbig2enc/commit/ea050190466f5336c69c6a11baa1cb686677fcab

jrochkind commented 1 year ago

Thanks for the quick answer!

  1. I will file a separate ticket about my HOCR rendering findings.

  2. Yes, by that I only meant exactly what I understand the pdfcomp tool to be doing (extracting the text from the PDF, for instance): using a different tool to render the hOCR than to apply the MRC compression, so you can mix and match best-of-breed tools. Exactly what I understand pdfcomp already supports.

  3. Awesome for ccitt info, I will make a note of that and try it out.

I was also curious -- if the pdf we are giving to pdfcomp already has a JPG in it for the raster image, are we worried about lossy=>lossy further image losses from the pdfcomp process?

(And as an aside not relevant for this ticket, but I wasn't sure where to ask it -- I'm curious if anyone has managed to get archive-pdf-tools installed on MacOS, or has any idea of whether that might even be feasible. I have had no luck, and was guessing that it's not intended for that and not feasible without a lot of work).

rmast commented 1 year ago

As macOS is a kind of Unix, I would expect all components to be compilable, since all sources are available, but I don't know whether anyone has spent the effort to make it a smooth process; as I possess no Mac or Hackintosh, I can't try. As there are efforts to run Linux on M1, there might also be shorter virtualization routes. If you don't fear the size of the result, there is a way OCRmyPDF can keep a JPG and add the OCR'ed text; then you won't need MRC.

rmast commented 1 year ago

MacOS supports these via Homebrew: https://ocrmypdf.readthedocs.io/en/latest/jbig2.html

jrochkind commented 1 year ago

Yep, jbig2 wasn't actually the problem on MacOS. I'll open a separate issue about that, just to keep track of it for any other interested parties, since it's really a separate thing, sorry for bringing it up here.

MerlijnWajer commented 1 year ago

I was also curious -- if the pdf we are giving to pdfcomp already has a JPG in it for the raster image, are we worried about lossy=>lossy further image losses from the pdfcomp process?

I suppose somewhat, but the whole process is lossy anyway. The better the input image quality, the better the output will be. It feels a bit like garbage in, garbage out; I am not sure if there is a way to fix this.

(And as an aside not relevant for this ticket, but I wasn't sure where to ask it -- I'm curious if anyone has managed to get archive-pdf-tools installed on MacOS, or has any idea of whether that might even be feasible. I have had no luck, and was guessing that it's not intended for that and not feasible without a lot of work).

I'm happy to try to help you get set up with this. We do build macOS wheels, but I've personally never tested them (I only use Linux). Maybe in a separate issue? EDIT: Just saw you already made an issue for it. :)