ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

6x file size increase with industrial scanner (mixed jpg ccitt, many stencils in file?) #269

Open randfee opened 6 years ago

randfee commented 6 years ago

The output PDF grows by more than a factor of six for files scanned with our industrial scanner/copier as soon as I add --deskew or --clean-final. Here's a demo file: Instructions.pdf

ocrmypdf -l eng Instructions.pdf Instructions_justOCR.pdf  --> output size = 1.55x input size

ocrmypdf -l eng --deskew Instructions.pdf 'Instructions_OCR&deskew.pdf'  --> output size = 6.57x input size

It might have something to do with the file having JPEG pages as well as CCITT ones. Also, the file has multiple stencils per page. I checked for that using: pdfimages -list Instructions.pdf

If I had to guess, it'd make sense to ignore all those stencils and just go for the images.

randfee@ubuntu:/TestScans$ pdfimages -list Instructions.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
   1     0 image    1240   874  gray    1   8  jpeg   no       109  0   150   150 16.9K 1.6%
   1     1 stencil  2304   248  -       1   1  ccitt  no       110  0   301   300 9758B  14%
   1     2 stencil   696   128  -       1   1  ccitt  no       111  0   300   300 1449B  13%
   1     3 stencil   632   132  -       1   1  ccitt  no       112  0   300   300  663B 6.4%
   1     4 stencil    24    12  -       1   1  ccitt  no       113  0   300   300   19B  53%
   1     5 stencil    24    88  -       1   1  ccitt  no       114  0   300   300   47B  18%
   1     6 stencil    24    12  -       1   1  ccitt  no       115  0   300   300   20B  56%
   1     7 stencil   152    24  -       1   1  ccitt  no       116  0   300   300   52B  11%
   1     8 stencil    40    32  -       1   1  ccitt  no       117  0   300   300   61B  38%
   2     9 image    1240   874  gray    1   8  jpeg   no         3  0   150   150 22.9K 2.2%
   2    10 stencil    40   140  -       1   1  ccitt  no         4  0   300   300   86B  12%
   2    11 stencil  1704   192  -       1   1  ccitt  no        11  0   300   300 3508B 8.6%
   2    12 stencil  1480   352  -       1   1  ccitt  no        12  0   300   300 4007B 6.2%
   2    13 stencil   640    60  -       1   1  ccitt  no        13  0   300   300  665B  14%
   2    14 stencil    40    40  -       1   1  ccitt  no        14  0   300   300   54B  27%
   2    15 stencil    48    48  -       1   1  ccitt  no        15  0   300   300   49B  17%
   2    16 stencil    48    48  -       1   1  ccitt  no        16  0   300   300   33B  11%
   2    17 stencil    32    44  -       1   1  ccitt  no        17  0   300   300   46B  26%
   2    18 stencil    48    96  -       1   1  ccitt  no        18  0   300   300   71B  12%
   2    19 stencil    56    12  -       1   1  ccitt  no         5  0   300   300   49B  58%
   2    20 stencil    32    24  -       1   1  ccitt  no         6  0   300   300   19B  20%
   2    21 stencil    48    12  -       1   1  ccitt  no         7  0   300   300   53B  74%
   2    22 stencil    24    12  -       1   1  ccitt  no         8  0   300   300   24B  67%
   2    23 stencil    24    12  -       1   1  ccitt  no         9  0   300   300   38B 106%
   2    24 stencil    16    60  -       1   1  ccitt  no        10  0   300   300   37B  31%
   3    25 image    1240   874  gray    1   8  jpeg   no        22  0   150   150 8946B 0.8%
   3    26 stencil   536   204  -       1   1  ccitt  no        23  0   301   300 1095B 8.0%
   3    27 stencil  2032   916  -       1   1  ccitt  no        24  0   300   300 14.5K 6.4%
   3    28 stencil    88    88  -       1   1  ccitt  no        25  0   300   300  173B  18%
   3    29 stencil    64    80  -       1   1  ccitt  no        26  0   300   300   86B  13%
   3    30 stencil    24    20  -       1   1  ccitt  no        27  0   300   300   27B  45%
   3    31 stencil    48    12  -       1   1  ccitt  no        28  0   300   300   74B 103%
   4    32 image    1240   874  gray    1   8  jpeg   no        32  0   150   150 11.8K 1.1%
   4    33 stencil  1920   484  -       1   1  ccitt  no        33  0   300   300 9223B 7.9%
   4    34 stencil   648    48  -       1   1  ccitt  no        35  0   300   300  666B  17%
   4    35 stencil   544    16  -       1   1  ccitt  no        36  0   300   300   24B 2.2%
   4    36 stencil   304    12  -       1   1  ccitt  no        37  0   300   300  136B  30%
   4    37 stencil    32    44  -       1   1  ccitt  no        38  0   300   300   54B  31%
   4    38 stencil    40    44  -       1   1  ccitt  no        39  0   300   300   49B  22%
   4    39 stencil    24    12  -       1   1  ccitt  no        40  0   300   300   17B  47%
   4    40 stencil    24    12  -       1   1  ccitt  no        41  0   300   300   17B  47%
   4    41 stencil    24    16  -       1   1  ccitt  no        42  0   300   300   26B  54%
   4    42 stencil   264    12  -       1   1  ccitt  no        34  0   300   300   17B 4.3%
   5    43 image    1240   874  gray    1   8  jpeg   no        46  0   150   150 13.8K 1.3%
   5    44 stencil  1984   448  -       1   1  ccitt  no        47  0   300   300 5673B 5.1%
   5    45 stencil   648    48  -       1   1  ccitt  no        48  0   300   300  634B  16%
   5    46 stencil   408    12  -       1   1  ccitt  no        49  0   300   300   17B 2.8%
   5    47 stencil    96    44  -       1   1  ccitt  no        50  0   300   300  108B  20%
   5    48 stencil    32    44  -       1   1  ccitt  no        51  0   300   300   49B  28%
   5    49 stencil    64    80  -       1   1  ccitt  no        52  0   300   300  107B  17%
   5    50 stencil    24    20  -       1   1  ccitt  no        53  0   300   300   31B  52%
   6    51 image    1240   874  gray    1   8  jpeg   no        57  0   150   150 16.3K 1.5%
   6    52 stencil  2000   712  -       1   1  ccitt  no        58  0   300   300 12.2K 7.0%
   6    53 stencil    64    84  -       1   1  ccitt  no        59  0   300   300  114B  17%
   6    54 stencil    16    20  -       1   1  ccitt  no        60  0   300   300   20B  50%
   6    55 stencil    40    24  -       1   1  ccitt  no        61  0   300   300   43B  36%
   6    56 stencil   680    64  -       1   1  ccitt  no        62  0   300   300 1075B  20%
   7    57 image    1240   874  gray    1   8  jpeg   no        66  0   150   150 22.4K 2.1%
   7    58 stencil  2096   776  -       1   1  ccitt  no        67  0   300   301 8053B 4.0%
   7    59 stencil   112    80  -       1   1  ccitt  no        72  0   300   300  120B  11%
   7    60 stencil    40    44  -       1   1  ccitt  no        73  0   300   300   44B  20%
   7    61 stencil    32    40  -       1   1  ccitt  no        74  0   300   300   50B  31%
   7    62 stencil    64    44  -       1   1  ccitt  no        75  0   300   300   80B  23%
   7    63 stencil    48    32  -       1   1  ccitt  no        76  0   300   300   63B  33%
   7    64 stencil    24    20  -       1   1  ccitt  no        77  0   300   300   25B  42%
   7    65 stencil    48    56  -       1   1  ccitt  no        78  0   300   300  123B  37%
   7    66 stencil    48    56  -       1   1  ccitt  no        79  0   300   300  124B  37%
   7    67 stencil    32    36  -       1   1  ccitt  no        68  0   300   301   38B  26%
   7    68 stencil    32    36  -       1   1  ccitt  no        69  0   300   301   43B  30%
   7    69 stencil    32    36  -       1   1  ccitt  no        70  0   300   301   45B  31%
   7    70 stencil    16    44  -       1   1  ccitt  no        71  0   300   300   32B  36%
   8    71 image    1240   874  gray    1   8  jpeg   no        83  0   150   150 22.2K 2.1%
   8    72 stencil  2008  1164  -       1   1  ccitt  no        84  0   300   300 17.9K 6.3%
   8    73 stencil   224    12  -       1   1  ccitt  no        85  0   300   300  183B  54%
   8    74 stencil    16   168  -       1   1  ccitt  no        86  0   300   300   79B  24%
   8    75 stencil    96    80  -       1   1  ccitt  no        87  0   300   300   84B 8.8%
   8    76 stencil    32    16  -       1   1  ccitt  no        88  0   300   300   19B  30%
   8    77 stencil   120    40  -       1   1  ccitt  no        89  0   300   300  302B  50%
   8    78 stencil    16    28  -       1   1  ccitt  no        90  0   300   300   24B  43%
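
A listing like the one above can be condensed into counts per image type with a short script. This is only a sketch: the function name `summarize` is made up for illustration, and it assumes the 16-column layout shown above (other pdfimages versions may print different columns). The sample rows are copied from page 1 of the listing.

```python
from collections import Counter

def summarize(listing):
    """Count (type, enc) pairs and track the highest x-ppi per type."""
    counts = Counter()
    max_ppi = {}
    for line in listing.splitlines():
        cols = line.split()
        # Data rows have 16 columns and start with a page number;
        # the header row starts with the word "page" and is skipped.
        if len(cols) != 16 or not cols[0].isdigit():
            continue
        kind, enc, x_ppi = cols[2], cols[8], int(cols[12])
        counts[(kind, enc)] += 1
        max_ppi[kind] = max(max_ppi.get(kind, 0), x_ppi)
    return counts, max_ppi

sample = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
   1     0 image    1240   874  gray    1   8  jpeg   no       109  0   150   150 16.9K 1.6%
   1     1 stencil  2304   248  -       1   1  ccitt  no       110  0   301   300 9758B  14%
   1     2 stencil   696   128  -       1   1  ccitt  no       111  0   300   300 1449B  13%
"""

counts, max_ppi = summarize(sample)
print(counts)
print(max_ppi)
```

Run against the full listing, this should make the pattern easy to see: one 150 ppi JPEG image per page plus a number of roughly 300 ppi CCITT stencils.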

randfee commented 6 years ago

I just realised that this seems similar to #261.

jbarlow83 commented 6 years ago

Yes, it's the same.

jbarlow83 commented 6 years ago

Never mind, this case turned out to be a bit different due to the internal structure of the PDF.

The changes in version 7.x do not help with this file. The JPEG resolution (150x150) is half that of the stencils (300x300), and to deskew I have to rasterize everything at the resolution of the stencils.
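
To get a rough sense of what that rasterization costs, the arithmetic can be sketched as below, using the 1240x874 background image from the pdfimages listing. This only counts pixels; it is not a model of how the output is actually encoded, and the helper name `rasterized_size` is made up for illustration.

```python
def rasterized_size(width_px, height_px, src_dpi, target_dpi):
    """Pixel dimensions after resampling an image from src_dpi to target_dpi."""
    scale = target_dpi / src_dpi
    return round(width_px * scale), round(height_px * scale)

# Background JPEG from the listing: 1240x874 px at 150 dpi.
src_w, src_h = 1240, 874
new_w, new_h = rasterized_size(src_w, src_h, src_dpi=150, target_dpi=300)

# How many times more pixels the 300 dpi raster holds.
pixel_ratio = (new_w * new_h) / (src_w * src_h)
print(new_w, new_h, pixel_ratio)  # 2480 1748 4.0
```

So the grayscale layer alone ends up with four times as many pixels, and the 1-bit stencil content is presumably merged into that same raster, where it can no longer be stored as compact CCITT data.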

I have thought about implementing deskew by instructing the PDF viewer to draw the page contents at an angle that compensates for the skew, but I strongly suspect some second-tier viewers might not handle this well. It could also have undesirable effects for, say, a digitally inserted watermark that is perfectly aligned on top of a scanned page. I could see how that goes.
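
For reference, a minimal sketch of what that idea amounts to: PDF content streams accept a `cm` operator with six numbers that modify the current transformation matrix, so a small rotation about the page center could be prepended to the page contents. The function name `deskew_cm` and the example angle are assumptions for illustration, and the page size in points is derived from the 1240x874 px, 150 dpi image in the listing. This is not something OCRmyPDF is known to do.

```python
import math

def deskew_cm(angle_deg, page_w, page_h):
    """Build a 'cm' operator rotating content by angle_deg (counter-clockwise)
    about the page center. page_w and page_h are in PDF points."""
    t = math.radians(angle_deg)
    # PDF maps (x, y) to (a*x + c*y + e, b*x + d*y + f).
    a, b, c, d = math.cos(t), math.sin(t), -math.sin(t), math.cos(t)
    cx, cy = page_w / 2, page_h / 2
    # Choose the translation so the page center stays where it is.
    e = cx - (a * cx + c * cy)
    f = cy - (b * cx + d * cy)
    return f"{a:.6f} {b:.6f} {c:.6f} {d:.6f} {e:.6f} {f:.6f} cm"

# 1240 px / 150 dpi * 72 = 595.2 pt wide, 874 px / 150 dpi * 72 = 419.52 pt high.
print(deskew_cm(1.5, 595.2, 419.52))
```

The resulting string would have to be wrapped in a `q` ... `Q` pair around the existing page contents, and anything drawn outside that pair (such as a watermark added later) would not be rotated, which is the alignment concern mentioned above.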

Image segmentation and color reduction on the output side would also fix this, effectively by reconstructing a similarly optimized PDF.