change `/Filter [/FlateDecode /DCTDecode]` to `/Filter /DCTDecode`

maadjordan commented 5 years ago

I've found this https://www.usmodernist.org/AF/AF-1928-01-1.PDF

It seems that all scanned JFIF images in it are stored as deflated DCT streams. Is it possible to strip the deflate layer safely, or to convert these into DCT-only streams? Uncompressing the file still preserves this layer, and PSO will still run the streams through its deflate optimization.

zvezdochiot commented 5 years ago

$ mutool info AF-1928-01-1.PDF 
AF-1928-01-1.PDF:

PDF-1.4

Pages: 202

Retrieving info from pages 1-202...
Mediaboxes (135):
    1   (97 0 R):   [ 0 0 619.56 878.04 ]
    3   (7 0 R):    [ 0 0 623.16 884.16 ]
...
    199 (1633 0 R): [ Flate DCT ] 1691x2457 8bpc DevRGB (1637 0 R)
    200 (1639 0 R): [ Flate DCT ] 1687x2454 8bpc DevRGB (1643 0 R)
    201 (1645 0 R): [ Flate DCT ] 1681x2448 8bpc DevRGB (1649 0 R)
    202 (1651 0 R): [ Flate DCT ] 1673x2443 8bpc DevRGB (1655 0 R)
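
To count such streams without mutool, a rough check can scan the raw PDF bytes for the filter array. This is only a sketch: the function name `count_flate_dct` is made up for illustration, and it assumes the `/Filter` array is written inline in the stream dictionary rather than behind an indirect reference.

```python
import re

# Matches a /Filter array that applies FlateDecode first and DCTDecode second.
# Assumption: the array is written inline in the stream dictionary, not stored
# behind an indirect reference (those would be missed by this scan).
FLATE_DCT_RE = re.compile(rb"/Filter\s*\[\s*/FlateDecode\s*/DCTDecode\s*\]")


def count_flate_dct(pdf_bytes):
    """Return how many inline /Filter [/FlateDecode /DCTDecode] entries occur."""
    return len(FLATE_DCT_RE.findall(pdf_bytes))
```

Calling `count_flate_dct(open("AF-1928-01-1.PDF", "rb").read())` should give a figure comparable to the number of `[ Flate DCT ]` entries that mutool lists, as long as the assumption above holds for the file.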

See https://github.com/pts/pdfsizeopt/issues/95

zvezdochiot commented 5 years ago

@maadjordan say> using PSO will still run it through deflate optimizing

You can:

use pdfimages (https://github.com/freedesktop/poppler) to extract images:
```
pdfimages -j AF-1928-01-1.PDF i
```
use jpegquant (https://github.com/ImageProcessing-ElectronicPublications/jpegquant) to reduce DCT coefficients (lossy):
```
mkdir jq25; for tjpg in *.jpg; do jpegquant -q 25 "$tjpg" "jq25/$tjpg"; done
```
use jpegrescan (https://github.com/kud/jpegrescan) to optimize compression of DCT coefficients (lossless):
```
mkdir jr; for tjpg in *.jpg; do jpegrescan "$tjpg" "jr/$tjpg"; done
```
use img2pdf (https://github.com/josch/img2pdf) to convert (lossless):
```
for tjpg in *.jpg; do img2pdf -d 200 -o "$tjpg.pdf" "$tjpg"; done
```

use pdftk (https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) to merge pages:

pdftk *.pdf cat output book.pdf

PS: OCR layer will be lost.

$ ls -l
-rw-r--r-- 1 user user 94997395 Jul 21 09:23 AF-1928-01-1.PDF
-rw-r--r-- 1 user user 34663639 Jul 21 09:50 book.pdf

maadjordan commented 5 years ago

Thanks for the prompt reply. I managed to find a Windows build of "pdfimages", but not of img2pdf, jpegquant or jpegrescan.

jpegquant and jpegrescan can be replaced with jpeg-recompress and mozjpeg for lossy or lossless optimization.

Can you provide a link to latest compiled version of img2pdf ?

Also, some images are CCITT-encoded, which XnView cannot display. Is there a way to view these? Are they not recognized by PSO for passing through to the JBIG2 encoder?

zvezdochiot commented 5 years ago

@maadjordan say> Can you provide a link to latest compiled version of img2pdf ?

img2pdf is a Python script that uses the PIL library. I don't know how Python support works on your OS. There is no such problem on Debian.

maadjordan commented 5 years ago

It could be like the pso exe files: it's Python wrapped into an exe.

zvezdochiot commented 5 years ago

@maadjordan say> it could be like pso exe files.

Maybe. Ask the developer: https://gitlab.mister-muffin.de/josch/img2pdf

maadjordan commented 5 years ago

I managed to compile img2pdf into a Windows exe file using https://gitlab.mister-muffin.de/josch/img2pdf/issues/8

zvezdochiot commented 5 years ago

@maadjordan say> I managed to compile img2pdf

Instead of jpegrescan use Voralent Antelope.

https://www.google.com/search?q=Voralent+Antelope

pts commented 5 years ago

FYI pdfsizeopt doesn't have any features right now to do JPEG (re)compression.

maadjordan commented 5 years ago

> @maadjordan say> I managed to compile img2pdf
>
> Instead of jpegrescan use Voralent Antelope.
>
> https://www.google.com/search?q=Voralent+Antelope

It's a GUI for jpegtran, pngquant and other tools. Nothing special.

maadjordan commented 5 years ago

> FYI pdfsizeopt doesn't have any features right now to do JPEG (re)compression.

I know and I will be waiting for this feature.

My main question was to simplify the file processing: the JPEG files are wrapped in a deflate stream, which means that a reader needs to inflate and then decode the JPEG data, and both steps require RAM. Simplifying it would reduce RAM use considerably, so such a feature would be good to add.

Also, on the same pages I found deflated CCITT streams, and PSO failed to pass them to the JBIG2 encoder.

zvezdochiot commented 5 years ago

@maadjordan say> I know and I will be waiting for this feature.

See https://github.com/pts/pdfsizeopt/issues/95

@pts say> It would be possible to add lossy optimizations (which can be enabled with a command-line flag) in general and lossy image optimizations with external tools such as jpeg-recompress in particular, but that would need substantial software development and maintenance work, and that would need either funding or volunteering (i.e. pull requests).

pts commented 5 years ago

> Also, on the same pages I found deflated CCITT streams, and PSO failed to pass them to the JBIG2 encoder.

This shouldn't be happening. maadjordan@, please report this as a separate issue, and attach the offending PDF file to the issue.

pts commented 5 years ago

> simplify the file processing: the JPEG files are wrapped in a deflate stream, which means that a reader needs to inflate and then decode the JPEG data, and both steps require RAM. Simplifying it would reduce RAM use considerably, so such a feature would be good to add.

OK, if I understand you correctly, you want pdfsizeopt to change `/Filter [/FlateDecode /DCTDecode]` to `/Filter /DCTDecode` (and similarly for `/Filter [/FlateDecode /JPXDecode]`) after decompressing the flate-compressed stream.
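
As an illustration only, the rewrite could look roughly like the sketch below. It is not pdfsizeopt code: the plain `dict` is a stand-in for a parsed stream dictionary, and `strip_outer_flate` is a hypothetical helper name.

```python
import zlib

# Filter chains that this sketch knows how to simplify.
STRIPPABLE = (
    ["/FlateDecode", "/DCTDecode"],
    ["/FlateDecode", "/JPXDecode"],
)


def strip_outer_flate(stream_dict, data):
    """Return (new_dict, new_data) with the leading /FlateDecode removed.

    stream_dict is a plain dict standing in for a parsed PDF stream
    dictionary, and data is the raw stream bytes. Both are hypothetical
    representations, not pdfsizeopt's own object model.
    """
    filters = stream_dict.get("/Filter")
    if filters not in STRIPPABLE:
        return stream_dict, data  # Leave every other stream untouched.
    if "/DecodeParms" in stream_dict:
        return stream_dict, data  # Decode parameters are out of scope here.
    new_data = zlib.decompress(data)  # Undo only the outer deflate layer.
    new_dict = dict(stream_dict)
    new_dict["/Filter"] = filters[1]  # Keep the image filter as a single name.
    new_dict["/Length"] = len(new_data)
    return new_dict, new_data
```

The JPEG or JPEG 2000 bytes are never decoded here, so the image data itself is unchanged; only the deflate wrapper and the `/Filter` and `/Length` entries are modified.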

This is possible to do, but it's unlikely to make the PDF file any smaller, and the overall goal of pdfsizeopt (with its default settings) is to make PDF files smaller.
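
A quick way to see why the size barely moves: deflate gains very little on data that is already entropy-coded. The sketch below uses random bytes as a stand-in for the compressed body of a JPEG, which is an assumption; real JPEG files also carry headers and tables that deflate can shrink slightly.

```python
import os
import zlib

# Assumption: random bytes stand in for the entropy-coded body of a JPEG,
# which is similarly hard for deflate to shrink.
payload = os.urandom(1 << 20)

# Deflate at the highest compression level, as a PDF writer might.
deflated = zlib.compress(payload, 9)

ratio = len(deflated) / len(payload)
print("deflated/original size ratio: %.4f" % ratio)
```

The ratio comes out very close to 1, so adding or removing the deflate layer changes the stream size only marginally in either direction.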

To make this happen, https://github.com/pts/pdfsizeopt/blob/33ec5e5c637fc8967d6d238dfdaf8c55605efe83/lib/pdfsizeopt/main.py#L8143 needs to be adjusted to allow /DCTDecode and /JPXDecode, and GetUncompressedStream also needs to be extended so that it won't try to decompress those streams. Also https://github.com/pts/pdfsizeopt/blob/33ec5e5c637fc8967d6d238dfdaf8c55605efe83/lib/pdfsizeopt/main.py#L8131 needs to be removed so that images are not automatically ignored.

I'm keeping this issue open in case anyone wants to pick up this work.

pts / pdfsizeopt

change `/Filter [/FlateDecode /DCTDecode]` to `/Filter /DCTDecode` #127