ho-tex / pdfcrop

pdfcrop perl utility
LaTeX Project Public License v1.3c

Reproducible / Deterministic PDF output not possible #4

Open schreiberx opened 3 years ago

schreiberx commented 3 years ago

Deterministic PDF output is important, e.g. for binary diffs of output files that actually have the same content, as well as for automatically generated PDFs stored in repositories. pdfcrop generates PDFs with a varying ID, which cannot even be removed by editing the meta information.

The varying ID is caused by the varying temporary name of the TeX file used by e.g. pdftex. A program option to prefix the TeX source code with \pdftrailerid{0} would solve this problem.
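
To see the problem described above, one could crop the same input twice and compare the results. The following is only a rough sketch: `is_reproducible` and `show_trailer_id` are hypothetical helper names, and the `grep` pattern assumes the `/ID` entry is stored as plain text in the file, which may not hold for every PDF.

```bash
#!/bin/bash

# Hypothetical helper: succeeds only if both files are byte-identical.
is_reproducible() {
    # cmp -s prints nothing and exits 0 only when the files match exactly
    cmp -s "$1" "$2"
}

# Hypothetical helper: print any plain-text /ID [...] entry found in a file.
# Assumption: the ID array is not hidden inside a compressed stream.
show_trailer_id() {
    # -a treats the binary PDF as text, -o prints only the matching part
    grep -a -o '/ID *\[[^]]*]' "$1"
}

# Example usage (assumes pdfcrop is installed and input.pdf exists):
#   pdfcrop input.pdf run1.pdf
#   pdfcrop input.pdf run2.pdf
#   is_reproducible run1.pdf run2.pdf || echo "outputs differ"
#   show_trailer_id run1.pdf; show_trailer_id run2.pdf
```

If the two runs differ only in the printed `/ID` entry, that would point to the trailer ID as the remaining source of non-determinism.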

u-fischer commented 3 years ago

Hm, yes sounds like a good idea. How do you set the creation date etc? With SOURCE_DATE_EPOCH? Or do you need an option here too?

schreiberx commented 3 years ago

Yes, SOURCE_DATE_EPOCH was my first attempt. This also worked, as far as I remember. Currently, I'm simply using the script below to remove all other meta information from a PDF. It's not really nice, but it works. Only the varying ID introduced by 'pdfcrop' turned out to be the final problem. A more general solution in 'pdfcrop' might be to provide the possibility to add arbitrary TeX code to the code that is generated automatically. Then everyone can add whatever TeX commands he/she likes.

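#! /bin/bash

# Temporary file that receives the qpdf output before it replaces the original
TEMPFILE_QPDF=$(mktemp)

# Note: this loop splits on whitespace, so file names must not contain spaces
for i in $(find "$@" -name "output_*.pdf"); do
    echo "+ Removing metadata: '$i'"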

    # Remove metadata
    OUTPUT=`exiftool -overwrite_original -all= "$i" 2>&1`
    EXIT_CODE=$?

    if [[ $EXIT_CODE != 0 ]]; then
            echo "************************************"
            echo "An error occurred in 'exiftool' with exit code $EXIT_CODE"
            echo ""
            echo "$OUTPUT"
            echo "************************************"
            exit $EXIT_CODE
    fi

    OUTPUT=`qpdf --deterministic-id "$i" "$TEMPFILE_QPDF" 2>&1`
    EXIT_CODE=$?
    if [[ $EXIT_CODE != 0 ]]; then
            echo "************************************"
            echo "An error occurred in 'qpdf' with exit code $EXIT_CODE"
            echo ""
            echo "$OUTPUT"
            echo "************************************"
            exit $EXIT_CODE
    fi

    mv "$TEMPFILE_QPDF" "$i" || exit 1

done
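
For the SOURCE_DATE_EPOCH approach mentioned at the start of this comment, a minimal sketch might look like the following. `run_with_fixed_epoch` is a hypothetical helper name, and whether the variable is honoured depends on the TeX engine and version that pdfcrop calls.

```bash
#!/bin/bash

# Hypothetical helper: run a command with SOURCE_DATE_EPOCH pinned.
# First argument: seconds since the Unix epoch; remaining arguments: the command.
run_with_fixed_epoch() {
    local epoch="$1"
    shift
    # The assignment applies only to the environment of the command being run
    SOURCE_DATE_EPOCH="$epoch" "$@"
}

# Example usage (assumes pdfcrop is installed and input.pdf exists):
#   run_with_fixed_epoch 0 pdfcrop input.pdf output.pdf
```

Pinning the value per command rather than exporting it keeps the rest of the shell session unaffected.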

schreiberx commented 3 years ago

Dear Ulrike, I just figured out another corner case where bit-wise reproducibility is not guaranteed. It's not only \pdftrailerid{0} which needs to be added, but also \pdfsuppressptexinfo15 to ensure a deterministic output. Cheers, Martin
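
As a sketch only, the two settings named above could be collected in a small TeX prefix file. `write_repro_prefix` and the file name are hypothetical, and how such a prefix would be handed to pdfcrop is left open here, since the option is only proposed in this thread.

```bash
#!/bin/bash

# Hypothetical helper: write the two pdfTeX settings from this thread to a file.
# This is not an existing pdfcrop feature, only an illustration of the prefix.
write_repro_prefix() {
    # Quoted 'EOF' keeps the backslashes literal
    cat > "$1" <<'EOF'
\pdftrailerid{0}
\pdfsuppressptexinfo15
EOF
}

# Example usage:
#   write_repro_prefix repro-prefix.tex
```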

habere-et-dispertire commented 1 year ago

Pandoc is trying to address this issue too. They have different experimental solutions for the different TeX engines, which you may find helpful. :-)