matteosecli / pdf2archive

A simple Ghostscript-based PDF to PDF/A-1B converter.
GNU General Public License v3.0

Converting breaks hyperlinks and crossreference links #12

Open einemarc opened 4 years ago

einemarc commented 4 years ago

Hi.

Thanks for sharing pdf2archive. I tested it and it worked on all the PDF documents I tried to convert. However, I noticed that hyperlinks and cross-reference links (e.g. links to other pages in the same document) did not work after converting. It seems as if such links get deleted during the conversion.

The following test files exemplify this better. Original file: pdf2archive-conversion-test.pdf Converted file: pdf2archive-conversion-test-PDFA.pdf

Do you know the reason for this behavior and how easy it would be to resolve it?

Best, Marc

matteosecli commented 4 years ago

Hi @einemarc, thanks for the report! Which version of Ghostscript are you using? I've dug a little bit and since one of the three links is working I suspect it could be related to this: https://tex.stackexchange.com/q/456896

Can you please try adding the option -dPrinted=false to the commands in pdf2archive#L338-L356, i.e. so that those lines become

#=====# DO THE ACTUAL CONVERSION #=====#
echo "  Compressing PDF & embedding fonts..."
run gs $MSGOPTS \
    -dBATCH -dNOPAUSE -dNOOUTERSAVE \
    -dCompatibilityLevel=1.4 \
    -dPrinted=false \
    -dEmbedAllFonts=true -dSubsetFonts=true \
    -dCompressFonts=true -dCompressPages=true \
    -dUseCIEColor -sColorConversionStrategy=RGB \
    -dDownsampleMonoImages=false -dDownsampleGrayImages=false -dDownsampleColorImages=false \
    -dAutoFilterColorImages=false -dAutoFilterGrayImages=false \
    -sDEVICE=pdfwrite \
    -sOutputFile=$TMPFILE $INPUT
echo "  Converting to PDF/A-1B..."
run gs $MSGOPTS \
    -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
    $QUALITYOPTS \
    -dCompatibilityLevel=1.4 -dPDFACompatibilityPolicy=1 \
    -dPrinted=false \
    -dUseCIEColor -sProcessColorModel=DeviceRGB -sColorConversionStrategy=RGB \
    -sOutputICCProfile=$ICCTMPFILE \
    -sDEVICE=pdfwrite \
    -sOutputFile=$OUTPUT $TMPFILE $PSTMPFILE
echo "  Removing temporary files..."
rm $TMPFILE
echo "  Done, now ESSE3 is happy! ;)"

and try the conversion again? I'll try on my own as well asap.

I've tried to convert your document with Adobe Acrobat as well, and indeed the links (both internal and external) are supposed to be working in the PDF/A-1b version: pdf2archive-conversion-test-PDFA_Acrobat.pdf


Unfortunately I haven't had much time to keep this project up to date in the last couple of years, so it's quite possible that new versions of Ghostscript break something. One example is all the warnings about -dUseCIEColor in newer versions, which I have to fix, since without that option one doesn't get a good conversion. I'll try to find some time to do some upgrades to this code, since it seems it can still be useful! 😉

matteosecli commented 4 years ago

I can confirm this has been Ghostscript's intended behavior since 9.24 or so, and that the fact that those links worked in previous versions of Ghostscript was actually a bug.

As reported in GS bug 699830:

If a /Link Annotation does not have the 'Print' bit of the annotations /Flags value set, then the PDF interpreter will (by default) not process the annotation. If the PDF interpreter skips the annotation then the pdfwrite device doesn't ever see it.

If the Annotation (of whatever kind) does set the Print bit, then (again in default setup) the PDF interpreter will process the annotation and pass it to the pdfwrite device.

You can change the behaviour of the interpreter. If you set -dPrinted=false, then the interpreter no longer cares about the Print bit of the annotation flag. In this mode it instead checks the NoView bit, if that isn't set, then it processes the annotation.

In this case, if the NoView bit was set, then the annotation would be skipped.

[...]

[...] the control (-dPrinted) which was supposed to control whether or not the annotation is processed was being ignored. Obviously that's not the way it was supposed to work, and has been fixed.
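The rule described in the quote can be modelled in a few lines of shell. This is only a sketch of the described behaviour, not Ghostscript's actual code: the function name `annot_is_processed` is made up for illustration, and the bit values (Print = 4, NoView = 32) are the ones the PDF specification assigns to the annotation /F flags.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ghostscript or pdf2archive).
# Usage: annot_is_processed FLAGS PRINTED
#   FLAGS   - integer value of the annotation's /F entry (0 if absent)
#   PRINTED - "true" (the default described in the quote) or "false" (-dPrinted=false)
# Prints "kept" or "skipped".
annot_is_processed() {
    flags=$1
    printed=${2:-true}
    if [ "$printed" = "true" ]; then
        # Default mode: only annotations with the Print bit (value 4) are processed.
        if [ $(( flags & 4 )) -ne 0 ]; then echo kept; else echo skipped; fi
    else
        # -dPrinted=false: the Print bit is ignored; the NoView bit (value 32) decides.
        if [ $(( flags & 32 )) -eq 0 ]; then echo kept; else echo skipped; fi
    fi
}

annot_is_processed 0 true     # link without /F, default mode     -> skipped
annot_is_processed 0 false    # same link with -dPrinted=false    -> kept
annot_is_processed 4 true     # link with /F 4 (Print bit set)    -> kept
```

The first case is the one hit by the test file: its link annotations carry no /F entry, so the Print bit is not set.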

In principle, then, setting -dPrinted=false will keep the non-printing hyperlinks. However, since we then convert to PDF/A, these annotations are dropped anyway because the PDF/A standard does not allow them. Indeed, with the --debug flag you get:

...
Processing pages 1 through 2.
Page 1
GPL Ghostscript 9.52: Annotation set to non-printing,
 not permitted in PDF/A, annotation will not be present in output file
GPL Ghostscript 9.52: Annotation set to non-printing,
 not permitted in PDF/A, annotation will not be present in output file
GPL Ghostscript 9.52: Annotation set to non-printing,
 not permitted in PDF/A, annotation will not be present in output file
...
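To see at a glance how many annotations were dropped, the warnings can be counted from a saved log. A minimal sketch, assuming the debug output has been captured to a file; the function name and the file name `conversion.log` are placeholders, and the pattern is simply the warning text shown above.

```shell
# Hypothetical helper: count "non-printing annotation" warnings in a log read from stdin.
count_dropped_annots() {
    # grep -c prints the number of matching lines; "|| true" keeps the exit
    # status at zero when there are no matches.
    grep -c 'Annotation set to non-printing' || true
}

# Example (assumes the debug output was redirected to conversion.log):
# count_dropped_annots < conversion.log
```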

The reason these links get preserved in Adobe's conversion is that since Acrobat Pro 9 the annotations get flattened before getting converted to PDF/A, which in our case means that "non-printing" annotations get set to "printing" as well before saving to PDF/A. I have to figure out a way to flatten the annotations, then.

matteosecli commented 4 years ago

I've dug further and unfortunately I haven't found any way to get around this. See for example: https://bugs.ghostscript.com/show_bug.cgi?id=699582#c2

Additionally, it seems that -dPrinted=false does not retain print annotations, so it's necessarily a choice between print and screen annotations, not both. Since the flag has no effect on the PDF/A conversion, I would avoid adding it. Also, -dPreserveAnnots=false doesn't seem to be useful in this case.

I see the PDF was produced with LibreOffice. Have you tried to compare with LibreOffice's PDF/A output (see e.g. here)?

The only advice I can give you is to use, if you are producing PDFs via LaTeX, the pdfa option of hyperref (see https://tex.stackexchange.com/a/456958):

\usepackage[pdfa]{hyperref}

Other resources on the creation (or at least on the preparation for the best possible conversion) of a PDF/A document, that I should properly document on the README: https://github.com/matteosecli/pdf2archive/issues/3


I'll keep this open for now as a reminder, but I currently have no solution. If at a certain point someone comes up with a proper way of flattening PDF annotations with Ghostscript, I can definitely take a look again.

einemarc commented 4 years ago

Thank you for your comprehensive responses, research and testing.

LibreOffice's PDF/A export feature works fine as far as I know, but I only used it for creating the test file. I probably should have provided more context. I regularly convert a good number of PDF documents (mostly articles from academic journals, whose PDF creation process I cannot influence) to the PDF/A-1b standard. I use Adobe Acrobat's Preflight tool to check each document for compatibility and then convert it using the appropriate corrections and 'setting' the document to PDF/A-1b. I was looking for a simple, free and open-source tool to automate that so I can batch-convert PDF files.

I didn't know how Adobe Acrobat preserves the links, thanks for the explanation. Truthfully, I have always avoided reading up on the nitty gritty of PDF and the PDF/A standard once I realised how vast and complex it is. At least I now know what "flattening" means. ;) I still wonder why Adobe does not provide a tool which can automatically detect what to correct (and what not) in a PDF file for converting it to PDF/A. (Or is there such a software/tool?)

I used Ghostscript 9.27, so a version with the behavior you referred to. I also tried the -dPrinted=false setting and got the same unchanged end result.

If there is a solution at one point, I would appreciate it. But don't worry too much. Unfortunately my programming skills and my knowledge about all this are not at a point where I could help and try to solve the issue myself. Thanks again.

matteosecli commented 4 years ago

I managed to successfully convert your sample file to PDF/A and keep the hyperlinks at the same time.

I basically added an /F 4 flag to the annotations via the following sed command:

cat "pdf2archive-conversion-test.pdf" | LC_ALL=C sed 's:/Type/Annot:/Type/Annot/F 4:g' > "pdf2archive-conversion-test-flagged.pdf"

This is the processed file: pdf2archive-conversion-test-flagged.pdf

This is the file diff after processing:

$ colordiff -a pdf2archive-conversion-test.pdf pdf2archive-conversion-test-flagged.pdf 
204c204
< <</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[56.693 670.089 263.007 683.889]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
---
> <</Type/Annot/F 4/Subtype/Link/Border[0 0 0]/Rect[56.693 670.089 263.007 683.889]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
209c209
< <</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[184.993 642.489 216.707 656.289]/Dest[4 0 R/XYZ 56.7 773.189 0]>>
---
> <</Type/Annot/F 4/Subtype/Link/Border[0 0 0]/Rect[184.993 642.489 216.707 656.289]/Dest[4 0 R/XYZ 56.7 773.189 0]>>
213c213
< <</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[56.693 697.689 114.057 711.489]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
---
> <</Type/Annot/F 4/Subtype/Link/Border[0 0 0]/Rect[56.693 697.689 114.057 711.489]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>

I've wrapped the sed command into a shell script that takes a pdf file as a single argument: makeannotprint.txt (remove the .txt extension after downloading)

$ ./makeannotprint yourfile.pdf

and generates a new file yourfile-flagged.pdf. Note that the script expects a single argument with a .pdf file extension but it does no checks whatsoever. What's more, the sed command appends the /F 4 flag whether or not the flag is already present and it does so for every kind of annotation, not only for the links. So that command is potentially destructive and I'm using it just as a proof-of-concept. 🙂
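A slightly more careful variant of the same idea is sketched below. It assumes that each annotation dictionary sits on a single line and is written exactly as in the diff above (`/Type/Annot` and `/Subtype/Link` with no spaces in between), which holds for this LibreOffice file but not for PDFs in general. The function name `flag_link_annots` is made up. It only touches link annotations and skips lines that already contain an `/F <number>` entry. Like the original command, it changes the file length without updating anything else in the file, so it is still a proof of concept.

```shell
# Hypothetical helper: read a PDF on stdin, write the flagged PDF to stdout.
flag_link_annots() {
    # Only act on lines that describe a link annotation...
    # ...and, among those, only on lines without an existing "/F <digit>" entry.
    LC_ALL=C sed \
        -e '/\/Subtype\/Link/{' \
        -e '/\/F [0-9]/!s:/Type/Annot:/Type/Annot/F 4:' \
        -e '}'
}

# Example (file names as used in this thread):
# flag_link_annots < pdf2archive-conversion-test.pdf > pdf2archive-conversion-test-flagged.pdf
```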

Additionally, this is definitely not the way to properly edit the internals of a PDF. Indeed, when processing the flagged file with pdf2archive --debug, Ghostscript complains about a broken PDF file (issues with the xref table) and emits a warning saying it'll try to repair the PDF, but the output file could have issues.

Nonetheless, if you then process the flagged file via

$ ./pdf2archive pdf2archive-conversion-test-flagged.pdf

it seems that the output is a valid PDF/A with working links: pdf2archive-conversion-test-flagged-PDFA.pdf

So in principle, with the caveat that this workaround can lead to unexpected results, you could first process your input file input.pdf with makeannotprint and then feed it to pdf2archive:

$ ./makeannotprint input.pdf
$ ./pdf2archive input-flagged.pdf input-PDFA.pdf
$ rm input-flagged.pdf

or even add the sed command to the conversion steps inside pdf2archive.
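For batch conversion, the three steps above could be wrapped in a loop. This is a rough sketch with the same caveats as the workaround itself; `convert_all` and the two variables are names invented here, and the script locations are assumed to be the current directory as in the commands above.

```shell
# Paths to the two scripts; override them if they live elsewhere (assumed locations).
MAKEANNOTPRINT=${MAKEANNOTPRINT:-./makeannotprint}
PDF2ARCHIVE=${PDF2ARCHIVE:-./pdf2archive}

# Hypothetical helper: run the flag-then-convert steps on every file given.
convert_all() {
    for input in "$@"; do
        flagged="${input%.pdf}-flagged.pdf"   # name produced by makeannotprint
        output="${input%.pdf}-PDFA.pdf"
        # Skip this file if the flagging step fails.
        "$MAKEANNOTPRINT" "$input" || { echo "flagging failed: $input" >&2; continue; }
        "$PDF2ARCHIVE" "$flagged" "$output" || echo "conversion failed: $input" >&2
        # Remove the intermediate file whether or not the conversion succeeded.
        rm -f "$flagged"
    done
}

# Example: convert_all article1.pdf article2.pdf
```

Passing an explicit list of files rather than `*.pdf` avoids picking up the `-PDFA.pdf` outputs of an earlier run.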

I could indeed add an experimental option to pdf2archive that automatically performs such a brutal edit of the input file, but I should first read up properly on the possible side effects and/or on a cleaner way to perform such an edit.

If you decide to pre-process your files via that brutal sed command and get any useful insights, or just have some feedback on the results when converting multiple documents, please drop a line on this issue, as that can be useful. 😉