Open einemarc opened 4 years ago
Hi @einemarc, thanks for the report! Which version of Ghostscript are you using? I've dug a little bit and since one of the three links is working I suspect it could be related to this: https://tex.stackexchange.com/q/456896
Can you please try adding the option -dPrinted=false to the commands in pdf2archive#L338-L356, i.e. so that those lines become
#=====# DO THE ACTUAL CONVERSION #=====#
echo " Compressing PDF & embedding fonts..."
run gs $MSGOPTS \
-dBATCH -dNOPAUSE -dNOOUTERSAVE \
-dCompatibilityLevel=1.4 \
-dPrinted=false \
-dEmbedAllFonts=true -dSubsetFonts=true \
-dCompressFonts=true -dCompressPages=true \
-dUseCIEColor -sColorConversionStrategy=RGB \
-dDownsampleMonoImages=false -dDownsampleGrayImages=false -dDownsampleColorImages=false \
-dAutoFilterColorImages=false -dAutoFilterGrayImages=false \
-sDEVICE=pdfwrite \
-sOutputFile=$TMPFILE $INPUT
echo " Converting to PDF/A-1B..."
run gs $MSGOPTS \
-dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
$QUALITYOPTS \
-dCompatibilityLevel=1.4 -dPDFACompatibilityPolicy=1 \
-dPrinted=false \
-dUseCIEColor -sProcessColorModel=DeviceRGB -sColorConversionStrategy=RGB \
-sOutputICCProfile=$ICCTMPFILE \
-sDEVICE=pdfwrite \
-sOutputFile=$OUTPUT $TMPFILE $PSTMPFILE
echo " Removing temporary files..."
rm $TMPFILE
echo " Done, now ESSE3 is happy! ;)"
and try the conversion again? I'll try on my own as well asap.
I've tried to convert your document with Adobe Acrobat as well, and indeed the links (both internal and external) are supposed to be working in the PDF/A-1b version: pdf2archive-conversion-test-PDFA_Acrobat.pdf
Unfortunately I haven't had much time to keep this project up to date in the last couple of years, so it's quite possible that new versions of Ghostscript break something, like for example all the warnings about -dUseCIEColor in newer versions, which I have to fix since without that option one doesn't get a good conversion. I'll try to find some time to do some upgrades to this code, since it seems it can still be useful! 😉
I can confirm this is Ghostscript's intended behavior since 9.24 or so, and that the fact that in previous versions of Ghostscript those links were instead working was indeed a bug.
As reported in GS bug 699830:
If a /Link Annotation does not have the 'Print' bit of the annotations /Flags value set, then the PDF interpreter will (by default) not process the annotation. If the PDF interpreter skips the annotation then the pdfwrite device doesn't ever see it.
If the Annotation (of whatever kind) does set the Print bit, then (again in default setup) the PDF interpreter will process the annotation and pass it to the pdfwrite device.
You can change the behaviour of the interpreter. If you set -dPrinted=false, then the interpreter no longer cares about the Print bit of the annotation flag. In this mode it instead checks the NoView bit, if that isn't set, then it processes the annotation.
In this case, if the NoView bit was set, then the annotation would be skipped.
[...]
[...] the control (-dPrinted) which was supposed to control whether or not the annotation is processed was being ignored. Obviously that's not the way it was supposed to work, and has been fixed.
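To make the quoted rule concrete, here is a toy model of the decision in shell. It is not Ghostscript code: the function name and its arguments are made up for illustration, and the bit values (Print = 4, NoView = 32) are the ones defined for the annotation /F entry in the PDF specification.

```shell
#!/bin/sh
# Toy model of the rule quoted from GS bug 699830; NOT Ghostscript code.
# annotation_is_processed is a hypothetical helper name.
#   $1 = integer value of the annotation's /F entry (0 if absent)
#   $2 = value given to -dPrinted: "true" (assumed default here) or "false"
# Returns 0 if the annotation would be passed on to pdfwrite, 1 otherwise.
annotation_is_processed() {
    flags=$1
    printed=${2:-true}
    if [ "$printed" = "true" ]; then
        # Default mode: only annotations with the Print bit (4) are processed.
        [ $(( flags & 4 )) -ne 0 ]
    else
        # -dPrinted=false: the Print bit is ignored, NoView (32) must be clear.
        [ $(( flags & 32 )) -eq 0 ]
    fi
}
```

For example, `annotation_is_processed 0 true` fails (a link without /F is skipped by default), while `annotation_is_processed 0 false` succeeds.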
In principle, then, setting -dPrinted=false will keep the non-printing hyperlinks. However, since we are further converting to PDF/A, these annotations will be dropped anyway because they are not allowed by the PDF/A standard. By using the --debug flag you get indeed:
...
Processing pages 1 through 2.
Page 1
GPL Ghostscript 9.52: Annotation set to non-printing,
not permitted in PDF/A, annotation will not be present in output file
GPL Ghostscript 9.52: Annotation set to non-printing,
not permitted in PDF/A, annotation will not be present in output file
GPL Ghostscript 9.52: Annotation set to non-printing,
not permitted in PDF/A, annotation will not be present in output file
...
The reason these links get preserved in Adobe's conversion is that since Acrobat Pro 9 the annotations get flattened before getting converted to PDF/A, which in our case means that "non-printing" annotations get set to "printing" as well before saving to PDF/A. I have to figure out a way to flatten the annotations, then.
I've dug further and unfortunately I haven't found any way to get around this. See for example: https://bugs.ghostscript.com/show_bug.cgi?id=699582#c2
Additionally, it seems that -dPrinted=false does not retain print annotations, so it's necessarily a choice between print or screen annotations, not both. Since it has no effect on the PDF/A conversion, I would avoid adding this flag. Also, -dPreserveAnnots=false doesn't seem to be useful in this case.
I see the PDF was produced with LibreOffice. Have you tried to compare with LibreOffice's PDF/A output (see e.g. here)?
The only advice I can give you is to use, if you are producing PDFs via LaTeX, the pdfa option of hyperref (see https://tex.stackexchange.com/a/456958):
\usepackage[pdfa]{hyperref}
Other resources on the creation (or at least on the preparation for the best possible conversion) of a PDF/A document, that I should properly document on the README: https://github.com/matteosecli/pdf2archive/issues/3
I'll keep this open for now as a reminder, but I currently have no solution. If at a certain point someone comes up with a proper way of flattening PDF annotations with Ghostscript, I can definitely take a look again.
Thank you for your comprehensive responses, research and testing.
LibreOffice's PDF/A export feature works fine as far as I know, but I only used it for creating the test file. I probably should have provided more context. I regularly convert a good number of PDF documents (mostly articles from academic journals, whose PDF creation process I cannot influence) to the PDF/A-1b standard. I use Adobe Acrobat's Preflight tool to check each document for compatibility and then convert it using the appropriate corrections and 'setting' the document to PDF/A-1b. I was looking for a simple, free and open-source tool to automate that so I can batch-convert PDF files.
I didn't know how Adobe Acrobat preserves the links, thanks for the explanation. Truthfully, I have always avoided reading up on the nitty gritty of PDF and the PDF/A standard once I realised how vast and complex it is. At least I now know what "flattening" means. ;) I still wonder why Adobe does not provide a tool which can automatically detect what to correct (and what not) in a PDF file for converting it to PDF/A. (Or is there such a software/tool?)
I used Ghostscript 9.27, so a version with the behavior you referred to. I also tried the -dPrinted=false setting and got the same unchanged end result.
If there is a solution at one point, I would appreciate it. But don't worry too much. Unfortunately my programming skills and my knowledge about all this are not at a point where I could help and try to solve the issue myself. Thanks again.
I managed to successfully convert your sample file to PDF/A and keep the hyperlinks at the same time.
I basically added an /F 4 flag to the annotations via the following sed command:
cat "pdf2archive-conversion-test.pdf" | LC_ALL=C sed 's:/Type/Annot:/Type/Annot/F 4:g' > "pdf2archive-conversion-test-flagged.pdf"
This is the processed file: pdf2archive-conversion-test-flagged.pdf
This is the file diff after processing:
$ colordiff -a pdf2archive-conversion-test.pdf pdf2archive-conversion-test-flagged.pdf
204c204
< <</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[56.693 670.089 263.007 683.889]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
---
> <</Type/Annot/F 4/Subtype/Link/Border[0 0 0]/Rect[56.693 670.089 263.007 683.889]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
209c209
< <</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[184.993 642.489 216.707 656.289]/Dest[4 0 R/XYZ 56.7 773.189 0]>>
---
> <</Type/Annot/F 4/Subtype/Link/Border[0 0 0]/Rect[184.993 642.489 216.707 656.289]/Dest[4 0 R/XYZ 56.7 773.189 0]>>
213c213
< <</Type/Annot/Subtype/Link/Border[0 0 0]/Rect[56.693 697.689 114.057 711.489]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
---
> <</Type/Annot/F 4/Subtype/Link/Border[0 0 0]/Rect[56.693 697.689 114.057 711.489]/A<</Type/Action/S/URI/URI(https://github.com/matteosecli/pdf2archive)>>
I've wrapped the sed command into a shell script that takes a PDF file as a single argument: makeannotprint.txt (remove the .txt extension after downloading)
$ ./makeannotprint yourfile.pdf
and generates a new file yourfile-flagged.pdf. Note that the script expects a single argument with a .pdf file extension, but it performs no checks whatsoever. What's more, the sed command appends the /F 4 flag whether or not the flag is already present, and it does so for every kind of annotation, not only for links. So that command is potentially destructive and I'm using it just as a proof of concept. 🙂
Additionally, this is definitely not the way to properly edit the internals of a PDF. Indeed, when processing the flagged file with pdf2archive --debug, Ghostscript complains about a broken PDF file (issues with the xref table) and emits a warning saying it'll try to repair the PDF, but the output file could have issues.
Nonetheless, if you then process the flagged file via
$ ./pdf2archive pdf2archive-conversion-test-flagged.pdf
it seems that the output is a valid PDF/A with working links: pdf2archive-conversion-test-flagged-PDFA.pdf
So in principle, with the caveat that this workaround can lead to unexpected results, you could first process your input file input.pdf with makeannotprint and then feed it to pdf2archive:
$ ./makeannotprint input.pdf
$ ./pdf2archive input-flagged.pdf input-PDFA.pdf
$ rm input-flagged.pdf
or even add the sed command to the conversion steps inside pdf2archive.
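Since the goal is batch conversion, the three commands above can be wrapped in a loop. This is only a sketch: batch_convert and the MAKEANNOTPRINT and PDF2ARCHIVE variables are names introduced here for illustration, and it assumes makeannotprint writes its -flagged.pdf output next to the input file.

```shell
#!/bin/sh
# Sketch of a batch wrapper around the workaround; not part of pdf2archive.
# MAKEANNOTPRINT and PDF2ARCHIVE are hypothetical variables pointing at the
# two scripts; adjust them to wherever the scripts actually live.
MAKEANNOTPRINT=${MAKEANNOTPRINT:-./makeannotprint}
PDF2ARCHIVE=${PDF2ARCHIVE:-./pdf2archive}

# batch_convert DIR: convert every *.pdf in DIR to DIR/<name>-PDFA.pdf
batch_convert() {
    dir=${1:-.}
    for input in "$dir"/*.pdf; do
        [ -e "$input" ] || continue                # directory has no PDFs
        case $input in
            *-flagged.pdf|*-PDFA.pdf) continue ;;  # skip earlier outputs
        esac
        base=${input%.pdf}
        # Assumed to write "$base-flagged.pdf" next to the input file.
        "$MAKEANNOTPRINT" "$input" || continue
        "$PDF2ARCHIVE" "$base-flagged.pdf" "$base-PDFA.pdf"
        rm -f "$base-flagged.pdf"
    done
}

# Usage: batch_convert /path/to/pdfs
```

The same caveats apply to every file in the batch, so it is worth spot-checking the outputs with a PDF/A validator.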
I could indeed add an experimental option to pdf2archive that automatically performs such a brutal edit of the input file, but I should first read up properly on the possible side effects and/or on a cleaner way to perform such an edit.
If you decide to pre-process your files via that brutal sed command and get any useful insights, or just have feedback on the results when converting multiple documents, please drop a line on this issue as that can be useful. 😉
Hi.
Thanks for sharing pdf2archive. I tested it and it worked on all the PDF documents I tried to convert. However, I noticed that hyperlinks and cross-reference links (e.g. links to other pages in the same document) did not work after converting. It seems as if such links get deleted during the conversion.
The following test files exemplify this better. Original file: pdf2archive-conversion-test.pdf Converted file: pdf2archive-conversion-test-PDFA.pdf
Do you know the reason for this behavior and how easy it would be to resolve it?
Best, Marc