Attachments in PDF failing to hyperlink

opoudjis commented 3 months ago

In https://github.com/metanorma/metanorma-standoc/issues/898 I have had to do some debugging of attachments, to make it possible to compile an Asciidoctor document with attachments outside of the working directory.

This has worked on HTML, with it finding the attachments now. But the PDF has stopped linking to attachments.

What is perplexing is that:

- the commit where links worked for Ronald and the commit where they didn't generate an identical XML representation of the attachment
- I am compiling from the same commit, and the PDF I generate does not hyperlink

Which makes me suspect this is not a matter of my code, but of processing constraints on the PDF.

I am sending the 200 MB Presentation XML on Skype for you to look at. @ronaldtse will be able to send you different iterations of the document in question.

Intelligent2013 commented 3 months ago

I've generated the PDF, and only one attachment is present in it: READY-20230316-no-toc-iso-10303-49.pdf.

This attachment is encoded in the Presentation XML as:

    <metanorma-extension>
...
        <attachment name="READY-20230316-no-toc-iso-10303-49.pdf">data:application/pdf;base64,JVBERi
...

    <p id="_bed0f9b3-394f-9910-dab9-8f46f0cb958b">Trial PDF document: <link target="_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf">10303-49/READY-20230316-no-toc-iso-10303-49.pdf</link>
...
    <bibliography>
        <references id="_bibliography" normative="false" obligation="informative" hidden="true" displayorder="9">
            <title depth="1">Bibliography</title>
            <bibitem id="attachment-10303-49-trial" hidden="true">
                <formattedref format="application/x-isodoc+xml">[NO INFORMATION AVAILABLE]</formattedref>
                <uri type="attachment">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
                <uri type="citation">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
                <docidentifier type="metanorma">[10303-49/READY-20230316-no-toc-iso-10303-49.pdf]</docidentifier>
            </bibitem>
        </references>
    </bibliography>

There are also link elements that point to files which should be attached to the PDF as well:

    <p id="_3c1b569d-6058-5228-5c17-0c06c39a7da7">PDF document comparison report: <link target="10303-49-comparison-report.pdf"/>
    ...
    <p id="_a9f03ffe-d062-d97b-a425-e9e45692f302">Annotated EXPRESS schema: <link target="10303-49/method_definition_schema/method_definition_schema.exp"/>

I need to update the XSLT for this case. To distinguish these from links to an external entity such as <link target="https://github.com/metanorma/iso-10303-detached-docs/issues/187"/>, I'll add a rule: if link/@target doesn't start with https, http, www or ftp, then @target points to a file that should be attached to the PDF.
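A minimal Python sketch of the proposed rule, for illustration only (the real check would live in the XSLT). The function name `is_attachment_target` is hypothetical. Note that a plain prefix test like this would also treat `mailto:` targets, or a local file whose name happens to begin with `www` or `ftp`, the wrong way round.

```python
# Prefixes that mark a link target as external, as listed in the comment above.
EXTERNAL_PREFIXES = ("https", "http", "www", "ftp")


def is_attachment_target(target: str) -> bool:
    """Return True when link/@target should be treated as a file to attach.

    Hypothetical helper: a target is considered a file unless it starts
    with one of the external prefixes (case-sensitive, like XSLT starts-with).
    """
    return not target.startswith(EXTERNAL_PREFIXES)


if __name__ == "__main__":
    for t in (
        "10303-49-comparison-report.pdf",
        "https://github.com/metanorma/iso-10303-detached-docs/issues/187",
    ):
        print(t, "->", "attach" if is_attachment_target(t) else "external")
```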

There are also xref elements with the attachment- prefix:

    <p id="_be27e7cc-b2c2-f0d7-8ccb-e2d32357c97f">Trial PDF document: <xref target="attachment-10303-50-trial">[attachment-10303-50-trial]</xref>

@opoudjis how should I process such an xref? How can I determine that an xref points to a file rather than an internal id? By checking that @target starts with attachment-?

ronaldtse commented 3 months ago

It is correct to have only 1 attachment. I can provide another file in which I have linked the attachments, but they are not attached.

There are two types of links.

1. A link to an attachment. This is a link that will open an attachment in the PDF. In HTML, it will open an external file.
2. An external link to whatever file, could be PDF, HTML, or any other format. In PDF it is only a path that will open a file in the file system.

opoudjis commented 3 months ago

I think part of the problem is that not all the attachments that were supposed to be there were, so the links weren't properly generated. (That might even be the case in the large file I also sent.)

Since I am addressing both HTML and DOC, should link/target be the same as attachment/name, so that you know which attachment is which? Or is the current arrangement workable?

If you see an xref, it simply is not an attachment, because the attachment has not been loaded in: attachments are loaded in via the bibliography. If the attachment had been loaded in, it would be showing up as an eref => link. You can ignore xref as an error in the underlying markup.

Intelligent2013 commented 3 months ago

> There are two types of links.
>
> A link to an attachment. This is a link that will open an attachment in the PDF. In HTML, it will open an external file.

It's working in the PDF:

> An external link to whatever file, could be PDF, HTML, or any other format. In PDF it is only a path that will open a file in the file system.

It's working in the PDF also:

> I can provide another file for you that I have linked the attachments but they are not attached.

@ronaldtse yes, it would be helpful.

Intelligent2013 commented 3 months ago

from my PDF - the link points to the embedded object:

from PDF generated by @ronaldtse - the link points to the external file:

I'll investigate it.

Intelligent2013 commented 3 months ago

How the attachment mechanism currently works in the XSLT.

The Presentation XML contains:

attachment with name READY-20230316-no-toc-iso-10303-49.pdf:

  <metanorma-extension>...
    <attachment name="READY-20230316-no-toc-iso-10303-49.pdf">data:application/pdf;base64,JVBER...

the link with reference _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf:

   <link target="_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf">

bibitem with uri[@type="attachment"] = _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf

    ...
    <bibliography>
        <references id="_bibliography" normative="false" obligation="informative" hidden="true" displayorder="9">
            <title depth="1">Bibliography</title>
            <bibitem id="attachment-10303-49-trial" hidden="true">
                <formattedref format="application/x-isodoc+xml">[NO INFORMATION AVAILABLE]</formattedref>
                <uri type="attachment">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
                <uri type="citation">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
                <docidentifier type="metanorma">[10303-49/READY-20230316-no-toc-iso-10303-49.pdf]</docidentifier>
            </bibitem>
        </references>
    </bibliography>

I.e. there is no explicit relationship between the attachment READY-20230316-no-toc-iso-10303-49.pdf and the link reference _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf.

Therefore, the XSLT performs these steps:

1. Get the input XML name without the .presentation.xml or .xml suffix, for instance document.
2. Add _ at the start and _attachments at the end: _document_attachments.
3. If link/@target starts with _document_attachments/, take the string after _document_attachments/, i.e. READY-20230316-no-toc-iso-10303-49.pdf.
4. Add a link to the PDF embedded file READY-20230316-no-toc-iso-10303-49.pdf.

The code:

    <xsl:template match="*[local-name()='link']" name="link">
            ...
                <xsl:when test="contains(@target, concat('_', $inputxml_filename_prefix, '_attachments'))">
                    <!-- link to the PDF attachment -->
                    <xsl:variable name="target_" select="translate(@target, '\', '/')"/>
                    <xsl:variable name="target__" select="substring-after($target_, concat('_', $inputxml_filename_prefix, '_attachments', '/'))"/>
                    <xsl:value-of select="concat('url(embedded-file:', $target__, ')')"/>
                </xsl:when>
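
For readers less familiar with XSLT, the same logic can be sketched in Python. This only mirrors the steps and the `xsl:when` branch above. The function names are hypothetical, and the input is assumed to be a bare file name without a directory part.

```python
from typing import Optional


def attachments_prefix(input_xml_filename: str) -> str:
    """Build the _{name}_attachments prefix from the input XML file name."""
    name = input_xml_filename
    # Strip the longer suffix first so "document.presentation.xml" -> "document".
    for suffix in (".presentation.xml", ".xml"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return f"_{name}_attachments"


def embedded_file_url(target: str, input_xml_filename: str) -> Optional[str]:
    """Return the embedded-file URL for a link target, or None if the
    target does not contain the expected attachments prefix."""
    prefix = attachments_prefix(input_xml_filename)
    if prefix not in target:
        return None  # falls through to the other link cases in the XSLT
    normalized = target.replace("\\", "/")
    # Equivalent of XSLT substring-after(): text following "<prefix>/".
    _, _, rest = normalized.partition(prefix + "/")
    return f"url(embedded-file:{rest})"
```

With the input named `document.presentation.xml`, the target `_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf` resolves to an embedded-file URL. With any other input name, the same target returns `None`, which corresponds to the link being left as an external file reference.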

But if the input XML filename isn't document.presentation.xml or document.xml, this mechanism doesn't work, and link/@target will point to the external file. So it looks like the input XML isn't document.presentation.xml. I have to change the XSLT, but currently I don't clearly understand how.

@opoudjis the question: is _document_attachments/ a fixed prefix for attached files, or does it depend on the input adoc? I.e. for test.adoc, will the prefix in link/@target in the Presentation XML be _test_attachments/ or _document_attachments/?

I've found a second issue with links. If there is a comment note on the page, none of the references work, i.e. they show as blue text without links (the mouse pointer doesn't change on mouse-over). This issue isn't related to the XSLT; something is wrong in the PDFBox post-processing for notes.

opoudjis commented 3 months ago

Can I get back to this query on Monday? I'm going out of town for the weekend. The prefix is indeed _{document-name}_attachments/{attachment-name}, which is why I suggested above that I make the name attribute in the attachment the same as the target attribute in the link, so that you do know they are the same. Looks like that is the right thing to do.

Intelligent2013 commented 3 months ago

@opoudjis ok.

Intelligent2013 commented 3 months ago

> I've found second issue with links. If there is a comment note on the page, then all references are not working,

Fixed in mn2pdf (https://github.com/metanorma/mn2pdf/releases/tag/v1.96).

Intelligent2013 commented 3 months ago

I've updated common.xsl to process PDF attachments correctly when attachment/@name and link/@target are not equal. @opoudjis so no need to fix it urgently.

I've found another bug. These attachments are broken:

READY-20230316-no-toc-iso-10303-50.pdf
READY-20230316-no-toc-iso-10303-104.pdf

Adobe Acrobat shows an error when attempting to open them.

The content of both PDFs is truncated (neither ends with %%EOF).

The reason: the text content of the element <attachment name="READY-20230316-no-toc-iso-10303-50.pdf">data:application/pdf;base64,... is exactly 10000000 bytes. It looks like there is a 10 MB limit somewhere in the XML API. Ping @opoudjis.
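
A check of this kind can be sketched in Python: decode the data URI and look for the %%EOF marker near the end of the file. This is a rough illustration, not the tooling used in this thread. The function names are hypothetical, and the payload is assumed to be a single Base64 line with no embedded whitespace.

```python
import base64


def decode_data_uri(data_uri: str) -> bytes:
    """Decode a 'data:...;base64,<payload>' string into raw bytes."""
    header, _, payload = data_uri.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    # A payload cut at an arbitrary length may end mid-quad; drop the
    # dangling characters so the remaining text still decodes.
    payload = payload[: len(payload) - len(payload) % 4]
    return base64.b64decode(payload)


def pdf_is_truncated(pdf_bytes: bytes) -> bool:
    """Heuristic: a complete PDF has %%EOF within its last 1024 bytes."""
    return b"%%EOF" not in pdf_bytes[-1024:]


def payload_length(data_uri: str) -> int:
    """Length of the Base64 text, useful for spotting a round-number cutoff."""
    return len(data_uri.partition(",")[2])
```

A payload length of exactly 10000000 together with `pdf_is_truncated(...)` returning `True` would match the symptom described above.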

opoudjis commented 3 months ago

Hm.

I'm going to fix the attachment link anyway, though it may make life more complicated for HTML.

The MB limit is a surprise to me, and I don't think it's my doing. I have recently imposed a 10 MB limit on images, but that should be resulting in crashes, and it should not be truncating. Will investigate.

opoudjis commented 3 months ago

The MB limit is indeed Nokogiri, even when I changed the code to append the string as a child. I am going to have to introduce linebreaks.

Odd that Nokogiri does not have this issue with XML attributes...

opoudjis commented 3 months ago

Nokogiri::XML(file, &:huge) might take care of it; I don't use it in standoc (to my surprise), though I do in metanorma collections. But having a 10 MB long line is asking for trouble anyway, so I will break it up into lines of 60 characters, per the older Base64 spec.
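
To illustrate the line-breaking idea (in Python rather than the Ruby used by standoc, and not the actual fix), here is a small sketch that wraps Base64 output at 60 characters and decodes it again. The function names are hypothetical.

```python
import base64


def wrap_base64(data: bytes, width: int = 60) -> str:
    """Encode bytes as Base64 and break the text into lines of `width` chars."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


def unwrap_base64(text: str) -> bytes:
    """Decode Base64 text after removing line breaks and other whitespace."""
    return base64.b64decode("".join(text.split()))
```

Any consumer of the wrapped text has to strip or ignore the line breaks before decoding, which `unwrap_base64` does explicitly here.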

opoudjis commented 3 months ago

... Still didn't work... Having to add it one line at a time in Nokogiri.

opoudjis commented 3 months ago

Works. Will generate entire document and pass it to you.

Intelligent2013 commented 3 months ago

Very strange: Adobe Reader shows only 1 (the first) page for the 86 MB document.pdf. I'll investigate it.

Intelligent2013 commented 3 months ago

mn2pdf ends with the error on my machine:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

or

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The presentation XML size is 141Mb. I'll try to increase the max memory just for PDF generation.
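
The JVM's maximum heap is set with the standard `-Xmx` option. As a small sketch, a wrapper script could assemble the `java` command line like this. The helper name and the jar path are hypothetical, and the application's own arguments are left to the caller rather than guessed here.

```python
import re
from typing import List, Sequence


def java_command(jar_path: str, max_heap: str, app_args: Sequence[str]) -> List[str]:
    """Build a 'java -Xmx<size> -jar <jar> ...' argument list.

    max_heap is a JVM size string such as '2048m' or '5g'.
    """
    if not re.fullmatch(r"\d+[kKmMgG]?", max_heap):
        raise ValueError(f"not a valid JVM heap size: {max_heap!r}")
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path, *app_args]


if __name__ == "__main__":
    # The list can be passed to subprocess.run(...) to launch the JVM.
    print(java_command("mn2pdf.jar", "5g", []))
```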

Intelligent2013 commented 3 months ago

> I'm going to fix the attachment link anyway, though it may make life more complicated for HTML.

common.xsl has been updated to process the explicit link from xref/@target to attachment/@name.

> Very strange, Adobe Reader shows only 1 (first) page for 86Mb document.pdf. I'll investigate it.

I don't understand why the PDF generated by @opoudjis contains only 1 page:

> Works. Will generate entire document and pass it to you.

I've generated the PDF with the Java heap space increased to 5 GB, and can confirm that the PDF contains all the PDF attachments correctly.

> mn2pdf ends with the error on my machine:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> or
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The error occurs with the 141 MB Presentation XML, but the old 193 MB Presentation XML is processed correctly.

So, currently there is only one issue with Java heap space.

Intelligent2013 commented 3 months ago

> I don't understand why the PDF generated by @opoudjis contains only 1 page:

After a few attempts I've generated a PDF (86 MB) with 1 page. The log contains java.lang.OutOfMemoryError: Java heap space:

file:/D:/Work/Metanorma/XML/ISO/ISO198_Hyperlinks/iso.international-standard.xsl; Line #16976; Column #203; java.lang.OutOfMemoryError: Java heap space
file:/D:/Work/Metanorma/XML/ISO/ISO198_Hyperlinks/iso.international-standard.xsl; Line #16976; Column #203; java.lang.NullPointerException
...
Rendered page #1.
Bookmarks: Unresolved ID reference "_conclusion_3" found.
Bookmarks: Unresolved ID reference "_conclusion_4" found.
Bookmarks: Unresolved ID reference "_assembly_constraint_schema_schema" found.
...
Can't highlight the text ''.
Can't highlight the text ''.
Can't highlight the text ''.
...
Error parsing annotation information [null]. Annotation ignored
java.io.IOException: Error: wrong amount of numbers in attribute 'rect'
        at org.apache.pdfbox.pdmodel.fdf.FDFAnnotation.<init>(FDFAnnotation.java:205)
        at org.apache.pdfbox.pdmodel.fdf.FDFAnnotationText.<init>(FDFAnnotationText.java:67)
        at org.apache.pdfbox.pdmodel.fdf.FDFDictionary.<init>(FDFDictionary.java:155)
        at org.apache.pdfbox.pdmodel.fdf.FDFCatalog.<init>(FDFCatalog.java:63)
        at org.apache.pdfbox.pdmodel.fdf.FDFDocument.<init>(FDFDocument.java:90)
        at org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:241)
        at org.metanorma.fop.annotations.Annotation.process(Annotation.java:260)
        at org.metanorma.fop.PDFGenerator.runFOP(PDFGenerator.java:700)
        at org.metanorma.fop.PDFGenerator.convertmn2pdf(PDFGenerator.java:493)
        at org.metanorma.fop.PDFGenerator.process(PDFGenerator.java:311)
        at org.metanorma.fop.mn2pdf.main(mn2pdf.java:350)
...

but the process didn't end abnormally, and the PDF was generated with 1 page. So this is exactly the Java heap space error.
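
Since the process can exit normally and still leave an incomplete PDF, one defensive option is to scan the captured log for the error marker. The following is only a sketch of that idea. The function name is hypothetical, and it assumes the log text has already been captured as a string.

```python
from typing import List

# Marker taken from the log excerpt above.
HEAP_ERROR_MARKER = "java.lang.OutOfMemoryError"


def find_heap_errors(log_text: str) -> List[str]:
    """Return the log lines that mention an OutOfMemoryError."""
    return [line for line in log_text.splitlines() if HEAP_ERROR_MARKER in line]


def pdf_is_suspect(log_text: str) -> bool:
    """True when the log suggests the PDF may be incomplete."""
    return bool(find_heap_errors(log_text))
```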

> So, currently there is only one issue with Java heap space.

common.xsl has been optimized, and the PDF is now generated successfully.

opoudjis commented 3 months ago

FYI @Intelligent2013 it has just run out of heap space on my side again, but IMO 100MB of PDF attachments are unreasonable to compile into a PDF to begin with...

Intelligent2013 commented 3 months ago

> FYI @Intelligent2013 it has just run out of heap space on my side again, but IMO 100MB of PDF attachments are unreasonable to compile into a PDF to begin with...

@opoudjis could you share the Presentation XML via Dropbox or similar? Thanks!

Intelligent2013 commented 3 months ago

@opoudjis thank you! I also get Exception in thread "main" java.lang.OutOfMemoryError: Java heap space with the 144 MB Presentation XML, but the PDF for the previous version (141 MB) generates fine. I'll investigate it.

Intelligent2013 commented 3 months ago

@opoudjis the issue Exception in thread "main" java.lang.OutOfMemoryError: Java heap space is fixed in the XSLT.

metanorma / metanorma-standoc

Attachments in PDF failing to hyperlink #900
