Closed opoudjis closed 3 months ago
I've generated the PDF, and only one attachment is present in it: READY-20230316-no-toc-iso-10303-49.pdf
This attachment is encoded in the Presentation XML as:
<metanorma-extension>
...
<attachment name="READY-20230316-no-toc-iso-10303-49.pdf">data:application/pdf;base64,JVBERi
...
<p id="_bed0f9b3-394f-9910-dab9-8f46f0cb958b">Trial PDF document: <link target="_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf">10303-49/READY-20230316-no-toc-iso-10303-49.pdf</link>
...
<bibliography>
<references id="_bibliography" normative="false" obligation="informative" hidden="true" displayorder="9">
<title depth="1">Bibliography</title>
<bibitem id="attachment-10303-49-trial" hidden="true">
<formattedref format="application/x-isodoc+xml">[NO INFORMATION AVAILABLE]</formattedref>
<uri type="attachment">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
<uri type="citation">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
<docidentifier type="metanorma">[10303-49/READY-20230316-no-toc-iso-10303-49.pdf]</docidentifier>
</bibitem>
</references>
</bibliography>
Also, there are link elements that point to files which should also be attached to the PDF:
<p id="_3c1b569d-6058-5228-5c17-0c06c39a7da7">PDF document comparison report: <link target="10303-49-comparison-report.pdf"/>
...
<p id="_a9f03ffe-d062-d97b-a425-e9e45692f302">Annotated EXPRESS schema: <link target="10303-49/method_definition_schema/method_definition_schema.exp"/>
I need to update the XSLT for this case. To differentiate these from a link to an external entity like <link target="https://github.com/metanorma/iso-10303-detached-docs/issues/187"/>, I'll add a rule: if link/@target doesn't start with https, http, www or ftp, then @target points to a file that should be attached to the PDF.
Also, there are xref elements with the attachment- prefix:
<p id="_be27e7cc-b2c2-f0d7-8ccb-e2d32357c97f">Trial PDF document: <xref target="attachment-10303-50-trial">[attachment-10303-50-trial]</xref>
@opoudjis how should such an xref be processed? How can I determine that the xref points to a file instead of an internal id? By checking that @target starts with attachment-?
It is correct to only have 1 attachment. I can provide another file for you in which I have linked the attachments but they are not attached.
There are two types of links.
I think part of the problem is that not all the attachments that were supposed to be there were, so the links weren't properly generated. (That might even be the case in the large file I also sent.)
Since I am addressing both HTML and DOC, should link/target be the same as attachment/name, so that you know which attachment is which? Or is the current arrangement workable?
If you see an xref, it simply is not an attachment, because the attachment has not been loaded in: attachments are loaded in via the bibliography. If the attachment had been loaded in, it would be showing up as an eref => link. You can ignore xref as an error in the underlying markup.
There are two types of links.
- A link to an attachment. This is a link that will open an attachment in the PDF. In HTML, it will open an external file.
It's working in the PDF:
- An external link to whatever file, could be PDF, HTML, or any other format. In PDF it is only a path that will open a file in the file system.
It's working in the PDF also:
I can provide another file for you in which I have linked the attachments but they are not attached.
@ronaldtse yes, it would be helpful.
I'll investigate it.
The Presentation XML contains:
- the attachment READY-20230316-no-toc-iso-10303-49.pdf:
<metanorma-extension>...
<attachment name="READY-20230316-no-toc-iso-10303-49.pdf">data:application/pdf;base64,JVBER...
- a link with the reference _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf:
<link target="_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf">
- a bibitem with uri[@type="attachment"] = _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf:
...
<bibliography>
<references id="_bibliography" normative="false" obligation="informative" hidden="true" displayorder="9">
<title depth="1">Bibliography</title>
<bibitem id="attachment-10303-49-trial" hidden="true">
<formattedref format="application/x-isodoc+xml">[NO INFORMATION AVAILABLE]</formattedref>
<uri type="attachment">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
<uri type="citation">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
<docidentifier type="metanorma">[10303-49/READY-20230316-no-toc-iso-10303-49.pdf]</docidentifier>
</bibitem>
</references>
</bibliography>
I.e. there is no explicit relationship between the attachment READY-20230316-no-toc-iso-10303-49.pdf and the link reference _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf.
Therefore, the XSLT performs the following actions:
- take the input XML filename without the presentation.xml or .xml suffix, for instance document;
- add _ at the start and _attachments at the end: _document_attachments;
- if link/@target starts with _document_attachments/, take the string after _document_attachments/, i.e. READY-20230316-no-toc-iso-10303-49.pdf;
- link to the embedded file READY-20230316-no-toc-iso-10303-49.pdf.
The code:
<xsl:template match="*[local-name()='link']" name="link">
...
<xsl:when test="contains(@target, concat('_', $inputxml_filename_prefix, '_attachments'))">
<!-- link to the PDF attachment -->
<xsl:variable name="target_" select="translate(@target, '\', '/')"/>
<xsl:variable name="target__" select="substring-after($target_, concat('_', $inputxml_filename_prefix, '_attachments', '/'))"/>
<xsl:value-of select="concat('url(embedded-file:', $target__, ')')"/>
</xsl:when>
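To make the dependency on the input filename easier to see, here is a rough Python equivalent of the matching in the xsl:when branch above. It is a sketch only: the helper names filename_prefix and embedded_file_url are hypothetical, and the way $inputxml_filename_prefix is derived (stripping .presentation.xml or .xml) is assumed from the description, not taken from the stylesheet.

```python
from typing import Optional


def filename_prefix(input_xml_name: str) -> str:
    """Strip a .presentation.xml or .xml suffix (assumed derivation of the prefix)."""
    for suffix in (".presentation.xml", ".xml"):
        if input_xml_name.endswith(suffix):
            return input_xml_name[: -len(suffix)]
    return input_xml_name


def embedded_file_url(target: str, input_xml_name: str) -> Optional[str]:
    """Mirror the xsl:when branch: return the embedded-file URL, or None if no match."""
    folder = "_" + filename_prefix(input_xml_name) + "_attachments"
    normalized = target.replace("\\", "/")  # same as translate(@target, '\', '/')
    if folder not in normalized:            # same as contains(@target, ...)
        return None
    # partition() yields "" when the separator is missing, like substring-after()
    _, _, name = normalized.partition(folder + "/")
    return "url(embedded-file:" + name + ")"


if __name__ == "__main__":
    target = "_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf"
    print(embedded_file_url(target, "document.presentation.xml"))
    print(embedded_file_url(target, "test.presentation.xml"))
```

With the input named document.presentation.xml the target resolves to an embedded file; with any other input name the same target falls through and is treated as an external file.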
But if the input XML filename isn't document.presentation.xml or document.xml, this mechanism doesn't work, and link/@target will point to an external file.
So it looks like the input XML isn't document.presentation.xml.
I have to change the XSLT, but I don't yet clearly understand how.
@opoudjis the question: is _document_attachments/ a fixed prefix for attached files, or does it depend on the input adoc? I.e. for test.adoc, will the prefix in link/@target in the Presentation XML be _test_attachments/ or _document_attachments/?
I've found a second issue with links. If there is a comment note on the page, then none of the references work, i.e. they show as blue text without links (the mouse pointer doesn't change on mouse over). This issue isn't related to the XSLT; something is wrong in the PDFBox post-processing for notes.
Can I get back to this query on Monday? I'm going out of town for the weekend. The prefix is indeed _{document-name}_attachments/{attachment-name}, which is why I suggested above that I make the name attribute in the attachment the same as the target attribute in the link, so that you do know they are the same. Looks like that is the right thing to do.
@opoudjis ok.
I've found second issue with links. If there is a comment note on the page, then all references are not working,
Fixed in mn2pdf (https://github.com/metanorma/mn2pdf/releases/tag/v1.96).
I've updated common.xsl to process PDF attachments correctly when attachment/@name and link/@target aren't equal. @opoudjis so there is no need to fix it urgently.
I've found another bug. The attachments:
READY-20230316-no-toc-iso-10303-50.pdf
READY-20230316-no-toc-iso-10303-104.pdf
are broken. Adobe Acrobat shows an error when attempting to open them.
The content of both PDFs is truncated (it doesn't end with %%EOF).
The reason: the text content of the element
<attachment name="READY-20230316-no-toc-iso-10303-50.pdf">data:application/pdf;base64,...
is exactly 10000000 bytes. It looks like there is a 10 MB limit somewhere in the XML API. Ping @opoudjis.
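A quick way to check an attachment for this kind of truncation, independent of the XSLT, is to decode the data URI and look at the PDF markers. The following Python sketch assumes the attachment text is a data URI of the form shown above; the function names are hypothetical.

```python
import base64


def decode_data_uri(data_uri: str) -> bytes:
    """Decode the base64 payload of a data:...;base64,... URI."""
    header, _, payload = data_uri.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    payload = "".join(payload.split())                   # drop line breaks, if any
    payload = payload[: len(payload) - len(payload) % 4]  # tolerate a cut-off tail
    return base64.b64decode(payload)


def looks_complete_pdf(pdf_bytes: bytes) -> bool:
    """True when the bytes start with %PDF and end with %%EOF (ignoring trailing whitespace)."""
    return pdf_bytes.startswith(b"%PDF") and pdf_bytes.rstrip().endswith(b"%%EOF")


if __name__ == "__main__":
    pdf = b"%PDF-1.4\nbody\n%%EOF\n"
    uri = "data:application/pdf;base64," + base64.b64encode(pdf).decode("ascii")
    print(len(uri), looks_complete_pdf(decode_data_uri(uri)))
    print(looks_complete_pdf(decode_data_uri(uri[:-8])))  # simulated truncation
```

Printing the length of the attachment text alongside the check makes a round number such as 10000000 stand out immediately.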
Hm.
I'm going to fix the attachment link anyway, though it may make life more complicated for HTML.
The MB limit is a surprise to me, and I don't think it's my doing. I have recently imposed a 10 MB limit on images, but that should be resulting in crashes, and it should not be truncating. Will investigate.
The MB limit is indeed Nokogiri, even when I changed the code to append the string as a child. I am going to have to introduce linebreaks.
Odd that Nokogiri does not have this issue with XML attributes...
Nokogiri::XML(file, &:huge) might take care of it; I don't use it in standoc (to my surprise), though I do in metanorma collections. But having a 10 MB long line is asking for trouble anyway, so I will break it up into lines of 60 characters, per the older Base64 spec.
... Still didn't work... Having to add it one line at a time in Nokogiri.
Works. Will generate entire document and pass it to you.
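For reference, the line-breaking idea itself is easy to verify in isolation. This Python sketch is not the standoc code (which is Ruby and adds the lines through Nokogiri); it only shows that Base64 wrapped at 60 characters decodes back to the same bytes. The function name wrap_base64 is hypothetical.

```python
import base64


def wrap_base64(data: bytes, width: int = 60) -> str:
    """Base64-encode data and break the result into lines of at most `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    wrapped = wrap_base64(data)
    # b64decode discards the line breaks by default, so the round trip is lossless.
    print(base64.b64decode(wrapped) == data)
    print(max(len(line) for line in wrapped.splitlines()))
```

Any consumer of the attachment text then has to tolerate or strip the line breaks before decoding, which most Base64 decoders do by default.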
Very strange, Adobe Reader shows only 1 (first) page for 86Mb document.pdf. I'll investigate it.
mn2pdf ends with an error on my machine:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
or
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
The presentation XML size is 141Mb. I'll try to increase the max memory just for PDF generation.
I'm going to fix the attachment link anyway, though it may make life more complicated for HTML.
common.xsl has been updated to process the explicit link from xref/@target to attachment/@name.
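Once the target and the attachment name are identical, the lookup no longer depends on the input filename. The idea can be sketched in Python as below; this is an illustration under assumptions, not the common.xsl logic. It matches elements by local name (as the XSLT does with local-name()), checks both link and xref targets, and the function names and sample namespace are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the {namespace} part of an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def attachment_links(xml_text: str) -> dict:
    """Map each link/xref target to True when it equals some attachment/@name."""
    root = ET.fromstring(xml_text)
    names = {
        el.get("name")
        for el in root.iter()
        if local_name(el.tag) == "attachment"
    }
    return {
        el.get("target"): el.get("target") in names
        for el in root.iter()
        if local_name(el.tag) in {"link", "xref"} and el.get("target") is not None
    }


if __name__ == "__main__":
    sample = (
        '<doc xmlns="https://example.org/ns">'
        '<metanorma-extension><attachment name="a.pdf">data:application/pdf;base64,AAAA</attachment></metanorma-extension>'
        '<p><link target="a.pdf"/><link target="https://example.org/x"/></p>'
        "</doc>"
    )
    print(attachment_links(sample))
```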
Very strange, Adobe Reader shows only 1 (first) page for 86Mb document.pdf. I'll investigate it.
I don't understand why the PDF generated by @opoudjis contains only 1 page:
Works. Will generate entire document and pass it to you.
I've generated the PDF with the Java heap space increased to 5 GB, and can confirm that the PDF contains all the PDF attachments correctly.
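For anyone reproducing this, the heap size is raised with the standard JVM option -Xmx. A minimal Python sketch of building such a command line follows; the jar path and the application arguments are placeholders supplied by the caller (nothing here is taken from the mn2pdf command-line documentation), and build_java_command is a hypothetical helper.

```python
def build_java_command(jar_path, app_args, max_heap_gb=5):
    """Build a `java -Xmx<N>g -jar <jar> <args...>` argument list.

    jar_path and app_args are caller-supplied placeholders; only -Xmx and
    -jar are standard JVM launcher options.
    """
    if max_heap_gb < 1:
        raise ValueError("max_heap_gb must be at least 1")
    return ["java", f"-Xmx{max_heap_gb}g", "-jar", jar_path, *app_args]


if __name__ == "__main__":
    # The resulting list can be passed to subprocess.run(...) as-is.
    print(build_java_command("mn2pdf.jar", ["ARG1", "ARG2"]))
```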
mn2pdf ends with the error on my machine:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
or
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
The error occurs with the 141 MB Presentation XML, but the old 193 MB Presentation XML is processed correctly.
So, currently there is only one issue with Java heap space.
I don't understand why the PDF generated by @opoudjis contains only 1 page:
After a few attempts I've generated a PDF (86 MB) with 1 page. The log contains java.lang.OutOfMemoryError: Java heap space:
file:/D:/Work/Metanorma/XML/ISO/ISO198_Hyperlinks/iso.international-standard.xsl; Line #16976; Column #203; java.lang.OutOfMemoryError: Java heap space
file:/D:/Work/Metanorma/XML/ISO/ISO198_Hyperlinks/iso.international-standard.xsl; Line #16976; Column #203; java.lang.NullPointerException
...
Rendered page #1.
Bookmarks: Unresolved ID reference "_conclusion_3" found.
Bookmarks: Unresolved ID reference "_conclusion_4" found.
Bookmarks: Unresolved ID reference "_assembly_constraint_schema_schema" found.
...
Can't highlight the text ''.
Can't highlight the text ''.
Can't highlight the text ''.
...
Error parsing annotation information [null]. Annotation ignored
java.io.IOException: Error: wrong amount of numbers in attribute 'rect'
at org.apache.pdfbox.pdmodel.fdf.FDFAnnotation.<init>(FDFAnnotation.java:205)
at org.apache.pdfbox.pdmodel.fdf.FDFAnnotationText.<init>(FDFAnnotationText.java:67)
at org.apache.pdfbox.pdmodel.fdf.FDFDictionary.<init>(FDFDictionary.java:155)
at org.apache.pdfbox.pdmodel.fdf.FDFCatalog.<init>(FDFCatalog.java:63)
at org.apache.pdfbox.pdmodel.fdf.FDFDocument.<init>(FDFDocument.java:90)
at org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:241)
at org.metanorma.fop.annotations.Annotation.process(Annotation.java:260)
at org.metanorma.fop.PDFGenerator.runFOP(PDFGenerator.java:700)
at org.metanorma.fop.PDFGenerator.convertmn2pdf(PDFGenerator.java:493)
at org.metanorma.fop.PDFGenerator.process(PDFGenerator.java:311)
at org.metanorma.fop.mn2pdf.main(mn2pdf.java:350)
...
but the process didn't terminate abnormally, and a PDF with 1 page was generated. So this is exactly the Java heap space error.
So, currently there is only one issue with Java heap space.
common.xsl has been optimized, and the PDF is now generated successfully.
FYI @Intelligent2013 it has just run out of heap space on my side again, but IMO 100MB of PDF attachments are unreasonable to compile into a PDF to begin with...
@opoudjis could you share the Presentation XML to dropbox or similar? Thanks!
@opoudjis thank you! I also get Exception in thread "main" java.lang.OutOfMemoryError: Java heap space with the 144 MB Presentation XML, but the PDF for the previous version (141 MB) generates OK. I'll investigate it.
@opoudjis the issue Exception in thread "main" java.lang.OutOfMemoryError: Java heap space is fixed in the XSLT.
In https://github.com/metanorma/metanorma-standoc/issues/898 I have had to do some debugging of attachments, to make it possible to compile an Asciidoctor document with attachments outside of the working directory.
This has worked on HTML, with it finding the attachments now. But the PDF has stopped linking to attachments.
What is perplexing is
Which makes me suspect this is not a matter of my code, but of processing constraints on the PDF.
I am sending the 200 MB Presentation XML on Skype for you to look at. @ronaldtse will be able to send you different iterations of the document in question.