s-u / Cairo

R graphics device using cairo graphics library for creating high-quality output

pdf/a compatibility #38

Open iwelch opened 2 years ago

iwelch commented 2 years ago

[This is more a feature request than a bug report]

PDF/A is the archival version of PDF and is therefore better suited to long-term storage of a PDF file. Many professional offset printers require PDF/A files. When a PDF/A file includes even one graphic that is not PDF/A compliant, the file itself is no longer compliant. Ergo, any document that includes a Cairo R PDF is no longer PDF/A compliant.

For example, run this simple script on macOS:

library(Cairo)
CairoFonts( regular="Charter:style=Regular" )
pdf(file="testcairo.pdf")
plot( 1:10, 1:10 )
dev.off()

Then run the output through the free veraPDF compliance checker:

$ verapdf testcairo.pdf
<?xml version="1.0" encoding="utf-8"?>
<report>
  <buildInformation>
    <releaseDetails id="core" version="1.20.2" buildDate="2022-05-19T08:23:00-07:00"></releaseDetails>
    <releaseDetails id="validation-model" version="1.20.2" buildDate="2022-05-19T08:27:00-07:00"></releaseDetails>
    <releaseDetails id="gui" version="1.20.3" buildDate="2022-05-19T09:10:00-07:00"></releaseDetails>
  </buildInformation>
  <jobs>
    <job>
      <item size="4982">
        <name>/Users/ivo/verapdf/testcairo.pdf</name>
      </item>
      <validationReport profileName="PDF/A-1B validation profile" statement="PDF file is not compliant with Validation Profile requirements." isCompliant="false">
        <details passedRules="96" failedRules="5" passedChecks="684" failedChecks="6">
          <rule specification="ISO 19005-1:2005" clause="6.3.4" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
            <description>The font programs for all fonts used within a conforming file shall be embedded within that file, as defined in PDF Reference 5.8, 
            except when the fonts are used exclusively with text rendering mode 3</description>
            <object>PDFont</object>
            <test>Subtype == "Type3" || Subtype == "Type0" || renderingMode == 3 || fontFile_size == 1</test>
            <check status="failed">
              <context>root/document[0]/pages[0](7 0 obj PDPage)/contentStream[0](8 0 obj PDContentStream)/operators[107]/font[0](Helvetica)</context>
              <errorMessage>The font program is not embedded</errorMessage>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.7.3" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
            <description>If a document information dictionary does appear at a document, then all of its entries that have analogous properties in predefined XMP schemas, shall also be embedded in the file in XMP form with equivalent values.</description>
            <object>CosDocument</object>
            <test>doesInfoMatchXMP</test>
            <check status="failed">
              <context>root</context>
              <errorMessage>Some of document information dictionary entries' that have analogous properties in predefined XMP schemas do not embedded or have not equivalent values in the file in XMP form.</errorMessage>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.1.7" testNumber="2" status="failed" passedChecks="0" failedChecks="2">
            <description>The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence
            or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker</description>
            <object>CosStream</object>
            <test>streamKeywordCRLFCompliant == true &amp;&amp; endstreamKeywordEOLCompliant == true</test>
            <check status="failed">
              <context>root/indirectObjects[2](8 0)/directObject[0]</context>
              <errorMessage>Spacings of keywords 'stream' and 'endstream' do not comply PDF/A specification</errorMessage>
            </check>
            <check status="failed">
              <context>root/indirectObjects[4](6 0)/directObject[0]</context>
              <errorMessage>Spacings of keywords 'stream' and 'endstream' do not comply PDF/A specification</errorMessage>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.7.2" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
            <description>The document catalog dictionary of a conforming file shall contain the Metadata key.</description>
            <object>PDDocument</object>
            <test>metadata_size == 1</test>
            <check status="failed">
              <context>root/document[0]</context>
              <errorMessage>The document catalog dictionary doesn't contain metadata key.</errorMessage>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.1.3" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
            <description>The file trailer dictionary shall contain the ID keyword. The file trailer referred to is either the last trailer dictionary in a PDF file,
            as described in PDF Reference 3.4.4 and 3.4.5, or the first page trailer in a linearized PDF file, as described in PDF Reference F.2</description>
            <object>CosDocument</object>
            <test>(isLinearized == true &amp;&amp; firstPageID != null) || ((isLinearized != true) &amp;&amp; lastID != null)</test>
            <check status="failed">
              <context>root</context>
              <errorMessage>Missing ID in the document trailer</errorMessage>
            </check>
          </rule>
        </details>
      </validationReport>
      <duration start="1662856868494" finish="1662856868688">00:00:00.194</duration>
    </job>
  </jobs>
  <batchSummary totalJobs="1" failedToParse="0" encrypted="0" outOfMemory="0" veraExceptions="0">
    <validationReports compliant="0" nonCompliant="1" failedJobs="0">1</validationReports>
    <featureReports failedJobs="0">0</featureReports>
    <repairReports failedJobs="0">0</repairReports>
    <duration start="1662856868473" finish="1662856868701">00:00:00.228</duration>
  </batchSummary>
</report>

s-u commented 2 years ago

Cairo is just a front-end to cairographics, so we really have no control over the PDF it creates. The only control I see is cairo_pdf_surface_restrict_to_version, but that only allows choosing between PDF 1.4 and 1.5. Searching for cairographics and PDF/A doesn't yield any hits. It seems that some people mention using post-processing tools to convert PDF files into PDF/A, so perhaps that's a way?
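
As a rough illustration of the post-processing route, here is a sketch that calls Ghostscript from R. Nothing in this thread confirms that it produces a compliant file for Cairo output, so the result should be re-checked with veraPDF. Assumptions: Ghostscript is installed, the binary is called `gs`, and a `PDFA_def.ps` file (Ghostscript ships a template) has been edited to point at an ICC profile. `pdfa_args` and the output file name are hypothetical names used only for this example.

```r
# Build the Ghostscript arguments for a PDF -> PDF/A conversion attempt.
# `pdfa_def` is assumed to be a PDFA_def.ps file adapted to a local ICC profile.
pdfa_args <- function(infile, outfile, pdfa_def = "PDFA_def.ps") {
  c("-dPDFA=1",                        # target PDF/A-1
    "-dBATCH", "-dNOPAUSE",            # non-interactive run
    "-sDEVICE=pdfwrite",
    "-sColorConversionStrategy=RGB",
    "-dPDFACompatibilityPolicy=1",     # ignore features PDF/A cannot represent
    paste0("-sOutputFile=", outfile),
    pdfa_def, infile)
}

args <- pdfa_args("testcairo.pdf", "testcairo-pdfa.pdf")

# Only run the conversion when the binary and both input files are present.
gs <- Sys.which("gs")
if (nzchar(gs) && file.exists("PDFA_def.ps") && file.exists("testcairo.pdf")) {
  system2(gs, shQuote(args))
}
```

Keeping the argument construction in its own function makes it easy to inspect the command before running it, or to switch to `-dPDFA=2` if a later PDF/A part is the target.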

s-u commented 2 years ago

FWIW the closest discussion I found was https://github.com/Kozea/WeasyPrint/issues/630, but they solved the problem by ditching cairographics for PDF generation and replacing it with their own PDF generation tool, so I suspect there may not be as much hope as one would wish...

iwelch commented 2 years ago

thanks. would it be possible to file a suggestion with the upstream cairographics library? regards, /iaw

-- Ivo Welch, http://www.ivo-welch.info/, J. Fred Weston Distinguished Professor, UCLA Anderson

iwelch commented 1 year ago

actually, the upstream thinks the main problem comes from R --- how its routines are called.

there is one aspect that would be good and maybe fixable --- having an option to open the CairoPDF device without transparency. transparency is a pdf/a killer and otherwise fixable only by rasterization, which obviously ends up looking terrible.

s-u commented 1 year ago

@iwelch thanks, do you have a link to the upstream discussion? If it is a matter of some settings then it should be doable, but it would be good to get some details.

iwelch commented 1 year ago

I think I overstated my case, reading back the conversation from way back when.

https://gitlab.freedesktop.org/cairo/cairo/-/issues/588#note_1549144

iwelch commented 1 year ago

hi S --- without knowing much more about Cairo, apparently if you open the device with

CAIRO_FORMAT_RGB24

instead of

CAIRO_FORMAT_ARGB32

then a big piece of the PDF/A problem would be fixed. it presumably would no longer want to write transparent images. without knowing the CairoPDF code, I am guessing this one would be easy (as an option on the device, e.g. CairoPDF(..., rgba=FALSE)).

regards, /iaw

s-u commented 3 months ago

Those flags are only relevant to image surfaces, not PDF.

FWIW whether we use ARGB or RGB for image back-ends depends on a) the image type (e.g. jpeg is always RGB) and b) the background (for any solid background it is always RGB).

The only place where you can run into trouble is when using rasterImage in PDF, which at this point always uses RGBA. We could modify it to not use RGBA if alpha for all pixels is 0xff, but this will have no effect on PDFs without rasterImage (so your example above doesn't apply).
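
An R-level analogue of that check, as a sketch only: it tests whether every pixel of an image is fully opaque, which is the condition under which the alpha channel could be dropped. `is_fully_opaque` is a hypothetical helper, not part of Cairo, and it assumes the image is given as a character vector or matrix of colour strings; the actual change would have to be made in the package's C code.

```r
# Hypothetical helper (not part of the Cairo package): TRUE when every
# colour in `img` has alpha == 255 (0xff), i.e. the image carries no
# transparency and an RGB representation would lose nothing.
is_fully_opaque <- function(img) {
  # as.vector() drops the matrix dimensions so col2rgb() sees plain colour strings
  rgba <- grDevices::col2rgb(as.vector(img), alpha = TRUE)
  all(rgba["alpha", ] == 255L)
}

opaque <- matrix(c("#FF0000FF", "#00FF00FF", "#0000FFFF", "#FFFFFFFF"), nrow = 2)
mixed <- opaque
mixed[1, 1] <- "#FF000080"  # one half-transparent pixel

is_fully_opaque(opaque)
is_fully_opaque(mixed)
```

A user could run such a check on image data before passing it to rasterImage to see whether transparency is involved at all.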