speedata / publisher

speedata Publisher - a professional database Publishing system
https://www.speedata.de/
GNU Affero General Public License v3.0

handle `/Lang` more gracefully #434

Closed by pr-apes 1 year ago

pr-apes commented 1 year ago

@pgundlach,

#432 and #433 (sorry, but I don't know how to submit a single pull request) handle `/Lang` more gracefully:

  1. The German language isn't hardcoded.
  2. It is independent of <PDFOptions format="PDF/UA"/>.
  3. It depends on <Options mainlanguage=""/>.
  4. I don't know how to read the value of the --mainlanguage argument from the command line invocation.

I hope it helps.

pr-apes commented 1 year ago

@pgundlach.

I'm going to submit another pull request that supersedes both #432 and #433.

The --mainlanguage value is also processed in the new pull request.

I hope it helps.

pr-apes commented 1 year ago

#435 is the new pull request.

pr-apes commented 1 year ago

There is an issue with language identifiers: they contain hyphens and not underscores.

According to https://www.rfc-editor.org/rfc/rfc3066#section-2.1:

The syntax of this tag in ABNF [RFC 2234] is:

Language-Tag = Primary-subtag *( "-" Subtag )

Primary-subtag = 1*8ALPHA

Subtag = 1*8(ALPHA / DIGIT)

The productions ALPHA and DIGIT are imported from RFC 2234; they denote respectively the characters A to Z in upper or lower case and the digits from 0 to 9. The character "-" is HYPHEN-MINUS (ABNF: %x2D).
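The grammar quoted above can be checked mechanically. Below is a minimal sketch in Lua, assuming a hypothetical helper `is_rfc3066_tag` (it is not part of the publisher). It uses explicit ASCII character classes instead of `%a`/`%w` so the result does not depend on the locale.

```lua
-- Sketch: check a string against the RFC 3066 Language-Tag grammar.
-- is_rfc3066_tag is a hypothetical helper, not part of the publisher.
local function is_rfc3066_tag(tag)
    if type(tag) ~= "string" then
        return false
    end
    -- Primary-subtag = 1*8ALPHA
    local primary, rest = tag:match("^([A-Za-z]+)(.*)$")
    if not primary or #primary > 8 then
        return false
    end
    -- Every further subtag: "-" followed by 1*8(ALPHA / DIGIT)
    while rest ~= "" do
        local subtag, remainder = rest:match("^%-([A-Za-z0-9]+)(.*)$")
        if not subtag or #subtag > 8 then
            return false
        end
        rest = remainder
    end
    return true
end

print(is_rfc3066_tag("en-GB"))  --> true
print(is_rfc3066_tag("en_GB"))  --> false (underscore is not allowed)
```

This only checks the syntax; whether a viewer such as Acrobat recognizes a syntactically valid tag is a separate question.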

Although the PDF puts en-GB and es-MX as examples for locales, they don't seem to be recognized as main languages for the document. es-ES, en-US (or de-DE [but not de-AT or de-CH]) are valid values.

That being said, the default language is en_GB.

pgundlach commented 1 year ago

> Although the PDF puts en-GB and es-MX as examples for locales, they don't seem to be recognized as main languages for the document. es-ES, en-US (or de-DE [but not de-AT or de-CH]) are valid values.

This sounds as if it might be better to just use the first part, such as es or en?

pgundlach commented 1 year ago

It could be discussed if the /Lang attribute in the PDF catalog should be automatically set by the default language of the document.

This could be a solution:

  1. if the user does not set anything, put /Lang (en) in the catalog
  2. if the user sets a main language such as "German" or "de_DE", put /Lang (de) in the catalog.
  3. The user could override a language by setting <PDFOptions lang="..."/>

Does this sound ok?
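The three rules above could be sketched roughly as follows. `resolve_pdf_lang` and `catalog_entry` are hypothetical names used only for illustration, and the sketch assumes the main language has already been resolved to a locale string such as "de_DE" (or nil if nothing was set).

```lua
-- Hypothetical helper sketching the three-step proposal.
-- override: value of a possible PDFOptions lang attribute (may be nil)
-- locale:   locale of the main language, e.g. "de_DE" (may be nil)
local function resolve_pdf_lang(override, locale)
    if override and override ~= "" then
        return override                 -- rule 3: an explicit override wins
    end
    if locale and locale ~= "" then
        local primary = locale:match("^(%a+)")
        if primary then
            return primary              -- rule 2: "de_DE" becomes "de"
        end
    end
    return "en"                         -- rule 1: nothing set
end

-- Format the value as a PDF catalog entry.
local function catalog_entry(lang)
    return string.format("/Lang (%s)", lang)
end

print(catalog_entry(resolve_pdf_lang(nil, nil)))      --> /Lang (en)
print(catalog_entry(resolve_pdf_lang(nil, "de_DE")))  --> /Lang (de)
print(catalog_entry(resolve_pdf_lang("fr", "de_DE"))) --> /Lang (fr)
```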

pr-apes commented 1 year ago

I think there are different questions here:

  1. Acrobat doesn't recognize some languages (and maybe we can wait until Adobe fixes this [after all, it is their program that doesn't follow the spec]).
  2. The language identifiers contain hyphens instead of underscores (so es_ES or en_UK are not valid values for languages [as /Lang requires them]).
  3. It might not be a good idea to add a <PDFOptions lang="…" /> when <Options mainlanguage="…"> and `--mainlanguage` are already available.

In my opinion, the easiest way to avoid both issues (Acrobat not recognizing values with hyphens and invalid values with underscores) would be to read only the part before the hyphen or underscore (as you propose in your first two items [if I'm not missing your point]).

BTW, I don't know whether your reply here was written after https://github.com/speedata/publisher/pull/435#issuecomment-1253359473.

After writing this reply, I'm going to reply in #435.

pgundlach commented 1 year ago

I think this should be enough, unless I've missed something:

diff --git a/src/lua/publisher.lua b/src/lua/publisher.lua
index 76dd131a..114ce00c 100644
--- a/src/lua/publisher.lua
+++ b/src/lua/publisher.lua
@@ -1388,6 +1388,11 @@ function initialize_luatex_and_generate_pdf()
     if str then
         pdfcatalog[#pdfcatalog + 1] = str
     end
+    local langtbl = get_language(defaultlanguage)
+
+    if langtbl and langtbl.locale then
+        pdfcatalog[#pdfcatalog+1] = string.format(" /Lang (%s)",string.gsub(langtbl.locale,"^(%a+).*","%1"))
+    end

     local vp = {}
     if viewerpreferences.numcopies and viewerpreferences.numcopies > 1 and viewerpreferences.numcopies <= 5 then
@@ -1458,7 +1463,7 @@ function initialize_luatex_and_generate_pdf()
             pdfcatalog[#pdfcatalog + 1] = string.format("/OutputIntents %d 0 R",outputintentsarrayobjnum )
         end
         if options.format == "PDF/UA" then
-            pdfcatalog[#pdfcatalog + 1] = string.format("/Lang (de)  /MarkInfo <<  /Marked true >> ")
+            pdfcatalog[#pdfcatalog + 1] = string.format(" /MarkInfo <<  /Marked true >> ")
             metadataobjnum = pdf.obj({ type="stream", string = getuametadata(), immediate = true, attr = [[  /Subtype /XML /Type /Metadata ]],compresslevel = 0,})
             vp[#vp + 1] = "/DisplayDocTitle true"
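
To see what the pattern in the patch does in isolation, here is a small standalone sketch. `primary_subtag` is a hypothetical wrapper name; the pattern itself is the one from the diff. Note that `string.gsub` returns two values (the result and the number of substitutions), so the wrapper keeps only the first.

```lua
-- Same pattern as in the patch: keep the leading run of letters, drop the rest.
-- primary_subtag is a hypothetical name used only for this illustration.
local function primary_subtag(locale)
    -- The outer parentheses discard gsub's second return value (the count).
    return (string.gsub(locale, "^(%a+).*", "%1"))
end

for _, locale in ipairs({ "de_DE", "en-GB", "es" }) do
    print(locale, string.format("/Lang (%s)", primary_subtag(locale)))
end
```

Because `%a+` stops at the first non-letter, the same pattern works for both the underscore and the hyphen form.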

pr-apes commented 1 year ago

This is the way to go.

Many thanks for the implementation.