latex3 / babel

The babel system for LaTeX, LuaLaTeX and XeLaTeX
LaTeX Project Public License v1.3c

wish: set Language metadata-field in the PDF automatically if the main language is explicitly set in the LaTeX document and pdflang is not supplied #244

Closed ghost closed 1 year ago

ghost commented 1 year ago

After about 20 years of semi-ignorant LaTeXing, I discovered the document-language meta-field in the PDF Catalog and the option pdflang of hyperref, which leads me to the following wish.

Some LaTeX documents set the main language explicitly. Examples:

1.

\documentclass[ngerman]{article}
\usepackage{babel}
\usepackage{hyperref}
…

2.

\documentclass{article}
\usepackage[ngerman]{babel}
\usepackage{hyperref}
…

3.

\documentclass[french,USenglish,ngerman,deutsch]{svmono}
\usepackage[french,USenglish,main=ngerman]{babel}
\usepackage{hyperref}
…

We ask that in such clear cases (namely, if the main language of the LaTeX document is explicitly set AND the pdflang=… option of hyperref is missing), the /Lang meta-data field of the PDF Catalog be set automatically according to the main language of the LaTeX document. In the examples above, the value of this field would be de (as far as I understand the PDF specification). As of now, the documentation of hyperref says that the default value is \relax, which means that the /Lang field simply does not appear in the metadata. There's no need to keep it this way if the language is that clear.
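For comparison, this is what an author has to write by hand today using the pdflang option mentioned above. It is only a minimal sketch: the tag de-DE is an illustrative assumption for ngerman, and a plain de would serve equally well.

```latex
\documentclass[ngerman]{article}
\usepackage{babel}
% pdflang takes a language tag; de-DE is an illustrative choice for ngerman
\usepackage[pdflang=de-DE]{hyperref}
\begin{document}
Hallo Welt!
\end{document}
```

The wish is that the pdflang part becomes unnecessary whenever the main language is already unambiguous.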

The final word on whether the task of setting /Lang should lie with babel or hyperref (if we were to choose between these two options only; cf. https://github.com/latex3/babel/issues/244#issuecomment-1548717497) is, of course, with the developers. My take is that, although the pdflang option technically belongs to hyperref at this moment, the document language field in the PDF Catalog has nothing to do with the hyper part of hypertext and is thus probably better handled by babel (the localization and internationalization package).

Crosspost: http://github.com/latex3/hyperref/issues/280 .

Gratefully,

AlMa

u-fischer commented 1 year ago

If the pdfmanagement is loaded (with \DocumentMetadata), /Lang is already set to English by default, and it will shortly (probably at the next update of the pdfresource-testphase) make use of the new bcpdata interface of the language packages and set the value to the main language of the document.

Without the pdfmanagement it is not really possible to set /Lang automatically: it would clash with a manual setting in documents, you could end up with two /Lang entries in the catalog, and that is invalid. In this case it is the responsibility of the author to add it.

ghost commented 1 year ago

@u-fischer Thx! Good news!

Without the pdfmanagement it is not really possible to set /Lang automatically: it would clash with a manual setting in documents, you could end up with two /Lang entries in the catalog, and that is invalid. In this case it is the responsibility of the author to add it.

That's why I wrote above “if […] the pdflang=… option of hyperref is missing”. The absence of a package option such as this one is probably testable. Of course, the author might hypothetically try to set /Lang manually in some other way (but no such other way is known to me at the moment).

jbezos commented 1 year ago

I was working on it, based on https://github.com/latex3/latex2e/pull/1036#issuecomment-1513506903 . However, I was wondering if there is a way to check if the language has been manually set, which should take precedence. Or maybe it should be set by pdfmanagement directly based on the \BCPdata? @u-fischer?

u-fischer commented 1 year ago

Of course, the author might hypothetically try to set /Lang manually in some other way (but such another way is not known to me at the moment).

@AlMa0r Well, hyperref doesn't add the entry by magic; it uses the relevant engine primitives. They are no secret, and naturally authors and other packages can use them too. hyperref and babel can't catch that, as it could have happened before they are even loaded.
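To make the point about engine primitives concrete, here is a minimal sketch of how a document or package could write /Lang itself, without hyperref or babel. It assumes direct PDF output and uses the iftex package to pick the engine; the value de is just an example.

```latex
\documentclass{article}
\usepackage{iftex}
% Assumes PDF output; each branch adds /Lang to the PDF Catalog directly.
\ifxetex
  % XeTeX: a dvipdfmx special, delayed until the document has started
  \AtBeginDocument{\special{pdf:put @catalog << /Lang (de) >>}}
\else\ifluatex
  % LuaTeX: the catalog extension
  \pdfextension catalog {/Lang (de)}
\else
  % pdfTeX: the \pdfcatalog primitive
  \pdfcatalog{/Lang (de)}
\fi\fi
\begin{document}
Text
\end{document}
```

Nothing records that this has happened, which is why a package loaded later cannot reliably detect it and may add a second /Lang entry.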

@jbezos

Well, personally I think that the main language is such an important piece of metadata that it should be set by the author in \DocumentMetadata with the lang key. babel could then check whether it matches the main language set with the babel options, and the pdfmanagement could check at the end of the document whether it matches \BCPdata{language.main} (or however this is called).
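A minimal sketch of the two values such a check would compare. \GetDocumentProperties{document/lang} is the interface shown later in this thread; the field name main.tag.bcp47 for \BCPdata is an assumption based on the babel manual and may differ between versions.

```latex
\DocumentMetadata{lang=de-DE}% the author states the main language once
\documentclass{article}
\usepackage[ngerman]{babel}
\begin{document}
% Both lines typeset a language tag, so a mismatch is visible in the output.
Metadata: \GetDocumentProperties{document/lang}

% Assumed field name; "main." is meant to select babel's main language.
babel: \BCPdata{main.tag.bcp47}
\end{document}
```

An actual consistency check would compare the two tags programmatically and issue a warning rather than typeset them.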

jbezos commented 1 year ago

But is there something like \pdfmanagement_if_set:xx{Catalog}{Lang}? I couldn't find anything in the manual, but maybe I haven’t searched well (I’ve found \pdfdict_if_exist, but it seems to be low-level).

u-fischer commented 1 year ago

Lang is always set (by default to en-US). You can retrieve the current value with

\DocumentMetadata{lang=fr} 
\documentclass{article} 
\begin{document} 
\ShowDocumentProperties 
\GetDocumentProperties{document/lang} 
\end{document}

jbezos commented 1 year ago

👍 Setting en-US explicitly in \DocumentMetadata and then loading \usepackage[british]{babel} seems pointless, so I presume we can live with it 🙂.

ghost commented 1 year ago

@u-fischer I understand. Concerning /Lang, is it really wise to always set a default value if there's no explicit one? In a lorem-ipsum kind of document, this should be Latin or absent, and in a wordless tikz/pstricks drawing (e.g., http://tex.stackexchange.com/questions/685753 or http://tex.stackexchange.com/questions/685761), the language should probably be intentionally absent so as not to confuse, say, search engines that look for content written (or not written) in a particular language. Truly multilingual documents (say, dictionaries or translations with the left column for language A and the right column for language B) constitute another good example: an author might intentionally NOT favor one of the languages of his/her multilingual document over the other language(s). An extreme example is the work of bilingual writers, such as Nabokov.

Having said this, I think now there might even be a need to differentiate between intentionally undefined and undefined so that LaTeX makes a clever guess on the conceptual level. In the first case, the packages, including pdfmanagement, should NOT attempt to guess the language, and in the second case, they might try to make a clever guess.

However, all these thoughts concern the ambiguous case. The wish in my original post concerns, on the contrary, the clear case, in which the author expresses a definite intention and sets exactly one main language in an unambiguous and clear way. To handle things one at a time and keep them simple, I suggest that this thread concentrate on the clear case.

@jbezos As for deciding whether (or how) the pdflang option has been set by hyperref, this seems easy, too:

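\makeatletter
% Report at \begin{document} whether hyperref's pdflang option was given.
\AtBeginDocument{%
  \ifdefined\@pdflang
    \ifx\@pdflang\relax
      % the default according to the hyperref documentation: no /Lang entry
      \typeout{pdflang is not set}%
    \else
      \typeout{pdflang is set to \@pdflang}%
    \fi
  \else
    % \@pdflang does not exist, e.g. because hyperref is not loaded
    \typeout{pdflang is undefined}%
  \fi
}
\makeatother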

I think that a conflict between en-US and british warrants a warning or an error.

jbezos commented 1 year ago

I think now there might even be a need to differentiate between intentionally undefined and undefined so that LaTeX does a clever guess on the conceptual level.

Undefined... or defined, which was my point. Anyway, imo leaving this field unfilled is not a good idea. The lack of an explicit language might be intended, but might be just a mistake. There are tags for an undefined language (und, which is what babel sets for the nil language), non-linguistic content (zxx) and multiple languages (mul). Note, however, that the IETF prefers omitting the 'und' tag except if required.

u-fischer commented 1 year ago

Concerning /Lang, is it really wise to always set a default value if there's no explicit one?

We have decided that /Lang should be set by default. English has been chosen as the default because a standard LaTeX document uses English hyphenation patterns and English words. If authors want something else, they will have to overwrite this (and the pdfmanagement also allows removing it again).

Please note that the lang key in \DocumentMetadata is not only meant for the PDF catalog; it should also inform other packages about the main language. That's why it should be in sync with the language as set with babel or polyglossia.
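As a rough sketch of the removal mentioned above: the pdfmanagement documentation describes \pdfmanagement_remove:nn for deleting an entry from a managed dictionary. That this call is the intended way to drop /Lang, and that nothing re-adds the entry later in the run, are assumptions that have not been checked here.

```latex
\DocumentMetadata{}% loads the pdfmanagement; /Lang gets its default value
\documentclass{article}
\ExplSyntaxOn
% Assumption: removing the managed Catalog entry leaves the PDF without /Lang.
\pdfmanagement_remove:nn {Catalog} {Lang}
\ExplSyntaxOff
\begin{document}
A drawing without words.
\end{document}
```

Note that this only affects the PDF catalog; other packages that read the lang key would still see the default.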

ghost commented 1 year ago

I think now there might even be a need to differentiate between intentionally undefined and undefined so that LaTeX does a clever guess on the conceptual level.

Undefined... or defined, which was my point. Anyway, imo leaving this field unfilled is not a good idea. The lack of an explicit language might be intended, but might be just a mistake. There are tags for an undefined language (und, which is what babel sets for the nil language), non-linguistic content (zxx) and multiple languages (mul). Note, however, that the IETF prefers omitting the 'und' tag except if required.

I didn't know this earlier. Then, in the wordless example I mentioned, the tag zxx could be the best default. For dictionaries, translations and multi-language novels, consider mul. English was probably justified only historically as the default for LaTeX, and the time might be ripe for a change. However, again, let's rather talk about the value of /Lang in the clear cases here in this thread.
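With the lang key of \DocumentMetadata discussed above, declaring such a special tag would presumably look like the following sketch. It assumes that the key passes the tag through unchanged; how other packages react to zxx or mul as a main language has not been checked.

```latex
% zxx marks non-linguistic content; use mul for a truly multilingual document.
\DocumentMetadata{lang=zxx}
\documentclass{article}
\begin{document}
\rule{3cm}{1cm}% stands in for a wordless drawing
\end{document}
```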

jbezos commented 1 year ago

@u-fischer In my tests \GetDocumentProperties{document/lang} returns the value set in \DocumentMetadata, not the actual language to be declared in the /Lang field:

\DocumentMetadata{lang=de}
\documentclass{article}
\pdfcompresslevel=0
\ExplSyntaxOn
\pdfmanagement_add:xxx{Catalog}{Lang}{en}
\ExplSyntaxOff
\begin{document}
\GetDocumentProperties{document/lang}
\end{document}

prints ‘de’, but the pdf file rightly says:

/Type /Catalog
/Pages 11 0 R
/Lang en/Metadata 5 0 R

I find this mismatch somewhat counterintuitive.

u-fischer commented 1 year ago

Yes, you would have to update the document properties if you set it manually (and you need parentheses around the value). But as I said: imho the DocumentMetadata value should take precedence; you shouldn't change it, but issue a warning that the languages don't match.
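A sketch of the test document with both corrections applied. The parentheses make the value a PDF string; \AddToDocumentProperties with the label document and the property lang is assumed to be the matching setter for \GetDocumentProperties{document/lang}, which should be verified against the current LaTeX documentation.

```latex
\DocumentMetadata{lang=de}
\documentclass{article}
\ExplSyntaxOn
% Parentheses make the value a PDF string: /Lang (en)
\pdfmanagement_add:nnn {Catalog} {Lang} {(en)}
\ExplSyntaxOff
% Assumed setter: keeps the stored document property in sync with the catalog
\AddToDocumentProperties[document]{lang}{en}
\begin{document}
\GetDocumentProperties{document/lang}
\end{document}
```

This only illustrates the mechanics; the recommendation in this comment is not to override the \DocumentMetadata value at all.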

jbezos commented 1 year ago

Since I still don't have a clear idea of what to do, I'm going to give up for now. I'll close the issue and move it to the Enhancement Requests, to resume it in the future.

ghost commented 1 year ago

@jbezos At the risk of splitting hairs, you probably want to say that you don't know what to do exactly. In case you do know this, you might wish to say that you don't know how to do it. (To be fair, I also do not know what to do exactly or how to do it.)

As for what to do in abstract, high-level terms, I'll try again to clarify: if the author sets exactly one main language in the input document, in exactly one fashion or in compatible fashions (e.g., as class or babel-package options), and does not invoke pdflang=…, we wish /Lang in the PDF Catalog to be set automatically (e.g., via pdfmanagement).

jbezos commented 1 year ago

@AlMa0r I usually reason with examples like this: “A French university has published a class, which sets French as the main language. However, it can be used with other languages”, and then I analyze several options wrt babel, document metadata, and the like. In other words, it’s a mixture of what, how, where and, above all, why. I have to re-read the manuals and the code, consider other possible cases, etc. I thought it was straightforward (I’m optimistic by nature), but I’ve realized it’s not. I also have to think about what @u-fischer has said about “imho the DocumentMetadata value should take precedence” (which usually seems sensible, but always?).

ghost commented 1 year ago

@jbezos I see. By the way, pdfmanagement and hyperref together seem to have unstable semantics so far (e.g., the documentation of version 0.95t said that pdflang is deprecated, and the documentation of version 0.95x no longer says this). So you might have some influence on the specification of the PDF management code, i.e., on what this code should do. On the issue of the default language value, cf. http://github.com/latex3/pdfresources/issues/51 .