Upcasing in Greek no longer drops diacritics

gmilde commented 1 year ago

Greek letters drop diacritics (eccept dialytika and sub-iota) in UPPERCASE. This is not cared for by the Unicode standard (except with a reference implementation for a Greek-specific upcasing algorithm https://icu.unicode.org/design/case/greek-upper). Polyglossia uses the \uccode re-definitions from "xgreek.sty" to achieve this effect. However, since the new implementation in the 2022/06 LaTeX release, these re-definitions are ignored by \MakeUppercase.

\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Serif}

\usepackage{polyglossia}
\setmainlanguage[variant=poly]{greek}
% if works still with Babel
% \usepackage[polutoniko.greek]{babel}

\begin{document}
 άδικος, κείμενο, ίριδα → \MakeUppercase{άδικος, κείμενο, ίριδα}
\end{document}

Somehow a "locale setting" by the Babel core activates Greek upcasing rules while by default the common upcasing rules from Unicode are used for Greek literals. Unfortunately, there is no documented API that I know of.

jspitz commented 1 year ago

Note that \MakeUppercase[locale=el]{άδικος, κείμενο, ίριδα} works. The reason why babel works, AFAIU, is not due to a change in babel, but due to an incomplete kernel change which is meanwhile using expl3's \text_uppercase:nn and friends in the legacy commands and currently only checks for babel, not (yet) for polyglossia.

See ltnews36 which states how:

The commands \MakeUppercase, \MakeLowercase
and \MakeTitlecase now automatically detect the
locale currently in use when babel is loaded. This
allows automatic adjustment of letter mappings where
appropriate. They also accept a leading optional
argument. This accepts a key–value list of control
settings. At present, there is one key available: locale,
which can also be accessed via the alias lang. This is
intended to allow local setting of the language, which
can be done using a BCP-47 descriptor. For example,
this could be used to force Turkish case changing in
otherwise English input
\MakeUppercase[lang = tr]{Ragıp Hulûsi Özdem}
yields RAGIP HULÛSİ ÖZDEM.

Now the question is whether it is planned to also detect the locale when polyglossia is loaded. If not, we need to roll our own definitions à la:

% Roll a locale-aware \MakeUppercase version that handles diacritics correctly (#581)
% This employs the expl3 case changer
\ExplSyntaxOn
   \cs_gset:Npn \MakeXPGGreekUppercase #1 { \text_uppercase:nn { el } {#1} }
\ExplSyntaxOff

% Save original \MakeUppercase definition
\let\xpg@save@MakeUppercase\MakeUppercase

\def\blockextras@greek{%
  \let\MakeUppercase\MakeXPGGreekUppercase%
}

\def\noextras@greek{%
   % restore original \MakeUppercase definition
   \let\MakeUppercase\xpg@save@MakeUppercase%
}

Of course I would very much prefer if we could avoid all the necessary ad hoc definitions in our gloss files. @josephwright are there serious reasons for the neglection of polyglossia or is this planned for later anyway?

jspitz commented 1 year ago

BTW @josephwright it looks like the following addition (ensuing the respective babel test in the kernel) will do:

    \@ifpackageloaded { polyglossia }
       {
    \@ifpackagelater { polyglossia } { 2020-01-29 }
    {
        \cs_gset_protected:Npn \@@text@case@aux@
        {
            \str_set:Nx \reserved@a
            { \csuse{bcp47id} }
        }
    }
    { }
    }
    { }

jspitz commented 1 year ago

(I'm using \csuse{bcp47id} here since \languageid is not expandable. We can also provide an expandable version, but it does not strike me necessary here).

u-fischer commented 1 year ago

@jspitz

It would be better if polyglossia would provide an expandable \localeinfo* { tag.bcp47 } too, so that the kernel doesn't have to test for external packages. Having a common interface shared by both packages would also allow to remove the test for babel and define a dummy in the kernel.

jspitz commented 1 year ago

I don't understand why we would need to emulate the babel syntax (which is newer than polyglossia's)

u-fischer commented 1 year ago

I don't understand why we would need to emulate the babel syntax (which is newer than polyglossia's)

well Joseph asked last year for a common interface for the BCP-data.

But Javier just reminded me that after this discussion babel added another interface. If polyglossia would implement that too, the kernel could use it in the next release.

\documentclass{article}

\usepackage[austrian]{babel}

\begin{document}

(\BCPdata{language}) (\BCPdata{script}) (\BCPdata{region})

\end{document}

jspitz commented 1 year ago

Forget my request. I'll fix this in polyglossia.

u-fischer commented 1 year ago

if you mean that you want to redefine \MakeUppercase: Apart from the fact that I think that patching kernel commands like this isn't a good idea, be aware that hyperref relies for the bookmarks on an expandable interface to the bcp tag too (currently it still uses \localeinfo, but I will switch soon to \BCPtag too).

jspitz commented 1 year ago

Fixed downstream.

jspitz commented 1 year ago

if you mean that you want to redefine \MakeUppercase: Apart from the fact that I think that patching kernel commands like this isn't a good idea, be aware that hyperref relies for the bookmarks on an expandable interface to the bcp tag too (currently it still uses \localeinfo, but I will switch soon to \BCPtag too).

\csuse{bcp47id} is expandable and works well for the hyperref case. But since you seem to be unwilling to support everything else than babel syntax, I am going to patch into \MakeUppercaseUnsupportedInPdfStrings and \MakeLowercaseUnsupportedInPdfStrings as well.

josephwright commented 1 year ago

Of course I would very much prefer if we could avoid all the necessary ad hoc definitions in our gloss files. @josephwright are there serious reasons for the neglection of polyglossia or is this planned for later anyway?

There's no intention to neglect polyglossia, but to explain the current situation, the background as I see it.

The change of approach to case changing brought in the idea of locale to a kernel function for the first time. This raised the opportunity to have a fixed kernel definition and to then pick up the locale set by packages. However, as there is currently no generally-agreed kernel interface for the the locale data, we couldn't just 'make a change'.

I therefore looked at what babel and polyglossia offer in terms of interface for BCP-47 info: I was hoping for a common one that we could use in the kernel. Again, there isn't a current agreed common interface so I looked at what there is in both packages. As has been mentioned already, we ideally need something expandable, or at least returning a 'string' containing the BCP-47 info (or a proxy to it). In babel, there is a documented interface that does that, but for polyglossia there is not currently: I know \bcp47id is available but that's an internal data store as far as I can make out.

Ideally, the kernel shouldn't be checking for any packages at all, so we did have quite a lively discussion about whether adding code that is dependent on babel is 'acceptable'. We've gone with the current approach as a temporary measure as babel has a documented interface for recovering BCP-47 data (including decomposition) (and so if 'temporary' sticks we are using something documented).

However, the aim remains to find a way toward an agreed single interface that both babel and polyglossia can use to make the information available. From my point of view, it would be ideal if the two could agree on a common data structure too, such that loading both would mean that setting the locale from either would apply to both, but I suspect at present that is too much of an ask.

With a common interface, the kernel can then avoid having any package dependence at all: we set up the interface in the kernel, babel and polyglossia can then re-define it to provide the data in an appropriate way (or if we can agree a common data store, they can populate the data store and the kernel interface could be left alone).

In terms of a common interface, I think many things are possible. One could for example imagine documenting \bcp47id and arranging for babel to populate it. Or one could take babel's \BCPdata{<aspect>} approach, or ... I guess I would weakly favour the babel method as it offers decomposition of the data into different aspects, which I suspect will become more useful over time.

What we really wanted to avoid is a set of 'if package loaded' tests in the kernel. The aim of the current temporary situation was to move forward somewhat whilst not simply putting in a set of patches that would be forgotten and bite all of us later. The current thread is in that sense what I was anticipating: coming back to the interface questions and trying to get to a position everyone can work with.

u-fischer commented 1 year ago

\csuse{bcp47id} is expandable and works well for the hyperref case. But since you seem to be unwilling to support everything else than babel syntax,

I'm willing to support whatever syntax, but as Joseph wrote: we need one syntax that in the long term is not dependant on package tests. The need to query the locale will appear also in other places, e.g. in tagging structures where one sometimes have to set the language too, and it won't do if the kernel or hyperref has to plant everywhere tests about special package code.

I am going to patch into \MakeUppercaseUnsupportedInPdfStrings and \MakeLowercaseUnsupportedInPdfStrings as well.

Well I can't prevent you to go this way, but I doubt that these commands will stay forever in hyperref.

@josephwright

I guess I would weakly favour the babel method as it offers decomposition of the data into different aspects, which I suspect will become more useful over time.

Yes, I agree with this.

jspitz commented 1 year ago

I completely agree with "a common interface would be ideal". However, devising such a common interface requires, in my opinion, that all involved parties sit down and talk about how this should look like. At least in the cases where I have been involved, this has not been the case. The babel maintainer did not show any interest in corporation but simply enforced his own approach (without even considering talking back with us). This is not how collaboration works, and my motivation to embark on this ship is consequently low.

u-fischer commented 1 year ago

The babel maintainer did not show any interest in corporation but simply enforced his own approach (without even considering talking back with us)

\BCPdata was a suggestion from Joseph, and is not a "babel approach". babel only implemented it after the discussion by mail in may in which you participated too. If you don't like the idea, please suggest something else for the collaboration with the kernel and packages like hyperref.

jspitz commented 1 year ago

I have added a very basic (expandable) \BCPdata command. For the time being, this returns the full tag no matter what argument you pass to it. I suppose this will suffice for the current uses with casing. So instead of checking for a particular babel version, you probably can simply check whether this command is defined at document begin.

We can add subtags later if required. This requires a bit more work, as polyglossia currently does only store full tags.

jbezos commented 1 year ago

@josephwright @u-fischer @jspitz

babel defines the following 4 fields whose names are (I think) obvious, because they are used in the BCP 47: language, script, region, and variant (eg, \BCPdata{script}). For the 3rd the CLDR seems to prefer territory, and polyglossia uses ‘variant’ with another meaning. Of course, I’m willing to create aliases if necessary.

Extensions are tentatively named extension.t and extension.u. The private use area is named extension.x which is not quite correct. Since these names are documented as tentative in babel, they may be changed.

jspitz commented 1 year ago

babel defines the following 4 fields whose names are (I think) obvious, because they are used in the BCP 47: language, script, region, and variant (eg, \BCPdata{script}). For the 3rd the CLDR seems to prefer territory, and polyglossia uses ‘variant’ with another meaning. Of course, I’m willing to create aliases if necessary.

I'll go with those as well for polyglossia (in the medium term), and probably extension.x which is used by latin. We don't have u and t extensions yet.

jspitz commented 1 year ago

@jbezos I am wondering whether it would make sense to add

a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference
a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

josephwright commented 1 year ago

I have added a very basic (expandable) \BCPdata command. For the time being, this returns the full tag no matter what argument you pass to it. I suppose this will suffice for the current uses with casing. So instead of checking for a particular babel version, you probably can simply check whether this command is defined at document begin.

We can add subtags later if required. This requires a bit more work, as polyglossia currently does only store full tags.

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

josephwright commented 1 year ago

I will look to adjust the kernel code for the next (2023-06-01) release: does this fit with everyone's thinking?

u-fischer commented 1 year ago

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

josephwright commented 1 year ago

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@jbezos, @jspitz That work for both of you?

jbezos commented 1 year ago

a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference

a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

Why not, but \BCPdata is more a LaTeX command, so I wonder what @josephwright and @u-fischer think. Since it’s not documented (at least in babel I didn’t documented it because it was essentially tentative and ‘unfinished’), we can still make some changes, or at least leave room for future changes.

jbezos commented 1 year ago

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@jbezos, @jspitz That work for both of you?

That’s mostly fine, but I thinking of something like this: we load Polytonic Greek, whose tag is el-polyton. I want to keep this tag for whatever reason, to correctly identify which Greek variant is in force (in the BCP 47 sense, and in this case in the polyglossia sense, too, iir). But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota. I’m not sure how this can be done.

josephwright commented 1 year ago

That’s mostly fine, but I thinking of something like this: we load Polytonic Greek, whose tag is el-polyton. I want to keep this tag for whatever reason, to correctly identify which Greek variant is in force (in the BCP 47 sense, and in this case in the polyglossia sense, too, iir). But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota. I’m not sure how this can be done.

Wouldn't that be el-polyton-x-iota? My reading of the x stuff is it can be used to add 'whatever' to the end of the tag. I suspect I may need to do more work if we want multiple x tag combos, but I think el-polyton-x-iota should already work: we do decomposition into <lang>-<filler>-x-<extension> so it should 'just work'.

jbezos commented 1 year ago

If it ‘just works’, great (actually, didn’t test).

josephwright commented 1 year ago

If it ‘just works’, great (actually, didn’t test).

I did: this does work :)

josephwright commented 1 year ago

a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference

a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

Why not, but \BCPdata is more a LaTeX command, so I wonder what @josephwright and @u-fischer think. Since it’s not documented (at least in babel I didn’t documented it because it was essentially tentative and ‘unfinished’), we can still make some changes, or at least leave room for future changes.

Should we open a kernel issue to hammer this out now we have broad agreement? The plan would be I think for the kernel to define the command and document the expected semantics, but itself do nothing, then leave it to polyglossia and babel to implement whatever parts of the documented API make sense.

u-fischer commented 1 year ago

If it ‘just works’, great (actually, didn’t test).

I did: this does work :)

But wouldn't you get in \MakeUppercase only el from \BCPdata{language}?

josephwright commented 1 year ago

If it ‘just works’, great (actually, didn’t test).

I did: this does work :)

But wouldn't you get in \MakeUppercase only el from \BCPdata{language}?

Er, yes: what I meant is that el-polyton-x-iota is read by the case code as el + x-iota and the polyton part is ignored.

jbezos commented 1 year ago

@josephwright Making tests. I’m still not sure what to do wrt to the last part of my question:

But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota.

Will the casing macros take the private area into account if the corresponding field is defined?

(btw, I wonder if this discussion should take place in the LaTeX repository.)

u-fischer commented 1 year ago

But I also want to activate the iota rules by default, always, without explicitly passing el-x-iota.

Will the casing macros take the private area into account if the corresponding field is defined?

I wondered if an indirection would make sense. So that \MakeUppercase uses say \BCPdata{casing}, and you could then define this e.g. as a combination of \BCPdata{language}-\BCPdata{whatever)?

jbezos commented 1 year ago

@u-fischer 👍 It’s my preferred option, indeed, which imo will make things easier for other packages (eg, xgreek) and for users who don’t like babel or polyglossia. And document what the casing macros expect (which, in fact, is already done).

josephwright commented 1 year ago

(btw, I wonder if this discussion should take place in the LaTeX repository.)

I think that might be best: we are really talking about something separate from the thread here

jspitz commented 1 year ago

Many thanks. I think we can work with this. Probably at some stage we might want to have \providecommand in the packages and \newcommand in the kernel (so no \ifdefined test is needed), but that is likely a refinement for a future 'pass'.

If I got it right, then the kernel can already now make \newcommand\BCPdata[1]{...} with some suitable default definition (and document what the output should be if packages overwrite the definition).

@jbezos, @jspitz That work for both of you?

Works for me.

jspitz commented 1 year ago

a way to access these info also for the main language (as opposed to the current language), e.g. \BCPdata{main.language}, as sometimes it is useful to know the difference

a standardized way to let \BCPdata output the BCP-47 tag as it is recorded in the gloss/ini file (\localeinfo*{tag.bcp47} in babel, \languageid{bcp47} in polyglossia), e.g. \BCPdata{tag} or simply with empty argument

Why not, but \BCPdata is more a LaTeX command, so I wonder what @josephwright and @u-fischer think. Since it’s not documented (at least in babel I didn’t documented it because it was essentially tentative and ‘unfinished’), we can still make some changes, or at least leave room for future changes.

I think it will be good if at least babel and polyglossia use the same arguments here (and then we both document it).

jspitz commented 1 year ago

As there is a kernel change in sight (and hopefully also hyperref), I've removed the patches. Let's get this right from the start. @josephwright I define \BCPdata in polyglossia via \DeclareExpandableDocumentCommand. This works with your plans, doesn't it?

josephwright commented 1 year ago

I've opened https://github.com/latex3/latex2e/issues/1035 as defining the interface really does need to be done at the kernel level.

josephwright commented 1 year ago

I suggest closing here at this point as the issue is being handled elsewhere

jspitz commented 1 year ago

I'll close as soon as I have implemented \BCPdata with all keywords (which will take me some time).

jspitz commented 1 year ago

Everything is done on polyglossia side now. Some subtags might need adjustment, but in general all data is in, and \BCPdata{casing} should always return the appropriate tag.

Closing this as "upstream" bug (for the record, this will be fixed with the next LaTeX kernel release, 2023/06).

jspitz commented 1 year ago

I have just committed poylglossia 1.61 to CTAN. This includes \BCPdata.

jspitz commented 1 year ago

@u-fischer with the latest hyperref release, the following should work right?

% !TeX TS-program = xelatex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Serif}

\usepackage{polyglossia}
\setmainlanguage[variant=poly]{greek}

\usepackage[bookmarks]{hyperref}

\begin{document}
    \section{\MakeUppercase{άδικος, κείμενο, ίριδα}}
    άδικος, κείμενο, ίριδα → \MakeUppercase{άδικος, κείμενο, ίριδα}
\end{document}

still get wrong casing in the bookmarks.

Same with

\usepackage[greek.polutoniko]{babel}

jspitz commented 1 year ago

No, sorry, babel works (after an additional xelatex pass). So is there something wrong with the return value of polyglossia?

u-fischer commented 1 year ago

@jspitz your \BCPdata is not expandable.

\edef\blub{\BCPdata{casing}}\show\blub

gives

> \blub=macro:
->\tl_set:Nn casing{casing}\str_if_in:nnTF {casing}{main.}{\seq_set_split:Nnx \s__seq {.}casing\seq_get_right:NN \s__seq casing\str_set_eq:NN greekgreek}{\str_set_eq:NN greekgreek}\cs_set_nopar:cpx {tmpcasing}{}el.

In babel I get

> \blub=macro:
->el.

hyperref needs expandable commands in the bookmarks (and the kernel needs an expandable \BCPdata too).

jspitz commented 1 year ago

Any advise how to make it expandable is welcome.

jspitz commented 1 year ago

I have reworked \BCPdata at 8a5a2fd8c7e7041a4. It is now expandable. A hotfix release (1.61b) will go to CTAN later today.

reutenauer / polyglossia

Upcasing in Greek no longer drops diacritics #581