latex3 / latex3

The expl3 (LaTeX3) Development Repository
https://latex-project.org/latex3.html
LaTeX Project Public License v1.3c

Fix `\MakeUppercase` regarding Greek upcasing rules with LICR input. #1231

Closed gmilde closed 1 year ago

gmilde commented 1 year ago

Remaining issue from https://github.com/latex3/latex2e/issues/987.

With LICR input:

Diacritics are not dropped if the input uses the standard accent macros (\' \~ ...). This can be solved for pdflatex, but may require a different configuration interface (or just better documentation) for xe/lualatex.

The "usrguide" states:

The input given to these commands is ‘expanded’ before case changing is applied. This means that any commands within the input that convert to pure text will be case changed.

However, LICR input and literal input are handled differently regarding the Greek uppercase rules.

After loading babel-greek, e.g. the LICR \'\textalpha is converted to character ά (03AC GREEK SMALL LETTER ALPHA WITH TONOS).

However, with xelatex or lualatex, the minimal example

\documentclass[a4paper]{article}
\usepackage[greek]{babel}
%\usepackage[greek,english,provide=*]{babel}    % Babel's Greek "ini"

\usepackage{fontspec}
\setmainfont{FreeSerif}

\begin{document}

\newcommand*{\testsample}{ά, \acctonos\textalpha, \'\textalpha}
\langGreek{\testsample{} → \MakeUppercase{\testsample{}}}

\end{document}

results in

ά, ά, ά → Α, Α, Ά

\DeclareUnicodeAccent{\acctonos}    \UnicodeEncodingName{"0301}
[...]
\DeclareUnicodeComposite{\'}                {\textalpha}  {"03AC} % ά
[...]
\DeclareUnicodeComposite{\acctonos}         {\textalpha}  {"03AC} % ά

It seems as if the LICRs for GREEK SMALL LETTER ALPHA WITH TONOS are only partially expanded and the correct upcasing of \acctonos\textalpha is due to greek-fontenc.def which extends the \@uclclist with the mapping \acctonos\LGR@hiatus (the latter prints its argument without diacritic and adds a dialytika on the second-next vowel if required for disambiguation).

However, a \@uclclist mapping of the standard accent macro \' would also affect Latin and Cyrillic characters. This is OK with 8-bit TeX, where LGR maps Latin to Greek anyway but not for Unicode fonts (TU).

I experimented with a mapping \'\accACUTE, a default to keep the accent, and Composite definitions dropping it, but did not manage to solve the problem.

I wonder whether there is more detailed documentation on the workings of the case-changing code. For configuration, I could imagine

josephwright commented 1 year ago

The "usrguide" states:

The input given to these commands is ‘expanded’ before case changing is applied. This means that any commands within the input that convert to pure text will be case changed.

However, LICR input and literal input are handled differently regarding the Greek uppercase rules.

Indeed, that's exactly because they are expanded in a way that tries to retain as far as possible the user's choice of representation. If you try for example

\documentclass{article}
\usepackage[greek]{babel}
\makeatletter
\protected@edef\foo{\'\textalpha}\show\foo

you get \'\textalpha, which is not the same at all as ά (at least from a token-handling point of view). So the code that's set up to deal with Greek accents doesn't see the LICR version at all.

Whilst we could arrange to look for all of the combinations (\'α, etc. as well as \'\textalpha), I am not sure that is the best approach. You end up with two code paths for the same ideas, and that's asking for subtle differences. (That was the reason I moved from having separate 8-bit and Unicode handling in the first place.) We could also do a 'partial' expansion of e.g. \textalpha to α within Greek blocks, but I am worried that is error-prone.

The wider point is not just an issue for case changing. If you want to write e.g. PDF bookmarks, you need the 'pure text' equivalent of the input, which is currently handled as a separate step (and needs data which is a bit spread out). Similarly, if someone wants to map over the graphemes in some text, they'd also face the same issues. So we do need a mechanism to convert the LICRs, it's a question of where it sits.

I suspect that the best long-term fix is to adjust the 'text expansion' code such that LICRs that can be mapped to Unicode codepoints are. There would remain some issue with those that require combining chars, as they can't neatly be handled in 8-bit engines. (I suspect there an engine-dependent pathway is more-or-less inevitable.)

However, this is quite a significant change at a policy level, so I'd like wider input.

josephwright commented 1 year ago

In favour of the status quo for treatment of LICRs in expansion is that one can't (at present) be sure that LICR -> Unicode will round-trip. That's fine for \text_purify:n, but not for case changing, grapheme mapping, etc., as we likely will need to typeset the result and this could rely on the LICR. (That's on top of the combining chars issues.)

A selective 'more expansion' mechanism would presumably not have that issue, as it would be clearly opt-in and so it would be reasonable to assume round-tripping.
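The difference between the two expansion steps discussed above can be inspected directly. The following is a minimal sketch, assuming a recent expl3 in which `\text_expand:n` and `\text_purify:n` are available and babel-greek provides `\textalpha`. The scratch variables `\l_tmpa_tl` and `\l_tmpb_tl` are used only for illustration. What is written to the log depends on the engine and on the installed versions, so no output is claimed here.

```latex
\documentclass{article}
\usepackage[greek]{babel}

\ExplSyntaxOn
% Text expansion as applied before case changing: the LICR input is
% expected to stay in its macro form rather than become a single character.
\tl_set:Nx \l_tmpa_tl { \text_expand:n { \'\textalpha } }
\tl_log:N \l_tmpa_tl
% Purification aims at a "pure text" result, e.g. for PDF bookmarks.
\tl_set:Nx \l_tmpb_tl { \text_purify:n { \'\textalpha } }
\tl_log:N \l_tmpb_tl
\ExplSyntaxOff

\begin{document}
\end{document}
```

Comparing the two log entries shows which representation the case-changing code gets to see and which one a bookmark writer would get.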

gmilde commented 1 year ago

I managed to solve the Greek upcasing for LICR and literal characters with short accents in about 120 lines of code. The following example combines the code required on top of a current TL23 in the preamble with a test document.

% Rolling back does not work for \MakeUppercase (cf. LaTeX News 35)
% \RequirePackage{latexbug}
%\RequirePackage[2022-05-01]{latexrelease}

\documentclass[a4paper]{article}

\usepackage[LGR,T1]{fontenc}
\usepackage{lmodern}

\ifdefined \UnicodeEncodingName
  \usepackage{fontspec}
  \setmainfont{FreeSerif}
  \newcommand*{\texengine}{Xe/LuaLaTeX}
\else
  \usepackage{lmodern}
  \newcommand*{\texengine}{pdfLaTeX}
\fi

% Load encoding definitions
\usepackage[normalize-symbols]{textalpha}  % "Greek script everywhere"

% With TL22, the special handling of Greek UPPERCASE is only triggered
% if the text language is set to "greek" with Babel:
%
\usepackage[greek,english]{babel}  % babel-greek
% \usepackage[greek,english,provide=*]{babel}    % Babel's Greek "ini"
\languageattribute{greek}{polutoniko}  % "modern" polytonic Greek
% \languageattribute{greek}{ancient}

\usepackage[unicode,colorlinks,linkcolor=blue]{hyperref}
\usepackage{bookmark}

% Auxiliary commands

\newcommand{\langGreek}{\foreignlanguage{greek}}

% print the selected language variant
\newcommand{\GreekLanguageVariant}{%
  \ifx\captionsgreek\captionspolutonikogreek
    \ifx\captionsgreek\captionsancientgreek
      ancient%
    \else
      polutoniko%
    \fi
  \else
    monotoniko%
  \fi
}

% workaround for MakeUppercase:

\makeatletter

% for textalpha.sty (already present in 2.4dev)
\ifdefined\DeclareCaseChangeEquivalent % new in 2023
  \DeclareCaseChangeEquivalent{\<}{\accdasia}
  \DeclareCaseChangeEquivalent{\>}{\accpsili}
\fi

% for Babel (already present for LGR in 1.13.2)
\IfFormatAtLeastTF{2022/06/01}%
  {\DeclareTextCommandDefault{\accACUTE}{\@tabacckludge'}
   \DeclareTextCommandDefault{\accGRAVE}{\@tabacckludge`}
   \DeclareTextCommandDefault{\accTILDE}{\@tabacckludge~}
   \addto\@uclclist{\'\accACUTE \`\accGRAVE \~\accTILDE}%
  }%
  {}%

\ifdefined \UnicodeEncodingName
 \IfFormatAtLeastTF{2022/06/01}{%

% already in greek-fontenc 2.3.dev
  \DeclareTextCompositeCommand{\LGR@hiatus}{TU}{'}{\LGR@hiatus}
  \DeclareTextCompositeCommand{\LGR@hiatus}{TU}{`}{\LGR@accdropped}

% for tuenc-greek.def
  \DeclareTextCommand{\accACUTE}{TU}{\@tabacckludge '}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{"}{\accdialytika}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{>}{\LGR@hiatus}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textAlpha  }{\LGR@A@hiatus}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textEpsilon}{\LGR@E@hiatus}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textEta    }{Η}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textIota   }{Ι}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textOmicron}{Ο}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textUpsilon}{Υ}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{\textOmega  }{Ω}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Α}{\LGR@A@hiatus}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Ε}{\LGR@E@hiatus}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Η}{Η}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Ι}{Ι}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Ο}{Ο}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Υ}{Υ}
  \DeclareTextCompositeCommand{\accACUTE}{TU}{Ω}{Ω}

  \DeclareTextCompositeCommand{\accdialytikatonos}{TU}{\textIota}{Ϊ}
  \DeclareTextCompositeCommand{\accdialytikatonos}{TU}{\textUpsilon}{Ϋ}
  \DeclareTextCompositeCommand{\accdialytikatonos}{TU}{Ι}{Ϊ}
  \DeclareTextCompositeCommand{\accdialytikatonos}{TU}{Υ}{Ϋ}

  \DeclareTextCommand{\accGRAVE}{TU}{\@tabacckludge`}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{"}{\accdialytika}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{>}{\LGR@accdropped}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textAlpha  }{Α}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textEpsilon}{Ε}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textEta    }{Η}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textIota   }{Ι}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textOmicron}{Ο}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textUpsilon}{Υ}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{\textOmega  }{Ω}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Α}{Α}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Ε}{Ε}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Η}{Η}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Ι}{Ι}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Ο}{Ο}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Υ}{Υ}
  \DeclareTextCompositeCommand{\accGRAVE}{TU}{Ω}{Ω}

  \DeclareTextCommand{\accTILDE}{TU}{\@tabacckludge~}
  \DeclareTextCompositeCommand{\accTILDE}{TU}{"}{\accdialytika}
  \DeclareTextCompositeCommand{\accTILDE}{TU}{>}{\LGR@accdropped}
  \DeclareTextCompositeCommand{\accTILDE}{TU}{<}{\LGR@accdropped}

  \DeclareTextCompositeCommand{\LGR@hiatus}{TU}{Α}{\LGR@A@hiatus}
  \DeclareTextCompositeCommand{\LGR@hiatus}{TU}{Ε}{\LGR@E@hiatus}

  \DeclareTextCompositeCommand{\LGR@accdropped}{TU}{'}{\LGR@accdropped}
  \DeclareTextCompositeCommand{\LGR@accdropped}{TU}{`}{\LGR@accdropped}

  \DeclareTextCommand{\accpsilioxia}{TU}[1]{#1\char"0313\relax\char"0301\relax}

 }{} % end IfFormatAtLeastTF
\fi

\makeatother

% -----------------------------------------------------------------------

\begin{document}

\title{Test case conversions of Greek letters}
\author{Günter Milde}
\maketitle

\tableofcontents

\abstract{
This document tests the combination of \verb|\MakeUppercase| and Greek.

\makeatletter
It is compiled with \texengine, format version \fmtversion{} patch-level
\patch@level{} and the L3 programming layer from \ExplFileDate{}.
The \verb|\greekfontencoding| is \greekfontencoding.
\makeatother
}

\section{short accent macros}

This section compares literal Unicode Greek characters to characters input
using LICR macros.

Accents on Latin letters must be kept:
\'a \`a \~a → \MakeUppercase{\'a \`a \~a}

\subsection{Greek and Coptic}

\newcommand{\GreekAndCoptic}{% only characters supported by LGR
  \raggedright

  ΄  ΅  Ά  ·  Έ  Ή  Ί  ␣  Ό  ␣  Ύ  Ώ  \\

  \'{ } \"'{ } \'\textAlpha{} \textanoteleia{}
  \'\textEpsilon{} \'\textEta{} \'\textIota{}
  ␣ \'\textOmicron{} ␣ \'\textUpsilon{} \'\textOmega{} \\

  ΐ  Ω  Ϊ  Ϋ  ά  έ  ή  ί  \\
  \"'\textiota{}
  \textOmega{} \"\textIota{}
  \"\textUpsilon{} \'\textalpha{} \'\textepsilon{} \'\texteta{}
  \'\textiota{} \\

  ΰ
  ϊ  ϋ  ό  ύ  ώ  ␣  \\
  \'"\textupsilon{}
  \"\textiota{}
  \"\textupsilon{} \'\textomicron{} \'\textupsilon{} \'\textomega{} ␣\\
}

No case change:
\begin{quote}
  \selectlanguage{greek}
  \GreekAndCoptic
\end{quote}
%
MakeUppercase:
\begin{quote}
  \selectlanguage{greek}
  \MakeUppercase{\GreekAndCoptic}
\end{quote}
%
MakeLowercase:
\begin{quote}
 \selectlanguage{greek}
  \MakeLowercase{\GreekAndCoptic}
\end{quote}

% \end{document}

\subsection{Greek extended}

\newcommand{\GreekExtended}{\raggedright
  ἀ   ἁ   ἂ   ἃ   ἄ   ἅ   ἆ    ἇ    Ἀ   Ἁ   Ἂ   Ἃ   Ἄ   Ἅ   Ἆ  Ἇ \\
  \>\textalpha{}
  \<\textalpha{}
  \`>\textalpha{}
  \<`\textalpha{}
  \>'\textalpha{}
  \<'\textalpha{}
  \>~\textalpha{}
  \<~\textalpha{}
  \>\textAlpha{}
  \<\textAlpha{}
  \>`\textAlpha{}
  \<`\textAlpha{}
  \>'\textAlpha{}
  \<'\textAlpha{}
  \~>\textAlpha{}
  \~<\textAlpha{} \\

  ἐ   ἑ   ἒ   ἓ   ἔ   ἕ   ␣    ␣    Ἐ   Ἑ   Ἒ   Ἓ   Ἔ   Ἕ        \\
  \>\textepsilon{}
  \<\textepsilon{}
  \>`\textepsilon{}
  \<`\textepsilon{}
  \>'\textepsilon{}
  \<'\textepsilon{}
  ␣ ␣ \>\textEpsilon{}
  \<\textEpsilon{}
  \>`\textEpsilon{}
  \<`\textEpsilon{}
  \>'\textEpsilon{}
  \<'\textEpsilon{}\\

  ἠ   ἡ   ἢ   ἣ   ἤ   ἥ   ἦ    ἧ    Ἠ   Ἡ   Ἢ   Ἣ   Ἤ   Ἥ   Ἦ  Ἧ \\
  \>\texteta{}
  \<\texteta{}
  \>`\texteta{}
  \<`\texteta{}
  \>'\texteta{}
  \<'\texteta{}
  \~>\texteta{}
  \~<\texteta{}
  \>\textEta{}
  \<\textEta{}
  \>`\textEta{}
  \<`\textEta{}
  \'>\textEta{}
  \<'\textEta{}
  \~>\textEta{}
  \~<\textEta{} \\

  ἰ   ἱ   ἲ   ἳ   ἴ   ἵ   ἶ    ἷ    Ἰ   Ἱ   Ἲ   Ἳ   Ἴ   Ἵ   Ἶ  Ἷ \\
  \>\textiota{}
  \<\textiota{}
  \>`\textiota{}
  \<`\textiota{}
  \>'\textiota{}
  \<'\textiota{}
  \~>\textiota{}
  \~<\textiota{}
  \>\textIota{}
  \<\textIota{}
  \>`\textIota{}
  \<`\textIota{}
  \>'\textIota{}
  \<'\textIota{}
  \~>\textIota{}
  \~<\textIota{} \\

  ὀ   ὁ   ὂ   ὃ   ὄ   ὅ   ␣    ␣    Ὀ   Ὁ   Ὂ   Ὃ   Ὄ   Ὅ        \\
  \>\textomicron{}
  \<\textomicron{}
  \>`\textomicron{}
  \<`\textomicron{}
  \>'\textomicron{}
  \<'\textomicron{}
  ␣ ␣ \>\textOmicron{}
  \<\textOmicron{}
  \>`\textOmicron{}
  \<`\textOmicron{}
  \>'\textOmicron{}
  \<'\textOmicron{} \\

  ὐ   ὑ   ὒ   ὓ   ὔ   ὕ   ὖ    ὗ    ␣   Ὑ   ␣   Ὓ   ␣   Ὕ   ␣  Ὗ \\
  \>\textupsilon{}
  \<\textupsilon{}
  \>`\textupsilon{}
  \<`\textupsilon{}
  \>'\textupsilon{}
  \<'\textupsilon{}
  \~>\textupsilon{}
  \~<\textupsilon{}
  ␣ \<\textUpsilon{}
  ␣ \<`\textUpsilon{}
  ␣ \<'\textUpsilon{}
  ␣ \~<\textUpsilon{} \\

  ὠ   ὡ   ὢ   ὣ   ὤ   ὥ   ὦ    ὧ    Ὠ   Ὡ   Ὢ   Ὣ   Ὤ   Ὥ   Ὦ  Ὧ \\
  \>\textomega{}
  \<\textomega{}
  \>`\textomega{}
  \<`\textomega{}
  \>'\textomega{}
  \<'\textomega{}
  \~>\textomega{}
  \~<\textomega{}
  \>\textOmega{}
  \<\textOmega{}
  \>`\textOmega{}
  \<`\textOmega{}
  \>'\textOmega{}
  \<'\textOmega{}
  \~>\textOmega{}
  \~<\textOmega{} \\

  ὰ   ά   ὲ   έ   ὴ   ή   ὶ    ί    ὸ   ό   ὺ   ύ   ὼ   ώ        \\
  \`\textalpha{}
  \'\textalpha{}
  \`\textepsilon{}
  \'\textepsilon{}
  \`\texteta{}
  \'\texteta{}
  \`\textiota{}
  \'\textiota{}
  \`\textomicron{}
  \'\textomicron{}
  \`\textupsilon{}
  \'\textupsilon{}
  \`\textomega{}
  \'\textomega{} \\

  ᾀ   ᾁ   ᾂ   ᾃ   ᾄ   ᾅ   ᾆ    ᾇ    ᾈ   ᾉ   ᾊ   ᾋ   ᾌ   ᾍ   ᾎ  ᾏ \\
  \>\textalpha\ypogegrammeni{}
  \<\textalpha\ypogegrammeni{}
  \>`\textalpha\ypogegrammeni{}
  \<`\textalpha\ypogegrammeni{}
  \>'\textalpha\ypogegrammeni{}
  \<'\textalpha\ypogegrammeni{}
  \~>\textalpha\ypogegrammeni{}
  \~<\textalpha\ypogegrammeni{}
  \>\textAlpha\ypogegrammeni{}
  \<\textAlpha\ypogegrammeni{}
  \>`\textAlpha\ypogegrammeni{}
  \<`\textAlpha\ypogegrammeni{}
  \>'\textAlpha\ypogegrammeni{}
  \<'\textAlpha\ypogegrammeni{}
  \~>\textAlpha\ypogegrammeni{}
  \~<\textAlpha\ypogegrammeni{} \\

  ᾐ   ᾑ   ᾒ   ᾓ   ᾔ   ᾕ   ᾖ    ᾗ    ᾘ   ᾙ   ᾚ   ᾛ   ᾜ   ᾝ   ᾞ  ᾟ \\
  \>\texteta\ypogegrammeni{}
  \<\texteta\ypogegrammeni{}
  \>`\texteta\ypogegrammeni{}
  \<`\texteta\ypogegrammeni{}
  \>'\texteta\ypogegrammeni{}
  \<'\texteta\ypogegrammeni{}
  \~>\texteta\ypogegrammeni{}
  \~<\texteta\ypogegrammeni{}
  \>\textEta\ypogegrammeni{}
  \<\textEta\ypogegrammeni{}
  \>`\textEta\ypogegrammeni{}
  \<`\textEta\ypogegrammeni{}
  \>'\textEta\ypogegrammeni{}
  \<'\textEta\ypogegrammeni{}
  \>~\textEta\ypogegrammeni{}
  \<~\textEta\ypogegrammeni{} \\

  ᾠ   ᾡ   ᾢ   ᾣ   ᾤ   ᾥ   ᾦ    ᾧ    ᾨ   ᾩ   ᾪ   ᾫ   ᾬ   ᾭ   ᾮ  ᾯ \\
  \>\textomega\ypogegrammeni{}
  \<\textomega\ypogegrammeni{}
  \>`\textomega\ypogegrammeni{}
  \<`\textomega\ypogegrammeni{}
  \>'\textomega\ypogegrammeni{}
  \<'\textomega\ypogegrammeni{}
  \~>\textomega\ypogegrammeni{}
  \~<\textomega\ypogegrammeni{}
  \>\textOmega\ypogegrammeni{}
  \<\textOmega\ypogegrammeni{}
  \>`\textOmega\ypogegrammeni{}
  \<`\textOmega\ypogegrammeni{}
  \>'\textOmega\ypogegrammeni{}
  \<'\textOmega\ypogegrammeni{}
  \~>\textOmega\ypogegrammeni{}
  \~<\textOmega\ypogegrammeni{} \\

  ᾰ   ᾱ   ᾲ   ᾳ   ᾴ   ␣   ᾶ    ᾷ    Ᾰ   Ᾱ   Ὰ   Ά   ᾼ   ᾽   ι  ᾿ \\
  \u\textalpha{}
  \=\textalpha{}
  \`\textalpha\ypogegrammeni{}
  \textalpha\ypogegrammeni{}
  \'\textalpha\ypogegrammeni{}
  ␣ \~\textalpha{}
  \~\textalpha\ypogegrammeni{}
  \u\textAlpha{}
  \=\textAlpha{}
  \`\textAlpha{}
  \'\textAlpha{}
  \textAlpha\ypogegrammeni{}
  \>{}
  \prosgegrammeni{}
  \>{} \\

  ῀   ῁   ῂ   ῃ   ῄ   ␣   ῆ    ῇ    Ὲ   Έ   Ὴ   Ή   ῌ   ῍   ῎  ῏ \\
  \~{}
  \"\~{}
  \`\texteta\ypogegrammeni{}
  \texteta\ypogegrammeni{}
  \'\texteta\ypogegrammeni{}
  ␣ \~\texteta{}
  \~\texteta\ypogegrammeni{}
  \`\textEpsilon{}
  \'\textEpsilon{}
  \`\textEta{}
  \'\textEta{}
  \textEta\ypogegrammeni{}
  \>`{}
  \>'{}
  \~>{} \\

  ῐ   ῑ   ῒ   ΐ   ␣   ␣   ῖ    ῗ    Ῐ   Ῑ   Ὶ   Ί   ␣   ῝   ῞  ῟ \\
  \u\textiota{}
  \=\textiota{}
  \`"\textiota{}
  \'"\textiota{}
  ␣ ␣ \~\textiota{}
  \~"\textiota{}
  \u\textIota{}
  \=\textIota{}
  \`\textIota{}
  \'\textIota{}
  ␣
  \<`{}
  \<'{}
  \~<{} \\

  ῠ   ῡ   ῢ   ΰ   ῤ    ῥ    ῦ   ῧ   Ῠ  Ῡ  Ὺ   Ύ   Ῥ   ῭   ΅  ` \\
  \u\textupsilon{}
  \=\textupsilon{}
  \`"\textupsilon{}
  \'"\textupsilon{}
  \>\textrho{}
  \<\textrho{}
  \~\textupsilon{}
  \~"\textupsilon{}
  \u\textUpsilon{}
  \=\textUpsilon{}
  \`\textUpsilon{}
  \'\textUpsilon{}
  \<\textRho{}
  \`"{}
  \'"{}
  \`{} \\

  ␣   ␣   ῲ   ῳ   ῴ   ␣   ῶ    ῷ    Ὸ   Ό   Ὼ   Ώ   ῼ   ´   ῾  ␣ \\

  ␣ ␣ \`\textomega\ypogegrammeni{}
  \textomega\ypogegrammeni{}
  \'\textomega\ypogegrammeni{}
  ␣ \~\textomega{}
  \~\textomega\ypogegrammeni{}
  \`\textOmicron{}
  \'\textOmicron{}
  \`\textOmega{}
  \'\textOmega{}
  \textOmega\ypogegrammeni{}
  \'{}
  \<{} ␣
}

No case change:
\begin{quote}
  \selectlanguage{greek}
  \GreekExtended
\end{quote}
%
MakeUppercase:
\begin{quote}
  \selectlanguage{greek}
  \MakeUppercase{\GreekExtended}
\end{quote}
%
MakeLowercase:
\begin{quote}
  \selectlanguage{greek}
  \MakeLowercase{\GreekExtended}
\end{quote}

\subsection{Hiatus}

Tonos and psili mark a \emph{hiatus} (break-up of a diphthong) if
placed on the first vowel of a diphthong.
A dialytika must be placed on the second vowel if they are dropped, e.g.
%
\newcommand{\HiatusNamed}{\acctonos\textalpha\textiota,
                         \acctonos\textalpha\textupsilon,
                         \accpsilioxia\textalpha\textiota,
                         \accpsili\accoxia\textalpha\textupsilon,
                         \accpsili\textalpha\textupsilon,
                         \acctonos\textepsilon\textiota,
                         \accoxia\textepsilon\textiota}%
\ensuregreek{\HiatusNamed\ $\mapsto$ \MakeUppercase{\HiatusNamed}}.

Some affected words:
\begin{quotation}
  \selectlanguage{greek}
  \newcommand*{\aylos}{% from teubner.sty: άυλος → ΑΫΛΟΣ
    \acctonos\textalpha\textupsilon\textlambda\textomicron\textfinalsigma}
  \aylos{} $\mapsto$ \MakeUppercase{\aylos},
  \renewcommand*{\aylos}{% polytonic: ἄυλος → ΑΫΛΟΣ
    \'>\textalpha\textupsilon\textlambda\textomicron\textfinalsigma}
  \aylos{} $\mapsto$ \MakeUppercase{\aylos},
  % https://lsj.gr/wiki/ἀυπνία
  \newcommand*{\ahypnia}{% ἀυπνία → ΑΫΠΝΙΑ
    \accpsili\textalpha\textupsilon\textpi\textnu\acctonos\textiota\textalpha}
  \ahypnia{} $\mapsto$ \MakeUppercase{\ahypnia},

  % from http://diacritics.typo.cz/index.php?id=69
  \newcommand*{\maina}{%μάινα → ΜΑΪΝΑ
    \textmu\acctonos\textalpha\textiota\textnu\textalpha}
  \maina{} $\mapsto$ \MakeUppercase{\maina},
  % from  http://de.wikipedia.org/wiki/Neugriechische_Orthographie#Das_Trema
  \newcommand*{\keik}{% κέικ → ΚΕΪΚ
    \textkappa\acctonos\textepsilon\textiota\textkappa}
  \keik{} $\mapsto$ \MakeUppercase{\keik},
  % from http://multilingualtypesetting.co.uk/blog/greek-typesetting-tips/
  \newcommand*{\romeika}{\textrho\textomega\textmu
                         \acctonos\textepsilon\textiota\textkappa\textalpha}
  \romeika{} $\mapsto$ \MakeUppercase{\romeika}.
\end{quotation}

With the pre-2022/06 \verb|\MakeUppercase|, automatic upcasing of words with
\emph{hiatus} works correctly only if the accents are input as macro and the
letters as macro or via the Latin transliteration.

Hiatus examples with short accent macros and LICR:

\newcommand{\HiatusShort}{\'\textalpha\textiota,
                         \'\textalpha\textupsilon,
                         \>'\textalpha\textupsilon,
                         \'>\textalpha\textupsilon,
                         \>\textalpha\textupsilon,
                         \'\textepsilon\textiota,
                         \>\textalpha\textupsilon,
                         \>'\textepsilon\textiota,
                         \'>\textepsilon\textiota
                        }%
\ensuregreek{\HiatusShort\ $\mapsto$ \MakeUppercase{\HiatusShort}}.

\section{short accent macros + literal character}

This section compares literal Unicode Greek characters to characters input
using accent macros and the literal base character.

\ifdefined \UnicodeEncodingName
\else
  \begin{quote} \em
  Skipped, as accent macros on a Greek literal Unicode character lead
  to errors.
  \end{quote}
  \end{document}
\fi

\subsection{Greek and Coptic}

\renewcommand{\GreekAndCoptic}{% only characters supported by LGR
  \raggedright
  ␣  ␣  ␣  ␣  ΄  ΅       Ά ·   Έ   Ή   Ί ␣   Ό ␣   Ύ   Ώ \\
  ␣  ␣  ␣  ␣  ΄ \"'{ } \'Α · \'Ε \'Η \'Ι ␣ \'Ο ␣ \'Υ \'Ω \\

     ΐ  Α  Β  Γ Δ  Ε  Ζ  Η  Θ  Ι  Κ  Λ  Μ  Ν  Ξ  Ο \\
  \'"ι  Α  Β  Γ Δ  Ε  Ζ  Η  Θ  Ι  Κ  Λ  Μ  Ν  Ξ  Ο \\

  Π  Ρ  ␣  Σ  Τ  Υ  Φ  Χ  Ψ  Ω   Ϊ   Ϋ   ά   έ   ή   ί \\
  Π  Ρ  ␣  Σ  Τ  Υ  Φ  Χ  Ψ  Ω \"Ι \"Υ \'α \'ε \'η \'ι \\

  ΰ    α  β  γ  δ  ε  ζ  η  θ  ι  κ  λ  μ  ν  ξ  ο \\
  \"'υ α  β  γ  δ  ε  ζ  η  θ  ι  κ  λ  μ  ν  ξ  ο \\

  π  ρ  ς  σ  τ  υ  φ  χ  ψ  ω   ϊ   ϋ   ό   ύ   ώ ␣ \\
  π  ρ  ς  σ  τ  υ  φ  χ  ψ  ω \"ι \"υ \'ο \'υ \'ω ␣ \\
}

No case change:

\begin{quote}
  \selectlanguage{greek}
  \GreekAndCoptic
\end{quote}
%
MakeUppercase:
\begin{quote}
  \selectlanguage{greek}
  \MakeUppercase{\GreekAndCoptic}
\end{quote}
%
MakeLowercase:
\begin{quote}
 \selectlanguage{greek}
  \MakeLowercase{\GreekAndCoptic}
\end{quote}

\subsection{Greek extended}

\renewcommand{\GreekExtended}{\raggedright
  ἀ   ἁ   ἂ   ἃ   ἄ   ἅ   ἆ    ἇ    Ἀ   Ἁ   Ἂ   Ἃ   Ἄ   Ἅ   Ἆ  Ἇ \\
  \>α \<α \`>α \<`α \>'α \<'α \~>α \~<α
  \>Α \<Α \>`Α \<`Α \>'Α \<'Α \~>Α \~<Α \\

  ἐ   ἑ   ἒ   ἓ   ἔ   ἕ   ␣    ␣    Ἐ   Ἑ   Ἒ   Ἓ   Ἔ   Ἕ        \\
  \>ε \<ε \>`ε \<`ε \>'ε \<'ε ␣ ␣
  \>Ε \<Ε \>`Ε \<`Ε \>'Ε \<'Ε\\

  ἠ   ἡ   ἢ   ἣ   ἤ   ἥ   ἦ    ἧ    Ἠ   Ἡ   Ἢ   Ἣ   Ἤ   Ἥ   Ἦ  Ἧ \\
  \>η \<η \>`η \<`η \>'η \<'η \~>η \~<η
  \>Η \<Η \>`Η \<`Η \'>Η \<'Η \~>Η \~<Η \\

  ἰ   ἱ   ἲ   ἳ   ἴ   ἵ   ἶ    ἷ    Ἰ   Ἱ   Ἲ   Ἳ   Ἴ   Ἵ   Ἶ  Ἷ \\
  \>ι \<ι \>`ι \<`ι \>'ι \<'ι \~>ι \~<ι
  \>Ι \<Ι \>`Ι \<`Ι \>'Ι \<'Ι \~>Ι \~<Ι \\

    ὀ   ὁ    ὂ    ὃ    ὄ    ὅ ␣ ␣   Ὀ   Ὁ    Ὂ    Ὃ    Ὄ    Ὅ    \\
  \>ο \<ο \>`ο \<`ο \>'ο \<'ο ␣ ␣ \>Ο \<Ο \>`Ο \<`Ο \>'Ο \<'Ο    \\

  ὐ   ὑ   ὒ   ὓ   ὔ   ὕ   ὖ    ὗ    ␣   Ὑ   ␣   Ὓ   ␣   Ὕ   ␣  Ὗ \\
  \>υ \<υ \>`υ \<`υ \>'υ \<'υ \~>υ \~<υ ␣ \<Υ ␣ \<`Υ ␣ \<'Υ ␣ \~<Υ \\

  ὠ   ὡ   ὢ   ὣ   ὤ   ὥ   ὦ    ὧ    Ὠ   Ὡ   Ὢ   Ὣ   Ὤ   Ὥ   Ὦ  Ὧ \\
  \>ω \<ω \>`ω \<`ω \>'ω \<'ω \~>ω \~<ω
  \>Ω \<Ω \>`Ω \<`Ω \>'Ω \<'Ω \~>Ω \~<Ω \\

    ὰ   ά   ὲ   έ   ὴ   ή   ὶ   ί   ὸ   ό   ὺ   ύ   ὼ   ώ        \\
  \`α \'α \`ε \'ε \`η \'η \`ι \'ι \`ο \'ο \`υ \'υ \`ω \'ω        \\

  ᾀ   ᾁ   ᾂ   ᾃ   ᾄ   ᾅ   ᾆ    ᾇ    ᾈ   ᾉ   ᾊ   ᾋ   ᾌ   ᾍ   ᾎ  ᾏ \\
  \>α\ypogegrammeni{}
  \<α\ypogegrammeni{}
  \>`α\ypogegrammeni{}
  \<`α\ypogegrammeni{}
  \>'α\ypogegrammeni{}
  \<'α\ypogegrammeni{}
  \~>α\ypogegrammeni{}
  \~<α\ypogegrammeni{}
  \>Α\ypogegrammeni{}
  \<Α\ypogegrammeni{}
  \>`Α\ypogegrammeni{}
  \<`Α\ypogegrammeni{}
  \>'Α\ypogegrammeni{}
  \<'Α\ypogegrammeni{}
  \~>Α\ypogegrammeni{}
  \~<Α\ypogegrammeni{} \\

  ᾐ   ᾑ   ᾒ   ᾓ   ᾔ   ᾕ   ᾖ    ᾗ    ᾘ   ᾙ   ᾚ   ᾛ   ᾜ   ᾝ   ᾞ  ᾟ \\
  \>η\ypogegrammeni{}
  \<η\ypogegrammeni{}
  \>`η\ypogegrammeni{}
  \<`η\ypogegrammeni{}
  \>'η\ypogegrammeni{}
  \<'η\ypogegrammeni{}
  \~>η\ypogegrammeni{}
  \~<η\ypogegrammeni{}
  \>Η\ypogegrammeni{}
  \<Η\ypogegrammeni{}
  \>`Η\ypogegrammeni{}
  \<`Η\ypogegrammeni{}
  \>'Η\ypogegrammeni{}
  \<'Η\ypogegrammeni{}
  \~>Η\ypogegrammeni{}
  \~<Η\ypogegrammeni{} \\

  ᾠ   ᾡ   ᾢ   ᾣ   ᾤ   ᾥ   ᾦ    ᾧ    ᾨ   ᾩ   ᾪ   ᾫ   ᾬ   ᾭ   ᾮ  ᾯ \\
  \>ω\ypogegrammeni{}
  \<ω\ypogegrammeni{}
  \>`ω\ypogegrammeni{}
  \<`ω\ypogegrammeni{}
  \>'ω\ypogegrammeni{}
  \<'ω\ypogegrammeni{}
  \~>ω\ypogegrammeni{}
  \~<ω\ypogegrammeni{}
  \>Ω\ypogegrammeni{}
  \<Ω\ypogegrammeni{}
  \>`Ω\ypogegrammeni{}
  \<`Ω\ypogegrammeni{}
  \>'Ω\ypogegrammeni{}
  \<'Ω\ypogegrammeni{}
  \~>Ω\ypogegrammeni{}
  \~<Ω\ypogegrammeni{} \\

  ᾰ   ᾱ   ᾲ   ᾳ   ᾴ   ␣   ᾶ    ᾷ    Ᾰ   Ᾱ   Ὰ   Ά   ᾼ   ᾽   ι  ᾿ \\
  \u{α}
  \=α
  \`α\ypogegrammeni{}
  α\ypogegrammeni{}
  \'α\ypogegrammeni{}
  ␣ \~α
  \~α\ypogegrammeni{}
  \u{Α}
  \=Α
  \`Α
  \'Α
  Α\ypogegrammeni{}
  \>{}
  \prosgegrammeni{}
  \>{} \\

  ῀   ῁   ῂ   ῃ   ῄ   ␣   ῆ    ῇ    Ὲ   Έ   Ὴ   Ή   ῌ   ῍   ῎  ῏ \\
  \~{}
  \"\~{}
  \`η\ypogegrammeni{}
  η\ypogegrammeni{}
  \'η\ypogegrammeni{}
  ␣ \~η
  \~η\ypogegrammeni{}
  \`Ε
  \'Ε
  \`Η
  \'Η
  Η\ypogegrammeni{}
  \>`{}
  \>'{}
  \~>{} \\

     ῐ    ῑ    ῒ    ΐ ␣ ␣   ῖ    ῗ    Ῐ    Ῑ   Ὶ   Ί ␣    ῝    ῞    ῟   \\
  \u{ι} \=ι \`"ι \'"ι ␣ ␣ \~ι \~"ι \u{Ι} \=Ι \`Ι \'Ι ␣ \<`{} \<'{} \~<{} \\

  ῠ   ῡ   ῢ   ΰ   ῤ    ῥ    ῦ   ῧ   Ῠ  Ῡ  Ὺ   Ύ   Ῥ   ῭   ΅  ` \\
  \u{υ} \=υ \`"υ \'"υ \>ρ \<ρ \~υ \~"υ
  \u{Υ} \=Υ \`Υ \'Υ \<Ρ \`"{} \'"{} \`{} \\

  ␣   ␣   ῲ   ῳ   ῴ   ␣   ῶ    ῷ    Ὸ   Ό   Ὼ   Ώ   ῼ   ´   ῾  ␣ \\
  ␣ ␣ \`ω\ypogegrammeni{}
  ω\ypogegrammeni{}
  \'ω\ypogegrammeni{}
  ␣ \~ω
  \~ω\ypogegrammeni{}
  \`Ο \'Ο \`Ω \'Ω
  Ω\ypogegrammeni{}
  \'{} \<{} ␣
}

No case change:
\begin{quote}
  \selectlanguage{greek}
  \GreekExtended
\end{quote}
%
MakeUppercase:
\begin{quote}
  \selectlanguage{greek}
  \MakeUppercase{\GreekExtended}
\end{quote}
%
MakeLowercase:
\begin{quote}
  \selectlanguage{greek}
  \MakeLowercase{\GreekExtended}
\end{quote}

Hiatus examples with short accent macros and literal base character:

\renewcommand{\HiatusShort}{\'αι, \'αυ, \>'αυ, \'>αυ, \>αυ, \'ει, \>αυ,
                          \>'ει, \'>ει}%
\ensuregreek{\HiatusShort\ $\mapsto$ \MakeUppercase{\HiatusShort}}.

\end{document}

gmilde commented 1 year ago

The implementation could be made simpler, more similar to the handling of literal characters, and safer if we had a framework to map a function (similar to \DeclareCaseChangeEquivalent) that is locale-sensitive and distinguishes uppercase, titlecase, and lowercase (similar to \DeclareUppercaseMapping etc.).

Then, I could, e.g., write

\DeclareUppercaseEquivalent[el]{\'}{\accACUTE}
\DeclareUppercaseEquivalent[el]{\`}{\accGRAVE}
\DeclareUppercaseEquivalent[el]{\~}{\accTILDE}
\DeclareUppercaseEquivalent[el]{\>}{\LGR@hiatus}

instead of adding to the \@uclclist.

The main advantage is that document parts that are not Greek will not be affected which lowers the danger of unwanted side-effects.
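For literal characters, this kind of locale sensitivity can already be observed per call. The sketch below assumes a recent LaTeX kernel in which the case-changing commands accept the `locale` key in their optional argument, and a Unicode engine with the FreeSerif font installed (as in the examples above). It only illustrates the behaviour the proposed interface would mirror for LICR input and is not part of any proposal in this thread.

```latex
% compile with lualatex or xelatex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{FreeSerif} % assumption: font is installed

\begin{document}
% No Greek locale requested: Greek-specific upcasing rules are not asked for.
\MakeUppercase{ά, é} \par
% Greek rules requested for this call only; the Latin letter should be
% treated the same way as in the line above.
\MakeUppercase[locale=el]{ά, é}
\end{document}
```

A locale argument on the "equivalent" declarations would give accent macros the same scoping: the mapping to `\accACUTE` and friends would only apply where the Greek rules are active.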

josephwright commented 1 year ago

@gmilde I have an idea that might be less 'heavy' and that uses \CaseSwitch: I'll need to test it out and will report back.

gmilde commented 1 year ago

From the description in usrguide.pdf, I got the impression that \CaseSwitch is a user command to be used inside \MakeUppercase. Now I see that I could replace the hypothetical

\DeclareUppercaseEquivalent[el]{\'}{\accACUTE}

with

\DeclareCaseChangeEquivalent{\'}{%
  \CaseSwitch{\'}{\accACUTE}{\'}{\'}
}

This would replace the \@uclclist extension but still not be locale-specific. If one of \DeclareCaseChangeEquivalent or \CaseSwitch would grow an optional "locale" argument, this combination would become an alternative on par with my suggestion of four new configuration commands.

+1: less change, no new (rarely used) commands
-1: a bit more verbose in usage

gmilde commented 1 year ago

I implemented and tested a comprehensive fix for case changing Greek input via LICR macros. See https://codeberg.org/milde/greek-tex and the test document char-list.tex. It used to work fine with TL21 and TL23 (before the latest update) and will hopefully work again after the fix for https://github.com/latex3/latex3/issues/1236. Feedback is welcome.

josephwright commented 1 year ago

@gmilde You are the expert here: if it works, then probably I won't make further changes at the expl3 end as this is essentially about 'legacy' input

gmilde commented 1 year ago

The releases of babel-greek 1.14 and greek-fontenc 2.5 implement and test fixes for MakeUppercase with "Greek" diacritics for the LGR, TU, and PU font encodings.

Open issues:

\MakeLowercase{Σ} correctly downcases to a final sigma (ς) if the Σ is at the end of a word. In LGR fonts, this is handled by an "autsigma" character with a ligature definition, but in Unicode fonts a "normal" σ is currently printed.

For disambiguation, the Greek word for "or" (ή / ἢ) keeps diacritics in UPPERCASE. The 2022 \MakeUppercase handles this for literal input. It seems there is a test for whitespace on both sides of the eta (diacritics are dropped in, e.g., \MakeUppercase{ή, Ή. ἢ; Ἢ} if used in a Greek text part).
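Both open points can be probed with a small test file. This is a sketch under the assumptions of the earlier examples (Unicode engine, babel-greek, FreeSerif installed). The sample words ΟΔΟΣ and ΣΟΦΟΣ and the phrase "ναι ή όχι" are chosen here purely for illustration. No particular output is claimed, since that is exactly what is under discussion.

```latex
% compile with lualatex or xelatex
\documentclass{article}
\usepackage[greek]{babel}
\usepackage{fontspec}
\setmainfont{FreeSerif} % assumption: font is installed

\begin{document}
\selectlanguage{greek}

% Final sigma: word-final versus word-initial capital sigma.
ΟΔΟΣ, ΣΟΦΟΣ $\mapsto$ \MakeLowercase{ΟΔΟΣ, ΣΟΦΟΣ}

% Isolated eta between spaces ...
ναι ή όχι $\mapsto$ \MakeUppercase{ναι ή όχι}

% ... and next to punctuation (sample taken from the text above).
ή, Ή. ἢ; Ἢ $\mapsto$ \MakeUppercase{ή, Ή. ἢ; Ἢ}
\end{document}
```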

josephwright commented 1 year ago

  • The polytonic variant ETA WITH DASIA AND OXIA used in ἢ … ἤ (either … or) drops diacritics! By mistake, omission, or intent?

Based on https://icu.unicode.org/design/case/greek-upper, this is by-design; the data there shows

νομικοῦ ἢ διεθνοῦς →

ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ

so we check for both U+03AE and U+1F22 (and for U+1F2A), and always output U+0389 (Ή) for the isolated letter. If that's a misinterpretation of the rule, could you provide a link to a demo? I really only had that ICU set to go with.

josephwright commented 1 year ago

It seems there is a test for whitespace on both sides of the eta (diacritics are dropped in, e.g., \MakeUppercase{ή, Ή. ἢ; Ἢ} if used in a Greek text part).

* Is this a correct test, are there corner cases/false positives?

The current implementation here uses a very simple-minded way to detect word boundaries. At the start of the text, and after every space (charcode 32), there is a 'boundary check' function. For the eta test, the approach is to check if

I've looked at the full Unicode word boundary algorithm: it's complex. What would be a lot easier would be to consider the Unicode class of any following tokens: that would be able to deal with ἢ;, for example.
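As a rough illustration of the "Unicode class" idea, the general category of a following character can be queried from expl3. This sketch assumes a 2023 or later expl3 in which `\codepoint_to_category:n` is available and accepts an integer expression. It is not the check used in the current implementation, and the list of code points (semicolon, full stop, space, Greek small alpha) is chosen here only as an example.

```latex
\documentclass{article}

\ExplSyntaxOn
% Write the general category of characters that might follow an
% isolated eta to the terminal and log.
\clist_map_inline:nn { "003B , "002E , "0020 , "03B1 }
  {
    \iow_term:x
      { U+ \int_to_Hex:n {#1} ~ -> ~ \codepoint_to_category:n {#1} }
  }
\ExplSyntaxOff

\begin{document}
\end{document}
```

A boundary test built on this could treat punctuation and separator categories like a space, which would cover input such as ἢ; without implementing the full word boundary algorithm.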

josephwright commented 1 year ago

@gmilde Are we OK to close here and open new issues as required for what feel like independent ideas?