lostenderman closed this issue 1 year ago
The corresponding unit test is testfiles/CommonMark_0.30/links/059.test:
% ---RESULT--- "example": 539,
%
% <p><a href="/url">ẞ</a></p>
%
% ---\RESULT---
<<<
[ẞ]
[SS]: /url
>>>
documentBegin
BEGIN link
- label: ẞ
- URI: /url
- title:
END link
documentEnd
Here is the result of running git checkout commonmark; cd tests; ./test.sh "testfiles/CommonMark_0.30/links/059.test":
Testfile testfiles/CommonMark_0.30/links/059.test
Format templates/plain/
Template templates/plain/input.tex.m4
Command pdftex --shell-escape --interaction=nonstopmode test.tex
*** test-expected.log 2022-12-22 11:52:04.127380188 +0100
--- test-actual.log 2022-12-22 11:52:10.727323328 +0100
***************
*** 1,7 ****
documentBegin
- BEGIN link
- - label: ẞ
- - URI: /url
- - title:
- END link
documentEnd
--- 1,2 ----
This issue seems related to the reader->normalize_tag() method, which normalizes tags for references, indirect links and images, and notes. Currently, we use the Unicode-unaware string.lower() method. For Unicode-aware transformations, we currently use either the Selene Unicode library or, if Selene Unicode is unavailable (on unusual platforms such as LuaMetaTeX), the utf8 built-in library of Lua 5.3 and 5.4.
The utf8 built-in library has no support for Unicode case. Selene Unicode only supports (1, 2) Unicode-aware lower-casing and upper-casing, which is different from case-folding. This leads me to believe that we cannot currently comply with this requirement, although we can at least use the unicode.utf8.lower() method when Selene Unicode is available and polyfill it with string.lower() only when Selene Unicode is unavailable. We should document that this is best-effort, remove unit test testfiles/CommonMark_0.30/links/059.test, and ask for directions on the LuaTeX mailing list.
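To make the difference between lower-casing and case-folding concrete, here is a small illustration in Python. Python is used only because its built-in str.casefold() performs full Unicode case folding; it is not part of the Markdown package, and normalize_label is a hypothetical helper name.

```python
def normalize_label(label: str, fold: bool) -> str:
    """Normalize a link label by case folding (fold=True) or lower-casing."""
    return label.casefold() if fold else label.lower()


reference = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S, as in [ẞ]
definition = "SS"     # as in [SS]: /url

# Lower-casing maps U+1E9E to U+00DF (ß), which is not equal to "ss".
lowercase_match = (
    normalize_label(reference, fold=False) == normalize_label(definition, fold=False)
)

# Case folding maps both labels to "ss", so the reference would resolve.
casefold_match = (
    normalize_label(reference, fold=True) == normalize_label(definition, fold=True)
)
```

Here lowercase_match is False and casefold_match is True, which mirrors why a lower-casing normalizer cannot pass this test.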
LaTeX3 implements case folding, see \str_casefold:n in LaTeX3 interfaces. However, using LaTeX3 would require that we resolve indirect links in the TeX layer rather than during parsing in low-level Lua.
I dug around a bit and it seems that LaTeX3 actually reads UnicodeData.txt to implement case-folding and does all the heavy lifting. This seems to be one of the rare cases where it is easier to do data processing in TeX than in pure Lua. Perhaps we could load UnicodeData.txt from the Markdown package and do case folding ourselves (after asking around the LuaTeX mailing list to see if we are reinventing wheels)?
Note that there is a cost associated with loading UnicodeData.txt and we may load the Markdown package many times in older TeX engines such as pdfTeX, where we have to use the ANSI C system() call to access Lua. Therefore, we would prefer to load UnicodeData.txt lazily and use case-folding only when we fail to find a matching reference definition using both exact matching and ASCII lower-casing.
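The lookup order described above (exact match, then ASCII lower-casing, then case folding with a lazily loaded table) could look roughly like the following Python sketch. All names are hypothetical, and the hard-coded two-entry table merely stands in for the expensive step of reading the Unicode data files; the actual package code is Lua and may differ.

```python
import string
from functools import lru_cache

# Translation table that lower-cases only A-Z, like a Unicode-unaware lower().
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


@lru_cache(maxsize=None)
def load_fold_table():
    """Stand-in for the costly load; runs at most once thanks to the cache."""
    return {"\u1e9e": "ss", "\u00df": "ss"}  # tiny hard-coded excerpt


def ascii_lower(label):
    return label.translate(ASCII_LOWER)


def case_fold(label):
    table = load_fold_table()  # the first call triggers the load
    return "".join(table.get(ch, ch) for ch in ascii_lower(label))


def find_reference(label, definitions):
    """Return the URL for label, or None; definitions maps labels to URLs."""
    # 1. Exact match: no normalization needed.
    if label in definitions:
        return definitions[label]
    # 2. ASCII lower-casing: cheap and needs no Unicode data.
    wanted = ascii_lower(label)
    for key, url in definitions.items():
        if ascii_lower(key) == wanted:
            return url
    # 3. Case folding: only now is the fold table loaded.
    wanted = case_fold(label)
    for key, url in definitions.items():
        if case_fold(key) == wanted:
            return url
    return None
```

With definitions = {"SS": "/url"}, looking up "SS" or "ss" never touches the table, while looking up "ẞ" loads it once and returns "/url"; load_fold_table.cache_info().misses reports how many times the load actually ran.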
@lostenderman, can you please assign this issue to me?
> However, using LaTeX3 would require that we resolve indirect links in the TeX layer rather than during parsing in low-level Lua.
As discussed in today's call, the Lua parser needs to keep track of the reference definitions, so that it can produce the correct parse tree. For example, the parse tree for [foo][bar][baz] is ambiguous without knowing the reference definitions, see also example 569 in the CommonMark spec.
Therefore, implementing case-folding in Lua seems to be the only available option.
I posted a question at TeX StackExchange:
I would like to perform Unicode case-folding in LuaTeX.
Can you suggest whether there is an existing implementation available in TeX Live, or whether I should implement the case-folding algorithm myself using the UnicodeData.txt and CaseFolding.txt files as a part of the unicode-data package in TeX Live?
As a side note: LaTeX3 implements case-folding as TeX commands, see function \str_casefold:n in LaTeX3 interfaces. I would find it somewhat amusing if it would be easier to case-fold a string in TeX than it would be in Lua.
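As a rough idea of what the do-it-yourself option from the question might involve, the Python sketch below parses lines in the CaseFolding.txt format (code; status; mapping; # name) and applies full case folding by keeping the C (common) and F (full) entries. The function names and the embedded excerpt are illustrative assumptions; a real implementation would read the complete file from the unicode-data package and would be written in Lua.

```python
SAMPLE = """\
# Excerpt in the CaseFolding.txt format: <code>; <status>; <mapping>; # <name>
0041; C; 0061; # LATIN CAPITAL LETTER A
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_case_folding(text):
    """Build a full case-folding table from CaseFolding.txt-style text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        # C + F together give full case folding; S and T entries are skipped.
        if status in ("C", "F"):
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(text, table):
    """Case-fold text character by character; unmapped characters are kept."""
    return "".join(table.get(ch, ch) for ch in text)


table = parse_case_folding(SAMPLE)
labels_match = fold("\u1e9e", table) == fold("SS", table)
```

With this excerpt, both "ẞ" and "SS" fold to "ss", so labels_match is True.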
See https://spec.commonmark.org/0.30/#example-539