lostenderman / markdown

:notebook_with_decorative_cover: A package for converting and rendering markdown documents in TeX
http://ctan.org/pkg/markdown
LaTeX Project Public License v1.3c

Unicode case fold is used #56

Closed lostenderman closed 1 year ago

lostenderman commented 1 year ago

See https://spec.commonmark.org/0.30/#example-539

Witiko commented 1 year ago

The corresponding unit test is testfiles/CommonMark_0.30/links/059.test:

%   ---RESULT--- "example": 539,
%   
%   <p><a href="/url">ẞ</a></p>
%   
%   ---\RESULT---

<<<
[ẞ]

[SS]: /url
>>>
documentBegin
BEGIN link
- label: ẞ
- URI: /url
- title: 
END link
documentEnd

Here is the result of running git checkout commonmark; cd tests; ./test.sh "testfiles/CommonMark_0.30/links/059.test":

Testfile testfiles/CommonMark_0.30/links/059.test
  Format templates/plain/
    Template templates/plain/input.tex.m4
      Command pdftex   --shell-escape                  --interaction=nonstopmode  test.tex
*** test-expected.log   2022-12-22 11:52:04.127380188 +0100
--- test-actual.log 2022-12-22 11:52:10.727323328 +0100
***************
*** 1,7 ****
  documentBegin
- BEGIN link
- - label: ẞ
- - URI: /url
- - title: 
- END link
  documentEnd
--- 1,2 ----

Witiko commented 1 year ago

This issue seems related to the reader->normalize_tag() method, which normalizes tags for references, indirect links and images, and notes. Currently, that method uses the Unicode-unaware string.lower() method. For Unicode-aware transformations, we use the Selene Unicode library, or the utf8 built-in library of Lua 5.3 and 5.4 when Selene Unicode is unavailable (on less common platforms such as LuaMetaTeX).

The utf8 built-in library has no support for Unicode case. Selene Unicode only supports Unicode-aware lower-casing and upper-casing, which is different from case-folding. This leads me to believe that we cannot currently comply with this requirement. However, we can at least use the unicode.utf8.lower() method when Selene Unicode is available and fall back to string.lower() only when it is unavailable. We should document that this is best-effort, remove unit test testfiles/CommonMark_0.30/links/059.test, and ask for directions on the LuaTeX mailing list.
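
To make the difference between lower-casing and case-folding concrete, here is a minimal sketch in Python. Python is used for illustration only: its built-in str.lower() and str.casefold() stand in for the Lua functions discussed above, and none of this is code from the Markdown package.

```python
# Illustration only: Python built-ins stand in for the Lua functions
# discussed in this thread; this is not the Markdown package's code.
label = "\u1e9e"   # "ẞ", LATIN CAPITAL LETTER SHARP S (the link text)
definition = "SS"  # the label of the reference definition

# Unicode-aware lower-casing maps "ẞ" to "ß", which differs from "ss".
lowered_match = label.lower() == definition.lower()

# Full case folding maps both "ẞ" and "SS" to "ss".
folded_match = label.casefold() == definition.casefold()

print(lowered_match, folded_match)
```

Under lower-casing the two labels stay distinct, so the link in the unit test above would not resolve; under case folding they compare equal.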

Witiko commented 1 year ago

LaTeX3 implements case folding, see \str_casefold:n in LaTeX3 interfaces. However, using LaTeX3 would require that we resolve indirect links in the TeX layer rather than during parsing in low-level Lua.

Witiko commented 1 year ago

I dug around a bit and it seems that LaTeX3 actually reads UnicodeData.txt to implement case-folding and does all the heavy lifting. This seems to be one of the rare cases where it is easier to do data processing in TeX than in pure Lua. Perhaps we could load UnicodeData.txt from the Markdown package and do case folding ourselves (after asking around the LuaTeX mailing list to see if we are reinventing the wheel)?

Note that there is a cost associated with loading UnicodeData.txt and we may load the Markdown package many times in older TeX engines such as pdfTeX, where we have to use the ANSI C system() call to access Lua. Therefore, we would prefer to load UnicodeData.txt lazily and use case-folding only when we fail to find a matching reference definition using both exact matching and ASCII lower-casing.
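
The lazy strategy described above could look roughly like the following sketch. It is written in Python for illustration only; fold_table, ascii_lower, case_fold, and find_reference are hypothetical names, and the two-entry table is a stand-in for whatever would be parsed from the Unicode data files.

```python
from functools import lru_cache

load_count = 0  # counts how often the (expensive) table is actually built


@lru_cache(maxsize=None)
def fold_table():
    """Hypothetical stand-in for parsing the Unicode data files once."""
    global load_count
    load_count += 1
    return {"\u1e9e": "ss", "\u00df": "ss"}  # tiny excerpt, for illustration


def ascii_lower(s):
    """Lower-case only the ASCII letters A-Z and leave everything else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def case_fold(s):
    """Fold via the lazily loaded table, falling back to ASCII lower-casing."""
    table = fold_table()  # the first call triggers the load
    return "".join(table.get(c, ascii_lower(c)) for c in s)


def find_reference(label, definitions):
    """Try exact matching, then ASCII lower-casing, then case folding."""
    if label in definitions:
        return definitions[label]
    lowered = ascii_lower(label)
    for key, value in definitions.items():
        if ascii_lower(key) == lowered:
            return value
    folded = case_fold(label)
    for key, value in definitions.items():
        if case_fold(key) == folded:
            return value
    return None


definitions = {"SS": "/url"}
print(find_reference("ss", definitions), load_count)
print(find_reference("\u1e9e", definitions), load_count)
```

In this sketch, documents whose labels match exactly or up to ASCII case never pay for the table; only a label such as "ẞ" reaches the third step.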

@lostenderman, can you please assign this issue to me?

Witiko commented 1 year ago

> However, using LaTeX3 would require that we resolve indirect links in the TeX layer rather than during parsing in low-level Lua.

As discussed in today's call, the Lua parser needs to keep track of the reference definitions, so that it can produce the correct parse tree. For example, the parse tree for [foo][bar][baz] is ambiguous without knowing the reference definitions, see also example 569 in the CommonMark spec.

Therefore, implementing case-folding in Lua seems to be the only available option.
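
The ambiguity of [foo][bar][baz] can be illustrated with a toy model, written in Python for illustration only. It is not the CommonMark parsing algorithm and not the Markdown package's parser. It assumes that a label followed by a defined label forms a full reference link, and that a label followed by any other label cannot be a shortcut reference. resolve_chain is a hypothetical helper.

```python
def resolve_chain(labels, defined):
    """Resolve a run of adjacent bracketed labels such as [foo][bar][baz].

    Toy model: `labels` are already-normalized label strings and `defined`
    is the set of labels that have a reference definition.
    """
    out = []
    i = 0
    while i < len(labels):
        has_next = i + 1 < len(labels)
        if has_next and labels[i + 1] in defined:
            # Full reference link: text labels[i], reference labels[i + 1].
            out.append(("link", labels[i], labels[i + 1]))
            i += 2
        elif not has_next and labels[i] in defined:
            # Shortcut reference link: the label is its own reference.
            out.append(("link", labels[i], labels[i]))
            i += 1
        else:
            # Neither applies, so the brackets stay literal text.
            out.append(("text", labels[i]))
            i += 1
    return out


chain = ["foo", "bar", "baz"]
print(resolve_chain(chain, {"baz"}))
print(resolve_chain(chain, {"bar", "baz"}))
```

With only baz defined, the model yields literal text [foo] followed by a link with text bar. With bar also defined, it yields a link with text foo followed by a link with text baz. The same input therefore produces different trees depending on the definitions, and on how their labels are normalized.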

Witiko commented 1 year ago

I posted a question at TeX StackExchange:

> I would like to perform Unicode case-folding in LuaTeX.
>
> Can you suggest whether there is an existing implementation available in TeX Live, or whether I should implement the case-folding algorithm myself using the UnicodeData.txt and CaseFolding.txt files as a part of the unicode-data package in TeX Live?
>
> As a side note: LaTeX3 implements case-folding as TeX commands, see function \str_casefold:n in LaTeX3 interfaces. I would find it somewhat amusing if it would be easier to case-fold a string in TeX than it would be in Lua.
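
For reference, implementing the algorithm by hand from CaseFolding.txt amounts to little more than a table lookup. The sketch below is in Python for illustration only (a Lua version would follow the same steps). It parses an inline excerpt in the CaseFolding.txt line format instead of the real file, keeps the common (C) and full (F) mappings, and uses the hypothetical helper names parse_case_folding and fold.

```python
# Excerpt in the CaseFolding.txt line format: code; status; mapping; # name
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_case_folding(text):
    """Build a full case-folding table from C (common) and F (full) entries."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in ("C", "F"):  # skip simple (S) and Turkic (T) mappings
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(s, table):
    """Replace every character that has a mapping; keep the rest unchanged."""
    return "".join(table.get(c, c) for c in s)


table = parse_case_folding(SAMPLE)
print(fold("\u1e9e", table) == fold("SS", table))
```

With the full file in place of the excerpt, the same lookup would cover every character with a C or F mapping. Loading cost and where to locate the file in TeX Live are left open here.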