lostenderman / markdown

A package for converting and rendering markdown documents in TeX
http://ctan.org/pkg/markdown
LaTeX Project Public License v1.3c

HTML nonentities produce no output and occasionally fail to be parsed altogether #102

Closed: lostenderman closed this issue 1 year ago

lostenderman commented 1 year ago

See https://spec.commonmark.org/0.30/#example-28

Taken as is, the example produces no output. A modified version of it sometimes fails to be parsed, with the error: markdown.lua:2359: bad argument #1 to 'char' (invalid value)

Witiko commented 1 year ago

The corresponding unit test is testfiles/CommonMark_0.30/entity_and_numeric_character_references/004.test:

%   ---RESULT--- "example": 28,
%   
%   <p><em>&amp;nbsp &amp;x; &amp;#; &amp;#x;</em>
%   <em>&amp;#87654321;</em>
%   <em>&amp;#abcdef0;</em>
%   <em>&amp;ThisIsNotDefined; &amp;hi?;</em></p>
%   
%   ---\RESULT---

<<<
*&nbsp &x; &#; &#x;*
*&#87654321;*
*&#abcdef0;*
*&ThisIsNotDefined; &hi?;*
>>>
documentBegin
emphasis: (ampersand)nbsp (ampersand)x; (ampersand)(hash); (ampersand)(hash)x;
emphasis: (ampersand)(hash)87654321;
emphasis: (ampersand)(hash)abcdef0;
emphasis: (ampersand)ThisIsNotDefined; (ampersand)hi?;
documentEnd
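
For reference, the CommonMark 0.30 rules that make the numeric inputs in this test plain text can be sketched with Lua string patterns. This is only an illustration: `numeric_reference_kind` is a hypothetical helper that does not exist in markdown.lua, and named references such as `&nbsp` or `&ThisIsNotDefined;` are out of scope because they need a lookup in the HTML5 entity table. The spec allows 1 to 7 decimal digits after `&#`, or 1 to 6 hexadecimal digits after `&#x` or `&#X`, each closed by a semicolon.

```lua
-- Hypothetical helper, not part of markdown.lua: classify a string as a
-- decimal or hexadecimal numeric character reference per CommonMark 0.30.
-- Returns the kind and the code point, or nil if the string is not one.
local function numeric_reference_kind(s)
  -- Decimal form: "&#" + 1 to 7 arabic digits + ";"
  local digits = s:match("^&#(%d+);$")
  if digits and #digits <= 7 then
    return "decimal", tonumber(digits, 10)
  end
  -- Hexadecimal form: "&#" + "x" or "X" + 1 to 6 hex digits + ";"
  local hex = s:match("^&#[xX](%x+);$")
  if hex and #hex <= 6 then
    return "hexadecimal", tonumber(hex, 16)
  end
  -- Anything else is not a numeric character reference.
  return nil
end

-- The numeric inputs from the test above are all rejected.
for _, input in ipairs({ "&#;", "&#x;", "&#87654321;", "&#abcdef0;" }) do
  print(input, numeric_reference_kind(input))
end
```

Under this reading, `&#87654321;` is rejected because it has eight digits, and `&#abcdef0;` because its digits are not decimal and the `x` prefix is missing.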

Here is the result of running git checkout commonmark; cd tests; ./test.sh "testfiles/CommonMark_0.30/entity_and_numeric_character_references/004.test":

Testfile testfiles/CommonMark_0.30/entity_and_numeric_character_references/004.test
  Format templates/plain/
    Template templates/plain/input.tex.m4
      Command luatex                                   --interaction=nonstopmode  test.tex
    Template templates/plain/verbatim.tex.m4
      Command luatex                                   --interaction=nonstopmode  test.tex

I cannot reproduce the error that you report.

lostenderman commented 1 year ago

The markdown input

*&nbsp &x; &#; &#x;*
*&#87654321;*
*&#abcdef0;*
*&ThisIsNotDefined; &hi?;*

a

fails with

...
{path to markdown.lua}:2359: bad argument #1 to 'char' (invalid value)
stack traceback:
        [C]: in field 'char'
        {path to markdown.lua}:2359: in function </{path to markdown.lua}:2358>
        [C]: in function 'lpeg.match'
        {path to markdown.lua}:3239: in field 'parse_blocks'
        {path to markdown.lua}:3925: in local 'transform'
        {path to markdown.lua}:184: in field 'cache'
        {path to markdown.lua}:3929: in local 'convert'
        [\directlua]:1: in main chunk.
\lua_now:e #1->\__lua_now:n {#1}

l.23 \end{markdown}

Witiko commented 1 year ago

@lostenderman I can reproduce that:

$ git clone --single-branch --branch main https://github.com/witiko/markdown.git
$ cd markdown/
$ git remote add lostenderman https://github.com/lostenderman/markdown.git
$ git fetch lostenderman
$ git merge lostenderman/commonmark
$ make TEXLIVE_TAG=latest docker-image
$ rm -rf tests/templates/{latex,context}
$ docker run --rm -it -v "$PWD"/tests:/mnt -w /mnt witiko/markdown:latest
# ./test.sh testfiles/CommonMark_0.30/entity_and_numeric_character_references/004.test

Testfile testfiles/CommonMark_0.30/entity_and_numeric_character_references/004.test
  Format templates/plain/
    Template templates/plain/input.tex.m4
      Command pdftex   --shell-escape                  --interaction=nonstopmode  test.tex
        Command terminated with exit code 1.
*** test-expected.log   2022-12-21 14:16:46.763767021 +0000
--- test-actual.log     2022-12-21 14:16:50.043896037 +0000
***************
*** 1,6 ****
- documentBegin
- emphasis: (ampersand)nbsp (ampersand)x; (ampersand)(hash); (ampersand)(hash)x;
- emphasis: (ampersand)(hash)87654321;
- emphasis: (ampersand)(hash)abcdef0;
- emphasis: (ampersand)ThisIsNotDefined; (ampersand)hi?;
- documentEnd
--- 0 ----

This seems to be caused by missing sanity checks in the function entities.dec_entity() (and likely also entities.hex_entity()), which we use to convert HTML entities to Unicode characters:

# cat /tmp/*/test.markdown.err

...cal/texlive/texmf-local/tex/luatex/markdown/markdown.lua:2359: bad argument #1 to 'char' (invalid value)

# kpsewhich markdown.lua

/usr/local/texlive/texmf-local/tex/luatex/markdown/markdown.lua

# head -n 2363 `kpsewhich markdown.lua` | tail -n 6

function entities.dec_entity(s)
  return unicode.utf8.char(tonumber(s))
end
function entities.hex_entity(s)
  return unicode.utf8.char(tonumber("0x"..s))
end
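
A guarded variant of the two functions above might look like the following minimal sketch. It assumes a Lua 5.3+ or LuaTeX interpreter; `safe_char`, `to_utf8`, and `MAX_CODE_POINT` are hypothetical names that do not exist in markdown.lua. The sketch returns nil when the text is not a number, so that a caller could fall back to literal output, and substitutes U+FFFD for code points that cannot be encoded, which is what the CommonMark specification prescribes for invalid code points.

```lua
-- Sketch only: safe_char, to_utf8 and MAX_CODE_POINT are hypothetical names.
-- LuaTeX exposes unicode.utf8.char; plain Lua 5.3+ provides utf8.char.
local to_utf8 = (unicode and unicode.utf8 and unicode.utf8.char) or utf8.char

local MAX_CODE_POINT = 0x10FFFF  -- highest valid Unicode code point

-- Convert a number to UTF-8, guarding against nil and invalid code points.
local function safe_char(n)
  if n == nil then
    return nil  -- not a number at all: let the caller emit literal text
  end
  if n <= 0 or n > MAX_CODE_POINT or (n >= 0xD800 and n <= 0xDFFF) then
    return to_utf8(0xFFFD)  -- U+FFFD REPLACEMENT CHARACTER
  end
  return to_utf8(n)
end

local entities = {}

function entities.dec_entity(s)
  -- An explicit base makes tonumber return nil for inputs like "abcdef0".
  return safe_char(tonumber(s, 10))
end

function entities.hex_entity(s)
  return safe_char(tonumber(s, 16))
end
```

Under these assumptions, `entities.dec_entity("abcdef0")` returns nil instead of raising an error. Note that `"87654321"` still maps to U+FFFD here, whereas the unit test expects the literal text, so the limit on the number of digits would have to be enforced by the parser before these functions are ever called.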

We should check that tonumber(s) is not nil. However, we will also need a higher-level fix, so that the parser doesn't even try to convert the non-entity to a Unicode character to begin with. Here are the relevant PEG patterns.