phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
https://pylatexenc.readthedocs.io
MIT License
283 stars 35 forks source link

Incorrect (?) parsing of content in `lstlisting` environment #90

Open sw-dbrown opened 1 year ago

sw-dbrown commented 1 year ago

Hello, I have though a little bit more about the issue I described in https://github.com/phfaist/pylatexenc/issues/89. While I definitely understand and respect your reasoning I have come to the conclusion that the issue is not related to nodelist_to_latex() at all.

Consider the following example:

#!/usr/bin/env python3

from pylatexenc.latexwalker import LatexWalker

w = LatexWalker(
    r"""
\begin{lstlisting}
$ some content that should not be interpreted as tex!
\end{lstlisting}
""",
    tolerant_parsing=False,
)
(nodelist, pos, len_) = w.get_latex_nodes(pos=0)

print(nodelist)

It is important to note the presence of tolerant_parsing=False. Executing this code leads to the following error message:

Traceback (most recent call last):
  File "/tmp/mwe/./mwe.py", line 13, in <module>
    (nodelist, pos, len_) = w.get_latex_nodes(pos=0)
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2376, in get_latex_nodes
    r_endnow = do_read(nodelist, p)
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2245, in do_read
    (envnode, epos, elen) = self.get_latex_environment(
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 1834, in get_latex_environment
    (nodelist, npos, nlen) = self.get_latex_nodes(pos,
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2376, in get_latex_nodes
    r_endnow = do_read(nodelist, p)
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2187, in do_read
    (mathinline_nodelist, mpos, mlen) = self.get_latex_nodes(
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2376, in get_latex_nodes
    r_endnow = do_read(nodelist, p)
  File "/tmp/mwe/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/__init__.py", line 2121, in do_read
    raise LatexWalkerParseError(
pylatexenc.latexwalker.LatexWalkerParseError: Unexpected closing environment: 'lstlisting' @(4,0)
Open LaTeX blocks:
            @(2,0)  begin environment "lstlisting"
            @(3,0)  math mode "$"

If tolerant_parsing is set to True, the script will successfully complete execution and return the following output:

[LatexCharsNode(parsing_state=<parsing state 140426563620000>, pos=0, len=1, chars='\n'), 
LatexEnvironmentNode(parsing_state=<parsing state 140426563620000>, pos=1, len=90, environmentname='lstlisting', nodelist=
[LatexCharsNode(parsing_state=<parsing state 140426563620000>, pos=19, len=1, chars='\n'), LatexMathNode(parsing_state=
<parsing state 140426563620000>, pos=20, len=71, displaytype='inline', nodelist=[LatexCharsNode(parsing_state=<parsing state 
140426563621104>, pos=21, len=53, chars=' some content that should not be interpreted as tex!\n'), 
LatexCharsNode(parsing_state=<parsing state 140426563621104>, pos=90, len=1, chars='\n')], delimiters=('$', '$'))], 
nodeargd=ParsedMacroArgs(argspec='', argnlist=[]))]

Here, it can be seen that a LatexMathNode is created inside of the lstlisting environment. In my opinion, this is not correct behavior since content inside of the lstlisting environment should be treated as verbatim (one big LatexCharsNode (?)). The construction of the unbalanced LatexMathNode is what eventually causes nodelist_to_latex() to produce invalid latex code but it is, in my opinion, not the root of the problem. Similarly, in the second example in https://github.com/phfaist/pylatexenc/issues/89 the sequence \" gets turned into a LatexMacroNode.

Substituting the lstlisting environment for a verbatim environment instead leads to correct results.

Again, thank you very much for your time!

sw-dbrown commented 1 year ago

Digging a little bit into the source it appears that there is currently no special handling for the lstlisting environment as there is for verbatim. Should VerbatimArgsParser be adapted to handle more environments in addition to verbatim?

Edit: A quick change which causes lstlisting to be handled in a similar way as the verbatim environment:

From 94780a144e1f79ec17bfede62e0f3b10af3e65c9 Mon Sep 17 00:00:00 2001
From: David Brown <dbrown@schutzwerk.com>
Date: Fri, 14 Oct 2022 09:54:54 +0200
Subject: [PATCH] Handle lstlisting as verbatim environment

---
 .../pylatexenc/latexwalker/_defaultspecs.py   | 17 ++++++++++++++
 .../pylatexenc/macrospec/_argparsers.py       | 23 +++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py b/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py
index 3306a89..a8f6e24 100644
--- a/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py
+++ b/venv/lib/python3.10/site-packages/pylatexenc/latexwalker/_defaultspecs.py
@@ -308,6 +308,23 @@ specs = [
             # for verbatim |\like \this|...
         ]}),

+
+    #
+    # CATEGORY: listings
+    #
+    ('listings', {
+        'macros': [
+            # TODO: Maybe handle lstinline as well?
+            ],
+        'environments': [
+            EnvironmentSpec('lstlisting',
+                            args_parser=VerbatimArgsParser(verbatim_arg_type='lstlisting-environment')),
+        ],
+        'specials': [
+            # optionally users could include the specials "|" like in latex-doc
+            # for verbatim |\like \this|...
+        ]}),
+
     #
     # CATEGORY: theorems
     #
diff --git a/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py b/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py
index 5d06724..39b311a 100644
--- a/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py
+++ b/venv/lib/python3.10/site-packages/pylatexenc/macrospec/_argparsers.py
@@ -451,6 +451,29 @@ class VerbatimArgsParser(MacroStandardArgsParser):
             )
             return (argd, pos, len_)

+        if self.verbatim_arg_type == 'lstlisting-environment':
+            # simply scan the string until we find '\end{lstlisting}'.  That's
+            # exactly how LaTeX processes it.
+            endverbpos = w.s.find(r'\end{lstlisting}', pos)
+            if endverbpos == -1:
+                raise latexwalker.LatexWalkerParseError(
+                    s=w.s,
+                    pos=pos,
+                    msg=r"Cannot find matching \end{lstlisting}"
+                )
+            # do NOT include the "\end{lstlisting}", latexwalker will expect to
+            # see it:
+            len_ = endverbpos-pos
+
+            argd = ParsedVerbatimArgs(
+                verbatim_chars_node=w.make_node(latexwalker.LatexCharsNode,
+                                                parsing_state=parsing_state,
+                                                chars=w.s[pos:pos+len_],
+                                                pos=pos,
+                                                len=len_)
+            )
+            return (argd, pos, len_)
+
         if self.verbatim_arg_type == 'verb-macro':
             # read the next nonwhitespace char. This is the delimiter of the
             # argument
--
2.37.2

I am not sure if this is the correct way to implement this. It does, however, now lead to the correct behavior and has the nice side effect that nodelist_to_latex() returns the expected content :)

Note: This does not handle other macros or environments provided by the listings package (e.g. lstinline).

phfaist commented 1 year ago

Thanks for the careful analysis! I'll have a closer look at some point soon.

phfaist commented 1 year ago

An update here — the lstlisting environment is now recognized correctly in pylatexenc version 3.0alpha. I am still likely to update the way the contents is reported though. I'm aiming to have verbatim environments report their contents with their environment body nodelist containing a single LatexCharsNode with the verbatim content (which might include characters that could have a special meaning in LaTeX). This would take advantage of the new parsing features of pylatexenc 3.