phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
https://pylatexenc.readthedocs.io
MIT License

Input and output differs when converting a nodelist back into latex #89

Closed sw-dbrown closed 1 year ago

sw-dbrown commented 1 year ago

Hi, thanks for your very useful library! I have encountered the following issue and I'm not sure if it's a bug or if I'm using it wrong. Consider the following minimal tex files:

dollar.tex

\begin{lstlisting}
$ xxxx -x 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
\end{lstlisting}

wrap.tex

\begin{lstlisting}
"xxxx": "Cxxxxxxxx xxxxx \"xxxx-xxxxxxxx-xxxxxxxxxx-xxxxxx.xxxxxxxxxx-xxxxx-xxx-xxxx
.xxx.xxxxxxx.xxxxx:yyyy/xxxxxxxxxx/xxxxx-xxxx-yx-yx:yy.yy.y.y.yyyyyyyy-yyyy-yyxxyyxxy\"
xxxxxxx xxxxxxx xx xxxxxxx",
\end{lstlisting}

And the following python script:

#!/usr/bin/env python3

from pylatexenc.latexwalker import (
    LatexWalker,
    nodelist_to_latex,
    get_default_latex_context_db,
)

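for file in ["dollar.tex", "wrap.tex"]:
    # Read the LaTeX source to round-trip.
    with open(file, "r") as fh:
        latex_code = fh.read()

    # Parse the whole file into a node list, starting at position 0.
    walker = LatexWalker(latex_code, latex_context=get_default_latex_context_db())
    (nodelist, _, _) = walker.get_latex_nodes(pos=0)

    # Write the regenerated LaTeX next to the input so the two can be diffed.
    with open(file + ".mod", "w") as fh:
        fh.write(nodelist_to_latex(nodelist) + "\n")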

After calling mwe.py, which reads the two latex files into nodelists and then converts them back to latex, the output differs from the input. This can be demonstrated as follows:

$ ./mwe.py
$ diff --new-line-format='+%L' --old-line-format='-%L' --unchanged-line-format=' %L' dollar.tex dollar.tex.mod
 \begin{lstlisting}
 $ xxxx -x 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
-\end{lstlisting}
+
+$\end{lstlisting}
$ diff --new-line-format='+%L' --old-line-format='-%L' --unchanged-line-format=' %L' wrap.tex wrap.tex.mod
 \begin{lstlisting}
 "xxxx": "Cxxxxxxxx xxxxx \"xxxx-xxxxxxxx-xxxxxxxxxx-xxxxxx.xxxxxxxxxx-xxxxx-xxx-xxxx
-.xxx.xxxxxxx.xxxxx:yyyy/xxxxxxxxxx/xxxxx-xxxx-yx-yx:yy.yy.y.y.yyyyyyyy-yyyy-yyxxyyxxy\"
-xxxxxxx xxxxxxx xx xxxxxxx",
+.xxx.xxxxxxx.xxxxx:yyyy/xxxxxxxxxx/xxxxx-xxxx-yx-yx:yy.yy.y.y.yyyyyyyy-yyyy-yyxxyyxxy\"xxxxxxx xxxxxxx xx xxxxxxx",
 \end{lstlisting}

For dollar.tex, a $ sign is inserted in front of the \end{lstlisting} macro, which produces broken latex code. For the file wrap.tex, a newline is removed and two lines are joined.
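The same round-trip check can also be run from Python instead of the shell. The following is a minimal sketch using only the standard library's difflib; the helper name roundtrip_diff and the sample strings are hypothetical and not part of pylatexenc. It returns an empty string when the regenerated text matches the input exactly.

```python
import difflib


def roundtrip_diff(original: str, regenerated: str, name: str = "input.tex") -> str:
    """Return a unified diff between original and regenerated LaTeX source.

    An empty string means the round trip reproduced the input exactly.
    """
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        regenerated.splitlines(keepends=True),
        fromfile=name,
        tofile=name + ".mod",
    )
    return "".join(diff)


# Placeholder strings standing in for a file and its regenerated version,
# where two lines were joined as in the wrap.tex case.
original = "a\nb\nc\n"
joined = "a\nbc\n"

report = roundtrip_diff(original, joined)
print(report or "round trip is lossless")
```

In a script like mwe.py, the two arguments would be latex_code and the string returned by nodelist_to_latex(nodelist), so a non-empty report flags any file that does not survive the round trip.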

I remember reading somewhere that the use of nodelist_to_latex is discouraged, but I can't find the passage in the documentation anymore. Basically, I am writing a small tool that parses latex files into Python objects. My use case is that I would like to read a latex file, make some changes to the nodelist, and write the latex file back. Is there a better alternative than nodelist_to_latex for this?

Thank you very much in advance for your time.

phfaist commented 1 year ago

Hi and thanks for the feedback.

It is recommended to avoid using the nodelist_to_latex() function for anything beyond the simplest examples. You can check its source here; it was introduced in the early days of pylatexenc and has since only really served as an archaic debugging tool. (There's a method on node objects, latex_verbatim(), that returns the latex code that was parsed to produce that node, but this method will not reflect any modified node attributes.)

Pylatexenc does not (yet) support regenerating latex code from the parsed node tree. There are some issues related to the fact that pylatexenc is highly flexible and extensible, so any mechanism for going back to latex code will need to support such extensions as well. For instance, {verbatim} and {lstlisting} environments are parsed in a special way to avoid parsing their contents as actual LaTeX code. (And note that a lot of changes to the parsing engine are being implemented in the background for the future v3 version of the library. In the future, we might be able to add a mechanism for regenerating a node's latex code based on the information stored in its properties.)
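One way to sidestep regeneration entirely, sketched here as an assumption rather than anything pylatexenc or latexpp provides, is to keep the original source string and splice replacements in at character offsets. Untouched regions such as {lstlisting} bodies then stay character-for-character identical. The offsets are assumed to come from the pos and len attributes of parsed nodes; apply_span_edits and the sample source below are hypothetical and use only the standard library.

```python
from typing import List, Tuple

# (pos, length, replacement): replace source[pos:pos + length] with replacement.
Edit = Tuple[int, int, str]


def apply_span_edits(source: str, edits: List[Edit]) -> str:
    """Apply non-overlapping span replacements, copying everything else verbatim."""
    out = []
    cursor = 0
    for pos, length, replacement in sorted(edits):
        if pos < cursor:
            raise ValueError("overlapping edits")
        if length < 0 or pos + length > len(source):
            raise ValueError("edit span lies outside the source")
        out.append(source[cursor:pos])  # untouched text before this edit
        out.append(replacement)
        cursor = pos + length
    out.append(source[cursor:])  # untouched tail
    return "".join(out)


# Hypothetical input: a listing that must survive unchanged, plus one macro
# argument to rewrite. With pylatexenc, pos and length would be taken from
# the node being changed (assumption: its pos and len attributes).
source = "\\begin{lstlisting}\n$ ls -l\n\\end{lstlisting}\n\\textbf{old}\n"
pos = source.index("old")
result = apply_span_edits(source, [(pos, len("old"), "new")])
print(result)
```

The trade-off of this design is that edits are expressed against the source text rather than the node tree, so structural rewrites are more awkward; the benefit is that nothing outside the edited spans can change.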

It sounds like what you'd like to do would be better supported by my sister project latexpp, which is itself based on pylatexenc. Latexpp's purpose is precisely to read latex code, modify it based on the node structure, and output the modified latex code. I hope this helps!