phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
https://pylatexenc.readthedocs.io
MIT License

Support for `\newenvironment` wrapping another environment #50

Open dlaidig opened 3 years ago

dlaidig commented 3 years ago

latex2text fails when parsing a document that contains a \newenvironment command that wraps an existing environment. I have been able to narrow it down to the following minimal example:

latex2text --code '\newenvironment{annotate}{\begin{scope}}{\end{scope}}'

which gives the following output:

INFO:pylatexenc.latexwalker:Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,39)
INFO:pylatexenc.latexwalker:Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'scope' @(1,41)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2248, in do_read
    mspec.parse_args(w=self, pos=tok.pos + tok.len,
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/macrospec/__init__.py", line 95, in parse_args
    return self.args_parser.parse_args(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/macrospec/_argparsers.py", line 293, in parse_args
    (node, np, nl) = w.get_latex_expression(
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1551, in get_latex_expression
    tok = self.get_token(pos, environments=False, parsing_state=parsing_state)
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1356, in get_token
    raise LatexWalkerEndOfStream(final_space=space)
pylatexenc.latexwalker.LatexWalkerEndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/latex2text", line 11, in <module>
    load_entry_point('pylatexenc==2.8', 'console_scripts', 'latex2text')()
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latex2text/__main__.py", line 190, in main
    (nodelist, pos, len_) = lw.get_latex_nodes()
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2351, in get_latex_nodes
    r_endnow = do_read(nodelist, p)
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2251, in do_read
    e = self._exchandle_parse_subexpression(
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1862, in _exchandle_parse_subexpression
    e.open_contexts.append(
AttributeError: 'LatexWalkerEndOfStream' object has no attribute 'open_contexts'

By trial and error, I found that parsing works if I add a custom definition macrospec.std_macro('newenvironment', "*[[{{"), i.e. remove the first { argument from the default *{[[{{.
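
For readers unfamiliar with the argspec notation, the two strings above can be spelled out character by character. The sketch below assumes the pylatexenc 2.x convention that `*` is an optional star, `[` an optional bracketed argument, and `{` a mandatory argument. `ARG_KINDS` and `describe_argspec` are hypothetical names used for illustration only; they are not part of pylatexenc.

```python
# Illustration only: describe_argspec is a hypothetical helper, not pylatexenc API.
# Assumed meaning of each argspec character (pylatexenc 2.x convention).
ARG_KINDS = {
    "*": "optional star",
    "[": "optional argument in [...]",
    "{": "mandatory argument",
}


def describe_argspec(argspec):
    """Return one human-readable description per character of an argspec string."""
    return [ARG_KINDS[ch] for ch in argspec]


default_spec = describe_argspec("*{[[{{")     # default spec quoted in the comment above
workaround_spec = describe_argspec("*[[{{")   # workaround spec with one '{' removed

for label, spec in (("default", default_spec), ("workaround", workaround_spec)):
    print(label, "->", ", ".join(spec))
```

Read this way, the workaround asks the parser for two mandatory arguments instead of three, so the last brace group of the example is no longer consumed as an argument of \newenvironment.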

phfaist commented 3 years ago

Hi and thanks for the report. There are two points to unpack here.

First, you're seeing an error mainly because pylatexenc does not really support the \newcommand/\newenvironment family of commands (see my comment in issue 48). It attempts to parse the arguments to \newenvironment like LaTeX blocks of text; it doesn't record in any way the new command or new environment for future parsing. In most cases you might not see any errors in tolerant parsing mode, but what really happens is that the \newcommand/\newenvironment instruction gets ignored (it gets parsed as a simple macro node and then converted to empty text), and then later in the document custom macros are handled using the default behavior for unknown macros or environments (which might or might not give you the desired behavior). I have plans for better support of \newcommand/\newenvironment commands, but they haven't been fully implemented yet. I've written some (experimental) code in my other project latexpp that expands some commands defined by \newcommand/\newenvironment. Depending on your use case, you might be able to reuse some code from https://github.com/phfaist/latexpp/blob/master/latexpp/fixes/newcommand.py for your purposes.

On the other hand, the additional exception you're seeing ("AttributeError: 'LatexWalkerEndOfStream' object has no attribute 'open_contexts'") is a bug and I'll look to fix it. Thanks for reporting.

phfaist commented 3 years ago

My latest commit should fix the weird chained exception that you reported. I'm leaving the issue open as an enhancement to enable \newenvironment wrapping another environment. This should be supported once I get the \newcommand family of commands supported. (See also #48.) Thanks for reporting!

dlaidig commented 3 years ago

Thanks! I have tested the master branch and can confirm that the parsing error is gone. :)

I also noticed that now (and also with my proposed workaround) the resulting node representation is kind of screwed up, and subsequent LaTeX code is still regarded as belonging to the \newenvironment command. By coincidence, this does not bother me in my special use case, so I am happy as long as I can parse the file without exceptions.

I guess in general it is still desirable to be able to parse \newenvironment properly and represent \begin and \end commands in the arguments in some reasonable way.
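
One possible reading of "some reasonable way" is to collect the arguments of \newenvironment purely by brace matching, so that \begin and \end inside them stay as raw text and never open or close an environment. The following is a minimal sketch of that idea in plain Python. It does not use pylatexenc, `read_braced_args` is a hypothetical helper, and it deliberately ignores comments, escaped braces such as \{, and the optional [...] arguments.

```python
def read_braced_args(text, pos, count):
    r"""Read `count` consecutive {...} groups starting at `pos`, by brace matching only.

    \begin and \end inside a group are kept as plain text.
    Returns (list_of_group_contents, position_after_last_group).
    Hypothetical helper: ignores comments, \{ escapes and [...] arguments.
    """
    args = []
    for _ in range(count):
        # Skip whitespace between arguments.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text) or text[pos] != "{":
            raise ValueError("expected '{' at position %d" % pos)
        start = pos
        depth = 0
        # Advance until the brace opened at `start` is closed again.
        while pos < len(text):
            ch = text[pos]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            pos += 1
        if depth != 0:
            raise ValueError("unbalanced braces starting at position %d" % start)
        args.append(text[start + 1:pos])
        pos += 1  # step past the closing brace
    return args, pos


source = r"\newenvironment{annotate}{\begin{scope}}{\end{scope}} rest"
macro = r"\newenvironment"
start = source.index(macro) + len(macro)
(name, begin_code, end_code), end = read_braced_args(source, start, 3)
print(name, begin_code, end_code)
```

With this approach the text following the definition (here `source[end:]`) stays outside the \newenvironment arguments, which is the property the node list currently lacks.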

FYI: My use case is a LaTeX to text script that supports all the special macros I usually use and can handle includes with the standalone package. That includes some hacks to throw away the preambles from imported documents and lots of special formatters for custom commands. So in this case, I do not want \newcommand definitions to be parsed and applied like latexpp does, but I am perfectly happy with having a parsed node list that I can transform as needed.
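
As an illustration of the preamble hack mentioned above, a sketch along the following lines could be applied to each imported file before parsing. This is an assumption about how such a step might look, not the actual script. `strip_preamble` is a hypothetical name, and the function does not account for a commented-out \begin{document}.

```python
def strip_preamble(source):
    r"""Return the text between \begin{document} and \end{document}.

    If there is no \begin{document}, the source is returned unchanged.
    If \end{document} is missing, everything after \begin{document} is returned.
    """
    begin_marker = r"\begin{document}"
    end_marker = r"\end{document}"
    start = source.find(begin_marker)
    if start == -1:
        return source
    start += len(begin_marker)
    stop = source.find(end_marker, start)
    return source[start:stop] if stop != -1 else source[start:]


standalone_file = (
    r"\documentclass{standalone}" "\n"
    r"\newenvironment{annotate}{\begin{scope}}{\end{scope}}" "\n"
    r"\begin{document}Hello\end{document}"
)
body = strip_preamble(standalone_file)
print(body)
```

Because the \newenvironment definition sits in the preamble in this example, it is dropped before any parsing takes place.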