phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
https://pylatexenc.readthedocs.io
MIT License

Unknown latex macros do not include arguments? #60

Closed craffel closed 3 years ago

craffel commented 3 years ago

Hi, thanks for this useful utility, and sorry for the usage question. I'm finding that the following code

from pylatexenc.latex2text import LatexNodes2Text

print(LatexNodes2Text().latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
"""))

will output


gray
fixltx2e

Basically, the arguments to the first two macros get removed, whereas the arguments for the second two don't. I think this may be because the first two macros are built-in LaTeX macros, whereas the latter two are custom ones, so maybe pylatexenc doesn't know how many arguments there should be, so it defaults to assuming "no arguments". If I run this:

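import pylatexenc.latexwalker
import pylatexenc.latex2text
from pylatexenc.latex2text import LatexNodes2Text

def raise_l2t_unknown_latex(n):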
    if n.isNodeType(pylatexenc.latexwalker.LatexMacroNode):
        print(n.latex_verbatim())

l2t_db = pylatexenc.latex2text.get_default_latex_context_db()
l2t_db.set_unknown_macro_spec(
    pylatexenc.latex2text.MacroTextSpec("", simplify_repl=raise_l2t_unknown_latex)
)
LatexNodes2Text(latex_context=l2t_db).latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
""")

then indeed I get

\documentclass{article}
\usepackage{times}
\definecolor
\RequirePackage

Is there any way to make pylatexenc automatically try to grab as many arguments as possible from unknown macros? Thanks a lot!

phfaist commented 3 years ago

Thanks for the feedback. This behavior is by design: if a macro is unknown, there is no way to know whether a brace that follows is an argument or simply LaTeX content that happens to come next. So we assume that the macro does not accept any arguments. Say you write:

\begin{equation}
  \unknownsymbol [A, B] + \anotherunknownsymbol {\textstyle \frac{1}{2}} = 0
\end{equation}

then there is no way to know if [A, B] is an optional argument to the unknown macro \unknownsymbol or if it is simply an expression in square brackets. Also is {\textstyle \frac{1}{2}} a macro argument for \anotherunknownsymbol or is it simply a piece of code where we locally wanted to set inline formatting (\textstyle) for the fraction?

The ambiguity is not limited to math mode. Think about code like \begin{tabular}{ccl} ... \end{tabular} versus \begin{displayquote} {\bf Start with bold text} \end{displayquote}. Or {\bfseries{\itshape text}}. Also, arguments aren't necessarily delimited by braces, so \setlength\columnsep{1in} should really treat both \columnsep and {1in} as arguments, but there is no way of knowing that they are arguments if you don't know how \setlength behaves.

In my opinion the best solution here is to identify which macros are unknown and add their signature to the latex context database.

You can nevertheless customize the behavior of the LatexWalker so that an unknown macro absorbs as many arguments as it can find. For this you can subclass MacroStandardArgsParser. Here's a simple way to get started (use at your own risk):

from pylatexenc import macrospec, latexwalker, latex2text

class AbsorbAllDetectedPossibleMacroArgumentsParser(macrospec.MacroStandardArgsParser):
    def parse_args(self, w, pos, parsing_state=None):
        argspec = ''
        argnlist = []
        origpos = pos
        while True:
            # inspect the following token at the given position (should skip
            # spaces if necessary)
            try:
                tok = w.get_token(pos)
            except latexwalker.LatexWalkerEndOfStream:
                break
            if tok.tok == 'char' and tok.arg.startswith('*'):
                argspec += '*'
                argnlist.append(
                    w.make_node(latexwalker.LatexCharsNode,
                                parsing_state=parsing_state,
                                chars='*', pos=tok.pos, len=1)
                )
                pos = tok.pos + 1
            elif tok.tok == 'char' and tok.arg.startswith('['):
                (node, np, nl) = w.get_latex_maybe_optional_arg(
                    pos,
                    parsing_state=parsing_state
                )
                pos = np + nl
                argspec += '['
                argnlist.append(node)
            elif tok.tok == 'brace_open':
                (node, np, nl) = w.get_latex_expression(
                    pos,
                    strict_braces=False,
                    parsing_state=parsing_state,
                )
                pos = np + nl
                argspec += '{'
                argnlist.append(node)
            else:
                # something else -- we're guessing that it's not a macro
                # argument
                break

        parsed = macrospec.ParsedMacroArgs(
            argspec=argspec,
            argnlist=argnlist,
        )

        return (parsed, origpos, pos-origpos)

lw_db = latexwalker.get_default_latex_context_db()
lw_db.set_unknown_macro_spec(
    macrospec.MacroSpec("", AbsorbAllDetectedPossibleMacroArgumentsParser())
)
output = latex2text.LatexNodes2Text().latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
""", latex_context=lw_db)

# output.strip() == ""

Hope this helps!

craffel commented 3 years ago

Thanks @phfaist, this is very helpful and makes sense. Inside the document, this behavior makes a lot of sense (because \textbf{blah} should parse to blah, for example), but before \begin{document} I think we can be reasonably confident (or certain? I'm not sure how LaTeX works) that \macro{argument} should not produce any text in the output document. The workaround I've been using is to delete all of the content that appears before \begin{document} before using pylatexenc, and to handle separately the fact that this sometimes removes the title/author. I'm not sure if this is just a hack or should be incorporated into the general behavior of pylatexenc, but just as an FYI. Thanks again.
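A rough sketch of that workaround in plain Python, for illustration only: split_preamble and simple_macro_arg are hypothetical helper names, and the regular expression is deliberately naive (it does not handle nested braces or commented-out lines).

```python
import re

def split_preamble(source):
    """Return (preamble, body), where body is the text between
    \\begin{document} and the last \\end{document} (if present)."""
    marker = r'\begin{document}'
    idx = source.find(marker)
    if idx < 0:
        # No document environment: treat everything as body.
        return '', source
    body = source[idx + len(marker):]
    end = body.rfind(r'\end{document}')
    if end >= 0:
        body = body[:end]
    return source[:idx], body

def simple_macro_arg(preamble, name):
    """Naive lookup of \\name{...}; nested braces are not supported."""
    m = re.search(r'\\' + re.escape(name) + r'\s*\{([^{}]*)\}', preamble)
    return m.group(1) if m else None

src = r"""\documentclass{article}
\title{A Title}
\author{An Author}
\begin{document}
Hello
\end{document}
"""
preamble, body = split_preamble(src)
print(simple_macro_arg(preamble, 'title'))
print(simple_macro_arg(preamble, 'author'))
print(body.strip())
```

The body string can then be passed to LatexNodes2Text().latex_to_text(), while the title and author are handled separately.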

phfaist commented 3 years ago

Yes, you're right that in a LaTeX document, any blocks that appear before \begin{document} are usually macro arguments. (But again, not always: you could have, for instance, \makeatletter {\global\let\somecommand\@gobble ...} \makeatother.)

There are inherent limitations in pylatexenc here, I think: 1) it (at least currently) doesn't have any notion of a "document"; it is only designed to work with chunks of LaTeX content, which it parses as markup; and 2) it only supports a small subset of LaTeX, so most documents are likely to have code in the preamble that pylatexenc cannot hope to parse meaningfully. (E.g., anything with \makeatletter ... is likely to confuse pylatexenc because it doesn't know about catcodes. It can even be severely confused by certain \newcommand instructions such as \newcommand{\beginequation}{\begin{equation}}.) I think the better solution here is to support the macros that fit the type of parsing that pylatexenc does. If the documents often have complex preambles, then a strategy like the one you mention, i.e. finding \begin{document} manually, is also likely to be more robust, as otherwise pylatexenc can get confused by other preamble definitions.

Another strategy (outlined in issue #48) can be to start parsing the document from the top, one node at a time, while ignoring errors along the way, and while inspecting the nodes that you get for information you might be interested in (such as \title and \author), up until you reach the \begin{document} node which you can parse in one go.

Thanks again for the feedback!

craffel commented 3 years ago

That makes sense. Thanks again for the tips.