phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
https://pylatexenc.readthedocs.io
MIT License

Argument parsers should be given the name of the encountered macro, in order to handle unknown macros #71

Closed gamboz closed 2 years ago

gamboz commented 2 years ago

I'm trying to have pylatexenc emit a warning when it finds an unknown macro.

I define an arguments parser that does nothing and emits a warning. Then I define a MacroSpec that uses this parser and finally I register it with the walker's context using set_unknown_macro_spec(). See the code below.

I'd like to ask whether this is the right approach. I suspect that I'm missing something, because I get a warning at the end of the math mode (at the second "$" in the second example in the code below).

I would also like to emit the name of the unknown macro, but this is for later :)

"""Emit a warning on unknown macros - proof of concept."""

from pylatexenc import macrospec, latexwalker, latex2text
import logging

class DoNothingArgumentsParser(macrospec.MacroStandardArgsParser):
    """An argument parser that does nothing and emits a warning."""

    def parse_args(self, w, pos, parsing_state=None):
        """Override the parse_args method to emit the warning."""
        logging.warning("Unknown macro XXX at %s",
                        pos)
        # Forward the caller's parsing state instead of discarding it
        return super().parse_args(w, pos, parsing_state=parsing_state)

walker_context = latexwalker.get_default_latex_context_db()
unknown_macro_spec = macrospec.MacroSpec(
    "",  # anything would do?
    args_parser=DoNothingArgumentsParser()
)
walker_context.set_unknown_macro_spec(unknown_macro_spec)

# first example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""\unknown""", latex_context=walker_context)
print(output)

print("===")

# second example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""start
$\mu $
\foo
\foobar
""", latex_context=walker_context)
print(output)

# Output:
# WARNING:root:Unknown macro XXX at 8
#
# ===
# WARNING:root:Unknown macro XXX at 11
# WARNING:root:Unknown macro XXX at 18
# WARNING:root:Unknown macro XXX at 26
# start
# μ
#
phfaist commented 2 years ago

Your approach is correct! Your code also catches the macro \mu, which is not explicitly declared to the latexwalker. By default, latexwalker falls back to its behavior for unknown macros, which is to keep them as a macro node with no arguments. (The macro is separately declared for latex2text as representing the Unicode "μ" symbol.)
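On the position shown in the warnings: the offsets in the output above (11, 18, 26) are consistent with `pos` being the point where argument parsing starts, that is, just past the macro name and any whitespace following it, rather than the position of the backslash. That would explain why the warning for \mu seems to point at the closing "$". The sketch below reproduces those offsets with a plain regular expression. `arg_parse_positions` is a hypothetical helper, and the regex is only a rough stand-in for the tokenizer (an assumption inferred from the output in this thread, not taken from pylatexenc's source).

```python
import re

# Rough stand-in for reading an alphabetic macro token: a backslash,
# letters, then trailing whitespace. This is NOT pylatexenc's tokenizer;
# for instance, it ignores paragraph breaks (blank lines).
MACRO_RE = re.compile(r"\\([A-Za-z]+)\s*")


def arg_parse_positions(latex):
    """Return (macroname, macro_start, offset_after_name_and_spaces) tuples."""
    return [(m.group(1), m.start(), m.end()) for m in MACRO_RE.finditer(latex)]


# Same text as the "second example" in this thread
sample = "start\n$\\mu $\n\\foo\n\\foobar\n"

for name, start, after in arg_parse_positions(sample):
    print("\\{} starts at {}; arguments would be parsed from {}".format(
        name, start, after))
```

If the macro's own position is what you want to report, the start offset (second element of each tuple in this sketch) is the relevant one.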

I realize it's a weakness of the current API that the parse_args() method is not given any information about the macro/environment being parsed. This is usually not a problem in typical settings where you assign a MacroSpec or EnvironmentSpec to a specific macro or environment, since in such cases the parser is tailored to that macro/environment. A possible approach to display the unknown macro name is to hook directly into the LatexContextDb object. These objects don't expose a simple way of doing this either, but the following code achieves the desired behavior:

from pylatexenc import latexwalker, macrospec, latex2text

class UnknownMacroArgsParser(macrospec.MacroStandardArgsParser):
    def __init__(self, macroname):
        super().__init__()
        self.macroname = macroname

    def parse_args(self, w, pos, parsing_state=None):
        print("Unknown macro `\\{}' at {}".format(self.macroname, pos))
        return super().parse_args(w, pos, parsing_state=parsing_state)

class CustomLatexContextDb(macrospec.LatexContextDb):
    def __init__(self, db):
        super().__init__()
        for cat in db.categories():
            self.add_context_category(
                cat,
                macros=db.iter_macro_specs([cat]),
                environments=db.iter_environment_specs([cat]),
                specials=db.iter_specials_specs([cat]),
            )

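    def get_macro_spec(self, macroname):
        mspec = super().get_macro_spec(macroname)
        if mspec is not None:
            # Known macro: use its regular spec
            return mspec
        # Unknown macro: build a spec whose args parser knows the macro name
        return macrospec.MacroSpec(macroname, args_parser=UnknownMacroArgsParser(macroname))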

walker_context = CustomLatexContextDb(latexwalker.get_default_latex_context_db())

# second example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""start
$\mu $
\foo
\foobar
""", latex_context=walker_context)
print(output)
# prints:
#
# Unknown macro `\mu' at 11
# Unknown macro `\foo' at 18
# Unknown macro `\foobar' at 26
# start
# μ

It's not a particularly elegant solution, and I'll look into how to make this easier in future versions of pylatexenc.

Regarding macros that are considered as unknown to latexwalker but are known to latex2text, you could consider emitting a warning only after performing a search in the latex2text context db object (call l2tcontext.get_macro_spec(macroname) and check if it is None, where l2tcontext is the context-db object used by latex2text). I hope this helps.

I'm going to change the issue title to reflect that the desired improvement to pylatexenc is that unknown macro/environment/specials handlers be given more information about what macro/environment/specials was encountered.

phfaist commented 2 years ago

Actually, I realize that issue #32 already asked a very similar question. If you care about converting to text, not necessarily about obtaining the argument structure, you can plug into latex2text's context db to issue warnings for unknown macros. See my comment in issue #32.

gamboz commented 2 years ago

Thank you for the clarifications. Yes, #32 is better for my use case (sorry I didn't spot it by myself).