phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
https://pylatexenc.readthedocs.io
MIT License

Allow for exception when converting #62

Closed RossWilliamson closed 3 years ago

RossWilliamson commented 3 years ago

It would be good to have an exception list when doing the conversion. For example, I would like to keep \ref as \ref so that I can put label markers in before running LaTeX. Right now that gets printed just as \ref.

phfaist commented 3 years ago

Hi, thanks for the issue. You can already achieve the desired result by specifying custom text conversions (see https://pylatexenc.readthedocs.io/en/latest/latex2text/):

from pylatexenc import latex2text

l2t_context = latex2text.get_default_latex_context_db()
# Prepend a category so that this \ref spec takes precedence over the default one.
l2t_context.add_context_category(
    'preserve-custom-macros',
    macros=[latex2text.MacroTextSpec('ref', simplify_repl=r'\ref{%(1)s}')],
    prepend=True,
)
l2t = latex2text.LatexNodes2Text(latex_context=l2t_context)

latex = r'\emph{For the definition of $\alpha$, see also:} \ref{eq:a} \& \ref{eq:b}'
converted = l2t.latex_to_text(latex)
print(converted)
# outputs →  For the definition of α, see also: \ref{eq:a} & \ref{eq:b}

I'm closing this issue, feel free to reopen if I'm missing anything.

RossWilliamson commented 3 years ago

Thanks! I was wondering how to do the same with latexencode rather than latex2text. I have a string that contains a deliberate "\ref" which I need to preserve. I tried the following:

import re
from pylatexenc import latexencode

cr = [
    latexencode.UnicodeToLatexConversionRule(latexencode.RULE_REGEX, [
        (re.compile(r'\\ref'), r'\\ref'),
    ], replacement_latex_protection='none'),
    'defaults',
]

u_to_l = latexencode.UnicodeToLatexEncoder(conversion_rules=cr)

u_to_l.unicode_to_latex(r'\ref{sec:pp:qq}')

but it returns \ref\{sec:pp:qq\}, i.e. it escapes the curly brackets, which is not wanted.

phfaist commented 3 years ago

Try:

import re
from pylatexenc import latexencode

cr = [
    # Match the whole \ref{...} macro, argument included, and emit it unchanged.
    latexencode.UnicodeToLatexConversionRule(latexencode.RULE_REGEX, [
        (re.compile(r'\\ref\{([^\}]+)\}'), r'\\ref{\1}'),
    ], replacement_latex_protection='none'),
    'defaults',
]

u_to_l = latexencode.UnicodeToLatexEncoder(conversion_rules=cr)

print( u_to_l.unicode_to_latex(r'See \ref{sec:pp:qq} for α=β') )
# prints: See \ref{sec:pp:qq} for \ensuremath{\alpha}=\ensuremath{\beta}

Also, because this regular expression matches the whole macro including its argument, no escaping will happen within the argument of the \ref macro.
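
To see why matching the argument matters, here is a rough standard-library sketch of a "first matching rule wins" scanner. It is not pylatexenc's actual implementation: the encode() helper and the ESCAPES table are hypothetical stand-ins for the encoder and its 'defaults' rule, and only the two regex rules are taken from the thread above.

```python
import re

# Hypothetical stand-in for the 'defaults' rule: escape a few LaTeX specials.
ESCAPES = {'{': r'\{', '}': r'\}', '_': r'\_', '&': r'\&', '%': r'\%', '#': r'\#'}

def encode(text, regex_rules):
    """Scan left to right; at each position the first matching rule wins."""
    out = []
    pos = 0
    while pos < len(text):
        for pattern, repl in regex_rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                # The rule consumes the whole match, so later rules never see it.
                out.append(m.expand(repl))
                pos = m.end()
                break
        else:
            # No regex rule matched here: fall back to character escaping.
            ch = text[pos]
            out.append(ESCAPES.get(ch, ch))
            pos += 1
    return ''.join(out)

# Rule from the question: protects only the macro name.
macro_only = [(re.compile(r'\\ref'), r'\\ref')]
# Rule from the answer: protects the macro name and its argument.
whole_macro = [(re.compile(r'\\ref\{([^\}]+)\}'), r'\\ref{\1}')]

print(encode(r'\ref{sec_a}', macro_only))   # \ref\{sec\_a\}
print(encode(r'\ref{sec_a}', whole_macro))  # \ref{sec_a}
```

With the first rule only \ref is consumed, so the braces and the underscore fall through to the escaping step. With the second rule the entire \ref{...} is consumed in one go and nothing inside it is touched.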