arXiv: only unescape safe LaTeX macros

michamos commented 3 years ago

Instead of unescaping everything and potentially losing meaningful info, now only macros which can be translated losslessly are handled (the latex-base and advanced-symbols groups of pylatexenc). Macros not in these groups are preserved, as are braces when needed. The only issue is potentially wrong handling of whitespace, but that can't easily be fixed: we respect the spacing used in the source, but that might introduce additional whitespace if a macro is used in the middle of a word. That should be rare though.
ref: inspirehep/inspirehep#1754
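
To illustrate the whitelisting idea, here is a toy sketch that uses a plain regex rather than pylatexenc. The `SAFE_MACROS` table and the `unescape_safe` helper are hypothetical names chosen for the example; the actual change relies on pylatexenc's macro groups and also deals with braces, which this sketch ignores.

```python
import re

# Hypothetical whitelist: macro name -> Unicode replacement.
# pylatexenc's real tables are far larger; these entries are for illustration only.
SAFE_MACROS = {"AA": "Å", "alpha": "α", "mu": "μ", "psi": "ψ"}

MACRO_RE = re.compile(r"\\([A-Za-z]+)")


def unescape_safe(text):
    """Replace whitelisted macros and leave every other macro as written."""
    def repl(match):
        # Fall back to the original source text for macros not in the whitelist.
        return SAFE_MACROS.get(match.group(1), match.group(0))
    return MACRO_RE.sub(repl, text)


print(unescape_safe(r"Lyman \alpha emission"))  # Lyman α emission
print(unescape_safe(r"{\it Kepler} data"))      # {\it Kepler} data
```

The point of the fallback is that an unknown macro such as `\it` survives untouched instead of being dropped.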

tsgit commented 3 years ago

I see. Interesting approach. One problem is with escaped whitespace, typically to separate non-braced macros from following text. A common pattern is

_l2t.latex_to_text('foo \\AA\\ foo')
'foo Å\\ foo'

the original handles that better

ol2t.latex_to_text('foo \\AA\\ foo')
'foo Å foo'

both handle 'foo {\AA} foo' ok.

That's a special case of handling a backslash that escapes whitespace

_l2t.latex_to_text('foo \ bar')
'foo \\ bar'

ol2t.latex_to_text('foo \ bar')
'foo  bar'

and

_l2t.latex_to_text('foo \\ bar')
'foo \\ bar'

ol2t.latex_to_text('foo \\ bar')
'foo  bar'

It's easy to address by adding " " to LATEX_ALLOWED_MACROS since it's part of base_macros. Then I get

_l2t.latex_to_text('foo \\AA\\ bar \\ wibble')
'foo Å bar  wibble'

_l2t.latex_to_text('foo \\AA\ bar \ wibble')
'foo Å bar  wibble'

tsgit commented 3 years ago

some more adverse side effects: enclosing curlies are removed, this alters the meaning and scope of things

How to Correctly Stitch Together {\it Kepler} Data of a Blazhko Star
How to Correctly Stitch Together \it Kepler Data of a Blazhko Star

First detections of the [NII] 122 {\mu}m line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe
First detections of the [NII] 122 \mum line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe

AVAST Survey 0.4-1.0 {\mu}m Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt
AVAST Survey 0.4-1.0 \mum Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt

Central exclusive J/{\psi} and {\chi}c production at LHCb
Central exclusive J/\psi and \chic production at LHCb

this turns valid macros into invalid ones and removes meaningful whitespace

and, my suggested whitespace change is bad for macros which are left unchanged

Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL
Measurement of the \Sigma beam asymmetry for the \omega photo-production off the proton and the neutron at GRAAL

Profiles of Lyman\alpha\ Emission Lines
Profiles of Lyman\alpha Emission Lines

The nature of [S III]{\lambda}{\lambda}9096, 9532 emitters at z = 1.34 and 1.23
The nature of [S III]\lambda\lambda9096, 9532 emitters at z = 1.34 and 1.23

that space after the macros is somewhat important

change of scope

The estimate of emission region locations of {\it Fermi} FSRQs
The estimate of emission region locations of \it Fermi FSRQs

{\it Ab initio} perturbation calculations of realistic effective interactions in the Hartree--Fock basis
\it Ab initio perturbation calculations of realistic effective interactions in the Hartree--Fock basis

On the origin of the Type~{\sc ii} spicules - dynamic 3D MHD simulations
On the origin of the Type~\sc ii spicules - dynamic 3D MHD simulations

list of titles obtained via

x = perform_request_search(p="245__a:/\\\\/ -245__a:/\$/ ")
len(x)
 2333

and

same, changed = set(), set()  # unchanged titles vs. (original, converted) pairs
for t in titles:
    tn = _l2t.latex_to_text(t)
    if tn == t:
        same.add(t)
    else:
        changed.add((t, tn))
len(changed), len(same)
 (493, 1836)

for t, tn in changed:
    print(f"{t}\n{tn}\n")

michamos commented 3 years ago

Thanks @tsgit for your input. I'm scratching my head about https://arxiv.org/abs/1306.5943 which has unicode characters in the visible title on arXiv and some of the metadata tags, but contains LaTeX macros for Greek letters in other metadata tags and (most importantly for us) in the OAI-PMH arXiv export format. AFAICS, none of this behavior is documented, and trying to reverse-engineer this is a waste of time. I'll contact Martin from arXiv to try to get more information.

tsgit commented 3 years ago

this looks quite good. it still eats multiple whitespace, though. the examples below all have 2 spaces after the macro, and are left with none

Measurement of the production of  \Xi \iota  pairs in jets at ... 
Measurement of the production of  Ξιpairs in jets at ...

J \psi  Production Z Hadronic Decays
J ψProduction Z Hadronic Decays

Corrections to the  \tau  polarisation
Corrections to the  τpolarisation

Results on a search for Higgs bosons in the h \nu\bar\nu  channel at  sqrt(s) =189 GeV using iterative discriminant analysis
Results on a search for Higgs bosons in the h νν̅channel at  sqrt(s) =189 GeV using iterative discriminant analysis

this is the most common problem. reducing 2 spaces after a macro to none

it removes curlies that are meaningful

\overrightarrow{p}
\overrightarrowp

\lowercase{e}
\lowercasee

but it leaves other curlies

\bar{QCD}-
{̅Q̅C̅D̅}̅-

this is quite rare in titles, could be manually cleaned up

it has some unintended side effects: (wrong) parentheses

at  \sqrt(s) =192-202 GeV
at  √(()s) =192-202 GeV

that's a somewhat common pattern specific to \sqrt and could be manually cleaned up. I fixed 40 records with the pattern \sqrt(s), changing it to \sqrt{s}.
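
A cleanup of this kind could be scripted roughly as follows. This is only a sketch under the assumption that the argument contains no nested parentheses; `brace_sqrt_argument` is a hypothetical helper, not the procedure that was actually applied to the records.

```python
import re

# Matches \sqrt followed directly by a parenthesised argument
# that itself contains no parentheses.
SQRT_PAREN_RE = re.compile(r"\\sqrt\(([^()]*)\)")


def brace_sqrt_argument(title):
    """Rewrite \\sqrt(x) as \\sqrt{x}; all other text is left untouched."""
    return SQRT_PAREN_RE.sub(lambda m: r"\sqrt{" + m.group(1) + "}", title)


print(brace_sqrt_argument(r"at  \sqrt(s) =192-202 GeV"))
# at  \sqrt{s} =192-202 GeV
```

Titles that already use `\sqrt{s}` do not match the pattern and pass through unchanged.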

tsgit commented 3 years ago

Based on number of titles affected I think the main remaining issue is the whitespace after macro issue. It certainly affects readability. Apart from the \sqrt(s) pattern the other issues I flagged are quite infrequent.

tsgit commented 3 years ago

one more problem: "comments" are stripped, that has unintended consequences

Data from Figure 12, 0-20%, shoulder from: Dihadron azimuthal correlations in Au$+$Au collisions at $\sqrt{s_{NN}}=$ 200 GeV
Data from Figure 12, 0-20

Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95% CLs from: Higgs boson production cross-section measurements and their EFT interpretation in the $4\ell $ decay channel at $\sqrt{s}=$13 TeV with the ATLAS detector
Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95

Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60% from: Strange hadron production in Au+Au collisions at $\sqrt{s_{NN}}=$7.7 , 11.5, 19.6, 27, and 39 GeV
Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60

A 4% measurement of $H_0$ using the cumulative distribution of strong-lensing time delays in doubly-imaged quasars
A 4

there are a lot of those
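
The truncation follows from the LaTeX rule that an unescaped % starts a comment running to the end of the line. The toy sketch below reproduces the effect with a regex; it is not how pylatexenc parses comments, and `strip_latex_comments` is a hypothetical name. It also does not handle an escaped backslash directly before the %.

```python
import re

# An unescaped % starts a comment that runs to the end of the line.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def strip_latex_comments(text):
    """Drop everything from an unescaped % to the end of each line."""
    return COMMENT_RE.sub("", text)


title = "A 4% measurement of $H_0$ using the cumulative distribution of strong-lensing time delays"
print(repr(strip_latex_comments(title)))                 # 'A 4'
print(repr(strip_latex_comments(r"A 4\% measurement")))  # 'A 4\\% measurement'
```

Only the escaped form `\%` survives comment stripping, which is why plain-text titles with a bare % are cut short.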

tsgit commented 3 years ago

the comments issue is of course easily addressed by option

keep_comments=True

the whitespace issue is trickier. it seems like either pylatexenc leaves all whitespace alone or it gobbles up all whitespace as it is insignificant in math context. I don't see an option to consume exactly one whitespace after a macro without {} or trailing \

michamos commented 3 years ago

Thanks again for the comments @tsgit, I've just pushed a new commit.

keep_comments=True is enabled to keep things starting with a %;
I've overridden sqrt handling to have different formatting based on whether the next char is a (.

Remaining difficult issues:

disappearing spacing: as you certainly know, whitespace in LaTeX is not significant, so 1 vs 2 spaces after a macro are treated in the same way. Macros eat the whitespace after them, so the current uses where space gets gobbled are not strictly LaTeX but TeX-isms (proper LaTeX would be \foo\ bar or \foo{} bar instead of \foo bar). We could set strict_latex_spaces="based-on-source" as documented in https://pylatexenc.readthedocs.io/en/latest/latex2text/#latex-to-text-converter-class, but the opposite issue would arise, in that K\"a hler would be translated to Kä hler. I'm not sure that's much better.
braces: the problem here is that macros in LaTeX can take a variable number of arguments delimited by curly braces. For unknown macros, by definition we don't know how many arguments they take, so they are treated by the library as taking zero arguments, so in \foo{bar} {bar} is treated as a group following the macro, not its argument. Furthermore, I decided to preserve braces only for groups containing more than one char (after conversion) to avoid things like K{\"a}hler -> K{ä}hler. Again, there's nothing we can do here except teaching the latexwalker parser about frequently occurring macros it doesn't know (see example in https://pylatexenc.readthedocs.io/en/latest/latex2text/#custom-latex-conversion-rules-a-simple-template).

I don't see an easy way to solve these two issues. I think the current conversion has some quirks but is reasonable enough, so unless you discover new issues I would think it's ready to be merged.

tsgit commented 3 years ago

Hi Micha,

thanks for special casing \\sqrt(s)

For the whitespace issue let's take a step back and look at actual data.

I should look at mixed math titles, to determine the total number of possibly affected records. It may be such a small fraction that it is not worth arguing about.

You are incorrect about accented characters and spaces. Space is not necessary for accented characters and trailing space is not consumed by latex. This also applies to your current implementation.

print(l2t('K\\"ahler or K\\"a hler Stanis\\l aw and Stanis{\\l}aw'))
Kähler or Kä hler Stanisław and Stanisław

The setting for whitespace affects things like Polish l \l in a name.

I agree that this would be a bad thing for author fields.

However for the title field we should compare the frequency of (one letter?) macros (outside of math environments) where trailing space should possibly be consumed to the frequency of symbol type macros followed by a regular word and (other) macros with two trailing spaces (which admittedly is a TeX-ism with some layout hinting - and it's not consistently used).

Among titles without any math delimiters I find 142 titles with a macro followed by 2 trailing spaces, and inspection shows that this is overwhelmingly intentional spacing. I find zero titles with a name containing a one-letter macro. I do find two 1-letter instances, "... the \\Z boson ..." and "... and D\\O Experiments", and here the space is good. The contraction is hard to read:

print(l2t('and D\\O Experiments'))
and DØExperiments

For pseudo-math, retaining the space after a macro avoids unintended contractions, preserves readability, and does not alter the meaning of formulas. One can argue about the aesthetics of things like μνμ vs. μ ν μ.

Inspecting titles with a mix of math and other things is a bit more involved.

So for the title field, a different whitespace option could be used than for the author field.

I have not looked at impact on the abstract field at all.

I also wonder how spacing affects search. If the title contains \\lambda couplings and that is converted to λcouplings, is a title search for λ couplings going to find that?

Thanks T.

tsgit commented 3 years ago

by my count, the total number of titles with \[A-z] outside of math is 1465 so it's a small fraction of records overall
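
A count like this could be approximated offline along the following lines. This is a sketch with assumptions: math is taken to be delimited by single `$` only, the sample titles come from earlier in this thread, and `has_macro_outside_math` is a hypothetical helper. In ASCII the range `A-z` also covers `[ \ ] ^ _` and the backtick, so the sketch uses `A-Za-z`; whether that changes the count above has not been checked.

```python
import re

MATH_RE = re.compile(r"\$[^$]*\$")    # inline math delimited by single $
MACRO_RE = re.compile(r"\\[A-Za-z]")  # backslash followed by an ASCII letter


def has_macro_outside_math(title):
    """True if a macro remains once inline math segments are removed."""
    return bool(MACRO_RE.search(MATH_RE.sub("", title)))


titles = [
    r"Profiles of Lyman\alpha\ Emission Lines",
    r"Strange hadron production at $\sqrt{s_{NN}}=$7.7 GeV",
]
print(sum(has_macro_outside_math(t) for t in titles))  # 1
```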

A common pattern in mixed math titles is \x \to \y. In math mode there is some judicious whitespace on both sides of the arrow. In your current solution there is not

print(l2t('\\tau^- \\to K^*'))
τ^- →K^*

print(l2t('\\tau \\to \\mu \\nu_\\mu '))
τ→μν_μ

anyhow, most of these look like they need some manual adjustment.

michamos commented 3 years ago

@tsgit you're right about escape sequences inside words. Those should be rare in practice, so I've changed the setting to respect spacing in source (and collapsing two spaces to one, to avoid introducing extra whitespace for untranslated macros such as {\\sc ii}).
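
The difference between the two spacing policies can be shown with a toy converter. It is a sketch, not the hepcrawl code: the `GREEK` table and `convert` function are hypothetical, and real conversion goes through pylatexenc. With `keep_source_spaces=False` a translated macro swallows the spaces after it; with `True` the source spacing is kept and runs of spaces are collapsed to one.

```python
import re

# Hypothetical subset of translatable macros, for illustration only.
GREEK = {"tau": "τ", "mu": "μ", "nu": "ν", "to": "→"}
MACRO_RE = re.compile(r"\\([A-Za-z]+)( *)")


def convert(text, keep_source_spaces):
    def repl(m):
        name, spaces = m.group(1), m.group(2)
        symbol = GREEK.get(name)
        if symbol is None:
            return m.group(0)  # unknown macro: keep the source as written
        return symbol + (spaces if keep_source_spaces else "")

    converted = MACRO_RE.sub(repl, text)
    if keep_source_spaces:
        # Collapse runs of spaces so double spacing does not pile up.
        converted = re.sub(r" {2,}", " ", converted)
    return converted


src = r"Corrections to the  \tau  polarisation"
print(convert(src, keep_source_spaces=False))  # Corrections to the  τpolarisation
print(convert(src, keep_source_spaces=True))   # Corrections to the τ polarisation
```

The trade-off discussed above remains: keeping source spaces also keeps a space that was only there to terminate a macro inside a word.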

inspirehep / hepcrawl

arXiv: only unescape safe LaTeX macros #299