inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

parsers: escape latex in archive parser #298

Closed oguzdemirbasci closed 3 years ago

oguzdemirbasci commented 3 years ago

Description

LaTeX markup in the title and abstract is converted to Unicode using pylatexenc as a post-processing step.

Related Issue

https://github.com/inspirehep/inspirehep/issues/1754

tsgit commented 3 years ago

As far as I can tell, this swallows whitespace and drops unrecognized macros, so it alters the text in undesirable ways. The tests should be more comprehensive and cover real-life examples.

Search for a standard model-like Higgs boson at LEP2 in the H \nu\bar\nu channel using a probabilistic analysis

is turned into

Search for a standard model-like Higgs boson at LEP2 in the H νν̅channel using a probabilistic analysis

and

some custom \foobar macro

is turned into

some custom macro
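
The two cases above could be pinned down with a small conservative converter. The following is only a sketch of the behaviour the review seems to ask for, not the code in this PR and not the pylatexenc API: it uses the standard library alone, and the names `SYMBOLS`, `ACCENTS` and `convert_known_macros` as well as the tiny macro tables are hypothetical. Known macros are replaced, while whitespace and unrecognized macros are left exactly as they were.

```python
import re

# Hypothetical, deliberately tiny tables; a real parser would need far more entries.
SYMBOLS = {
    "nu": "\u03bd",  # Greek small letter nu
    "mu": "\u03bc",  # Greek small letter mu
}
ACCENTS = {
    "bar": "\u0305",  # combining overline, placed after the base character
}

# An accent macro immediately followed by a symbol macro, e.g. \bar\nu
ACCENT_RE = re.compile(r"\\(" + "|".join(ACCENTS) + r")\\([A-Za-z]+)")
# Any control word, e.g. \nu or \foobar
MACRO_RE = re.compile(r"\\([A-Za-z]+)")


def convert_known_macros(text):
    """Replace known LaTeX macros with Unicode; leave everything else untouched."""

    def replace_accent(match):
        accent, symbol = match.group(1), match.group(2)
        if symbol in SYMBOLS:
            return SYMBOLS[symbol] + ACCENTS[accent]
        return match.group(0)  # accent on an unknown macro: keep the source text

    def replace_symbol(match):
        # Unknown macros fall back to the original text instead of being dropped.
        return SYMBOLS.get(match.group(1), match.group(0))

    text = ACCENT_RE.sub(replace_accent, text)
    return MACRO_RE.sub(replace_symbol, text)


if __name__ == "__main__":
    print(convert_known_macros(r"in the H \nu\bar\nu channel using"))
    print(convert_known_macros(r"some custom \foobar macro"))
```

Because the substitution only touches the matched macro text, the space before "channel" survives and `\foobar` passes through unchanged. The same two inputs could serve as regression cases for whichever converter the parser ends up using.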