Closed michamos closed 3 years ago
I see. Interesting approach. One problem is with escaped whitespace, which is typically used to separate non-braced macros from the following text. A common pattern is
_l2t.latex_to_text('foo \\AA\\ foo')
'foo Å\\ foo'
the original handles that better
ol2t.latex_to_text('foo \\AA\\ foo')
'foo Å foo'
both handle 'foo {\AA} foo' ok.
That's a special case of handling a backslash that escapes whitespace
_l2t.latex_to_text('foo \ bar')
'foo \\ bar'
ol2t.latex_to_text('foo \ bar')
'foo bar'
and
_l2t.latex_to_text('foo \\ bar')
'foo \\ bar'
ol2t.latex_to_text('foo \\ bar')
'foo bar'
It's easy to address by adding " " to LATEX_ALLOWED_MACROS since it's part of base_macros. Then I get
_l2t.latex_to_text('foo \\AA\\ bar \\ wibble')
'foo Å bar wibble'
_l2t.latex_to_text('foo \\AA\ bar \ wibble')
'foo Å bar wibble'
some more adverse side effects: enclosing curlies are removed, which alters the meaning and scope of things
How to Correctly Stitch Together {\it Kepler} Data of a Blazhko Star
How to Correctly Stitch Together \it Kepler Data of a Blazhko Star
First detections of the [NII] 122 {\mu}m line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe
First detections of the [NII] 122 \mum line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe
AVAST Survey 0.4-1.0 {\mu}m Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt
AVAST Survey 0.4-1.0 \mum Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt
Central exclusive J/{\psi} and {\chi}c production at LHCb
Central exclusive J/\psi and \chic production at LHCb
this turns valid macros into invalid ones and removes meaningful whitespace
and, my suggested whitespace change is bad for macros which are left unchanged
Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL
Measurement of the \Sigma beam asymmetry for the \omega photo-production off the proton and the neutron at GRAAL
Profiles of Lyman\alpha\ Emission Lines
Profiles of Lyman\alpha Emission Lines
The nature of [S III]{\lambda}{\lambda}9096, 9532 emitters at z = 1.34 and 1.23
The nature of [S III]\lambda\lambda9096, 9532 emitters at z = 1.34 and 1.23
that space after the macros is somewhat important
change of scope
The estimate of emission region locations of {\it Fermi} FSRQs
The estimate of emission region locations of \it Fermi FSRQs
{\it Ab initio} perturbation calculations of realistic effective interactions in the Hartree--Fock basis
\it Ab initio perturbation calculations of realistic effective interactions in the Hartree--Fock basis
On the origin of the Type~{\sc ii} spicules - dynamic 3D MHD simulations
On the origin of the Type~\sc ii spicules - dynamic 3D MHD simulations
list of titles obtained via
x = perform_request_search(p="245__a:/\\\\/ -245__a:/\$/ ")
len(x)
2333
and
same, changed = set(), set()
for t in titles:
    tn = _l2t.latex_to_text(t)
    if tn == t:
        same.add(t)
    else:
        changed.add((t, tn))
len(changed), len(same)
(493, 1836)
for t, tn in changed:
    print(f"{t}\n{tn}\n")
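The comparison loop above can be packaged as a small helper. This is only a minimal sketch: `partition_titles` and the stand-in converter `fake_convert` are hypothetical names introduced for illustration, and in the thread the converter would be `_l2t.latex_to_text`.

```python
def partition_titles(titles, convert):
    """Split titles into those the converter changes and those it leaves alone."""
    changed, same = set(), set()
    for title in titles:
        converted = convert(title)
        if converted == title:
            same.add(title)
        else:
            # Keep both forms so the differences can be reviewed side by side.
            changed.add((title, converted))
    return changed, same


def fake_convert(title):
    # Hypothetical stand-in for the real converter: it only knows one macro.
    return title.replace("\\AA", "Å")


changed, same = partition_titles(["foo {\\AA} foo", "plain title"], fake_convert)
for original, converted in sorted(changed):
    print(f"{original}\n{converted}\n")
```

Passing the converter as an argument makes it easy to rerun the same survey against different converter settings and compare the resulting `changed` sets.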
Thanks @tsgit for your input. I'm scratching my head about https://arxiv.org/abs/1306.5943 which has unicode characters in the visible title on arXiv and some of the metadata tags, but contains LaTeX macros for Greek letters in other metadata tags and (most importantly for us) in the OAI-PMH arXiv export format. AFAICS, none of this behavior is documented, and trying to reverse-engineer this is a waste of time. I'll contact Martin from arXiv to try to get more information.
this looks quite good. it still eats multiple whitespace, though. the examples below all have 2 spaces after the macro, and are left with none
Measurement of the production of \Xi \iota pairs in jets at ...
Measurement of the production of Ξιpairs in jets at ...
J \psi Production Z Hadronic Decays
J ψProduction Z Hadronic Decays
Corrections to the \tau polarisation
Corrections to the τpolarisation
Results on a search for Higgs bosons in the h \nu\bar\nu channel at sqrt(s) =189 GeV using iterative discriminant analysis
Results on a search for Higgs bosons in the h νν̅channel at sqrt(s) =189 GeV using iterative discriminant analysis
this is the most common problem. reducing 2 spaces after a macro to none
it removes curlies that are meaningful
\overrightarrow{p}
\overrightarrowp
\lowercase{e}
\lowercasee
but it leaves other curlies
\bar{QCD}-
{̅Q̅C̅D̅}̅-
this is quite rare in titles, could be manually cleaned up
it has some unintended side effects on parentheses (the output is wrong)
at \sqrt(s) =192-202 GeV
at √(()s) =192-202 GeV
that's a somewhat common pattern specific to \sqrt and could be manually cleaned up
I fixed 40 records with the pattern \sqrt(s) to \sqrt{s}
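For reference, a rewrite of this kind could be scripted with a regular expression. This is a hedged sketch, not necessarily how the 40 records were fixed: `brace_sqrt_argument` is a hypothetical helper, and it assumes the parenthesised argument contains no nested parentheses.

```python
import re

# \sqrt followed directly by a parenthesised argument without nested parentheses.
SQRT_PAREN = re.compile(r"\\sqrt\(([^()]*)\)")


def brace_sqrt_argument(title):
    """Rewrite \\sqrt(x) as \\sqrt{x}; anything else is left untouched."""
    # A function replacement avoids backslash-escape handling in the template.
    return SQRT_PAREN.sub(lambda m: "\\sqrt{" + m.group(1) + "}", title)


print(brace_sqrt_argument("at \\sqrt(s) =192-202 GeV"))  # at \sqrt{s} =192-202 GeV
```

Titles that already use `\sqrt{s}` do not match the pattern, so running the helper twice is harmless.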
Based on number of titles affected I think the main remaining issue is the whitespace after macro issue. It certainly affects readability.
Apart from the \sqrt(s) pattern the other issues I flagged are quite infrequent.
one more problem: "comments" are stripped, that has unintended consequences
Data from Figure 12, 0-20%, shoulder from: Dihadron azimuthal correlations in Au$+$Au collisions at $\sqrt{s_{NN}}=$ 200 GeV
Data from Figure 12, 0-20
Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95% CLs from: Higgs boson production cross-section measurements and their EFT interpretation in the $4\ell $ decay channel at $\sqrt{s}=$13 TeV with the ATLAS detector
Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95
Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60% from: Strange hadron production in Au+Au collisions at $\sqrt{s_{NN}}=$7.7 , 11.5, 19.6, 27, and 39 GeV
Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60
A 4% measurement of $H_0$ using the cumulative distribution of strong-lensing time delays in doubly-imaged quasars
A 4
there are a lot of those
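The truncations above are what plain TeX comment semantics produce: an unescaped % starts a comment that runs to the end of the line. The sketch below only simulates that rule with a regular expression to show why these titles get cut off; `strip_tex_comments` is a hypothetical helper, not pylatexenc's implementation, and it ignores edge cases such as an escaped backslash directly before the %.

```python
import re

# An unescaped % starts a comment that runs to the end of the line.
UNESCAPED_COMMENT = re.compile(r"(?<!\\)%.*$", re.MULTILINE)


def strip_tex_comments(text):
    """Drop everything from an unescaped % to the end of each line."""
    return UNESCAPED_COMMENT.sub("", text)


print(strip_tex_comments("A 4% measurement of $H_0$"))    # A 4
print(strip_tex_comments("A 4\\% measurement of $H_0$"))  # left unchanged
```

Only the escaped form `\%` survives, which is why titles written with a bare percent sign lose everything after it.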
the comments issue is of course easily addressed by option keep_comments=True
the whitespace issue is trickier. it seems like either pylatexenc leaves all whitespace alone or it gobbles up all whitespace as it is insignificant in math context.
I don't see an option to consume exactly one whitespace after a macro without {} or trailing \
Thanks again for the comments @tsgit, I've just pushed a new commit.
keep_comments=True is enabled to keep things starting with a %.
\sqrt handling now uses different formatting based on whether the next char is a (.
Remaining difficult issues:
Whitespace after a macro is consumed (unless the source uses \foo\ bar or \foo{} bar instead of \foo bar). We could set strict_latex_spaces="based-on-source" as documented in https://pylatexenc.readthedocs.io/en/latest/latex2text/#latex-to-text-converter-class, but the opposite issue would arise, in that K\"a hler would be translated to Kä hler. I'm not sure that's much better.
In \foo{bar} with a macro the parser doesn't know, {bar} is treated as a group following the macro, not its argument. Furthermore, I decided to preserve braces only for groups containing more than one char (after conversion) to avoid things like K{\"a}hler -> K{ä}hler. Again, there's nothing we can do here except teaching the latexwalker parser about frequently occurring macros it doesn't know (see example in https://pylatexenc.readthedocs.io/en/latest/latex2text/#custom-latex-conversion-rules-a-simple-template).
I don't see an easy way to solve these two issues. I think the current conversion has some quirks but is reasonable enough, so unless you discover new issues I would think it's ready to be merged.
Hi Micha,
thanks for special casing \\sqrt(s)
For the whitespace issue let's take a step back and look at actual data.
I should look at mixed math titles, to determine the total number of possibly affected records. It may be such a small fraction that it is not worth arguing about.
You are incorrect about accented characters and spaces. Space is not necessary for accented characters and trailing space is not consumed by latex. This also applies to your current implementation.
print(l2t('K\\"ahler or K\\"a hler Stanis\\l aw and Stanis{\\l}aw'))
Kähler or Kä hler Stanisław and Stanisław
The setting for whitespace affects things like Polish l \l in a name.
I agree that this would be a bad thing for author fields.
However for the title field we should compare the frequency of (one letter?) macros (outside of math environments) where trailing space should possibly be consumed to the frequency of symbol type macros followed by a regular word and (other) macros with two trailing spaces (which admittedly is a TeX-ism with some layout hinting - and it's not consistently used).
Among titles without any math delimiters I find 142 titles with a macro with 2 trailing spaces -- and inspection shows that this is overwhelmingly intentional spacing
I find zero titles with a name with a one letter macro. I do find two 1-letter instances
... the \\Z boson ...
... and D\\O Experiments
and here space is good. The contraction is hard to read
print(l2t('and D\\O Experiments'))
and DØExperiments
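A survey like the one described above could be sketched as follows. The function name `survey_macro_spacing` and both patterns are assumptions made for illustration; the counts quoted in this thread came from the actual records, not from this code. "No math delimiters" is approximated here as "contains no $".

```python
import re

# A macro name immediately followed by exactly two or more spaces.
TWO_SPACES_AFTER_MACRO = re.compile(r"\\[A-Za-z]+ {2}")
# A macro whose name is a single letter, e.g. \O or \Z.
ONE_LETTER_MACRO = re.compile(r"\\[A-Za-z](?![A-Za-z])")


def survey_macro_spacing(titles):
    """Return (titles with macro + two spaces, titles with a one-letter macro)."""
    # Only look at titles without math delimiters.
    no_math = [t for t in titles if "$" not in t]
    two_spaces = [t for t in no_math if TWO_SPACES_AFTER_MACRO.search(t)]
    one_letter = [t for t in no_math if ONE_LETTER_MACRO.search(t)]
    return two_spaces, one_letter


titles = [
    "Corrections to the \\tau  polarisation",
    "and D\\O Experiments",
    "at $\\sqrt{s}$  = 13 TeV",
]
two_spaces, one_letter = survey_macro_spacing(titles)
print(len(two_spaces), len(one_letter))  # 1 1
```

The matched titles still need manual inspection, since the patterns cannot tell intentional spacing from accidental spacing.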
For pseudo math, retaining the space after a macro avoids unintended contractions, preserves readability, and does not alter the meaning of formulas. One can argue about the aesthetics of things like μνμ vs. μ ν μ.
Inspecting titles with a mix of math and other things is a bit more involved.
So for the title field, a different whitespace option could be used than for the author field.
I have not looked at impact on the abstract field at all.
I also wonder how spacing affects search.
If the title contains \\lambda couplings and that is converted to λcouplings, is a title search for λ couplings going to find that?
Thanks T.
by my count, the total number of titles with \[A-z] outside of math is 1465, so it's a small fraction of records overall
A common pattern in mixed math titles is \x \to \y. In math mode there is some judicious whitespace on both sides of the arrow. In your current solution there is not
print(l2t('\\tau^- \\to K^*'))
τ^- →K^*
print(l2t('\\tau \\to \\mu \\nu_\\mu '))
τ→μν_μ
anyhow, most of these look like they need some manual adjustment.
@tsgit you're right about escape sequences inside words. Those should be rare in practice, so I've changed the setting to respect spacing in the source (and to collapse two spaces to one, to avoid introducing extra whitespace for untranslated macros such as {\\sc ii}).
... latex-base and advanced-symbols groups of pylatexenc). Macros not in these groups are preserved, as are braces when needed. The only issue is potentially wrong handling of whitespace, but that can't easily be fixed: we respect the spacing used in the source, but that might introduce additional whitespace if a macro is used in the middle of a word. That should be rare though.