Remove unnecessary LaTeX encodings à la pylatexenc.latex2text

GraemeWatt commented 4 years ago

Firstly, thanks for releasing the packages on PyPI and npm. The Python package unicodeit v0.7.0 is now used in the @HEPData code for tweeting titles of high-energy physics publications via https://twitter.com/HEPData.

After applying unicodeit.replace to each paper title, we need to apply some cleanup operations to remove characters like $, {, }, ~, and encodings like \mathrm, \text, and \rm. Recently, we were alerted via a reply to a tweet that our code fails for \mathrm{t}\overline{\mathrm{t}}. We would need to first remove \mathrm (with appropriate matching of braces) before applying UnicodeIt, not afterwards. An alternative would be to use pylatexenc.latex2text which applies appropriate cleanup operations (although it seems \overline is not supported). The problematic paper title was:

Search for resonant $ \mathrm{t}\overline{\mathrm{t}} $ production in proton-proton collisions at $ \sqrt{s}=13 $ TeV

where UnicodeIt gives "Search for resonant $ \mathrm{t}\̅athrm{t}} $ production in proton−proton collisions at $ √{s}=13 $ TeV" and latex2text gives "Search for resonant tt production in proton-proton collisions at √(s)=13 TeV". For our intended application, it would probably make sense to switch to pylatexenc.latex2text instead of unicodeit. Clemens Lange (@clelange) pointed to some code based on pylatexenc.latex2text used for cleaning paper titles tweeted from the @CMSpapers, @LHCb_results, and @AtlasPapers Twitter accounts.

Is there any possibility to extend UnicodeIt to appropriately remove LaTeX encodings like $, {, }, ~, \mathrm, \text, \rm, etc., in a similar way to pylatexenc.latex2text? Feel free to close this issue if you think it is beyond the intended scope of UnicodeIt.
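
To make the ordering problem above concrete: the suggestion is to remove wrappers such as \mathrm and \text, with proper brace matching, before applying UnicodeIt. Below is a minimal sketch of such a pre-processing step using only the standard library. The helper names (WRAPPERS, find_matching_brace, strip_wrappers) are hypothetical and not part of unicodeit or pylatexenc. Edge cases are ignored, for example a wrapper placed directly after another command name.

```python
import re

# Wrapper commands whose braces are dropped while the content is kept.
# Assumption: this list is illustrative, taken from the commands named in the issue.
WRAPPERS = ("mathrm", "text")


def find_matching_brace(s, open_idx):
    """Return the index of the '}' that closes the '{' at open_idx, or -1."""
    depth = 0
    for i in range(open_idx, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1


def strip_wrappers(s):
    """Replace \\mathrm{X} and \\text{X} by X, respecting nested braces."""
    pattern = re.compile(r"\\(?:%s)\s*\{" % "|".join(WRAPPERS))
    while True:
        m = pattern.search(s)
        if m is None:
            return s
        open_idx = m.end() - 1  # position of the '{' that follows the command
        close_idx = find_matching_brace(s, open_idx)
        if close_idx == -1:
            return s  # unbalanced braces: leave the rest untouched
        # Keep only the content between the matched braces.
        s = s[:m.start()] + s[open_idx + 1:close_idx] + s[close_idx + 1:]


title = r"Search for resonant $ \mathrm{t}\overline{\mathrm{t}} $ production"
cleaned = strip_wrappers(title)
```

The cleaned string would then be passed to unicodeit.replace, with the remaining cleanup of $, {, }, and ~ applied afterwards as before.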

svenkreiss commented 4 years ago

Thanks @GraemeWatt for pointing that out. I think you are actually pointing out multiple things to improve that should all be addressed. Will definitely leave this issue open until this is addressed.

GraemeWatt commented 4 years ago

I'll try to give some more examples of problematic paper titles when I spot them that might be useful for future testing.

Tweet for Measurement of the $CP$ violating phase $\phi_{\text{s}}$ in the $\mathrm{B}_s \to \mathrm{J}/\psi\,\phi(1020) \to \mu^+\mu^-\,\mathrm{K}^+\mathrm{K}^-$ channel in proton-proton collisions at $\sqrt{s} = 13~\mathrm{TeV}$.

unicodeit: Measurement of the $CP$ violating phase $ϕ_{\text{s}}$ in the $\mathrm{B}ₛ → \mathrm{J}/ψ ϕ(1020) → μ⁺μ⁻ \mathrm{K}⁺\mathrm{K}⁻$ channel in proton−proton collisions at $√{s} = 13~\mathrm{TeV}$

latex2text: Measurement of the CP violating phase ϕ_s in the B_s →J/ψ ϕ(1020) →μ^+μ^- K^+K^- channel in proton-proton collisions at √(s) = 13 TeV

HDembinski commented 1 year ago

@GraemeWatt Could you have a look at https://github.com/HDembinski/unicodeitplus ? I tried to address the issues with the parsing of a mix of LaTeX code and normal text in unicodeitplus. Running it on your Tweet gives me

Measurement of the 𝐶𝑃 violating phase 𝜙ₛ in the Bₛ→J/𝜓 𝜙(1020)→𝜇⁺𝜇⁻ K⁺K⁻ channel in proton-proton collisions at √𝑠̅=13~TeV

I am lacking a rule for ~, but that can be added easily.

GraemeWatt commented 1 year ago

@HDembinski : thanks, unicodeitplus looks great and better suited for our use case than the original unicodeit. I've opened HEPData/hepdata#664 to make the switch after some more testing. I've already identified some minor problems and I'll open new issues in the unicodeitplus repository.

HDembinski commented 1 year ago

Excellent, thanks!

GraemeWatt commented 1 year ago

I just wrote a Jupyter notebook that gets the titles of all (almost 10,000) HEPData records and compares the output from latex2text, unicodeit and unicodeitplus. I hope it will be useful in testing future improvements to these tools.
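
The notebook itself is not reproduced in this thread. As a rough illustration of the kind of comparison described, the sketch below applies several converters to each title and keeps the rows where their outputs disagree. The function name compare_converters and the stand-in converters are hypothetical. In practice the callables could be unicodeit.replace or pylatexenc's LatexNodes2Text().latex_to_text.

```python
def compare_converters(titles, converters):
    """Apply each converter to each title; return the rows where outputs differ."""
    rows = []
    for title in titles:
        outputs = {}
        for name, convert in converters.items():
            try:
                outputs[name] = convert(title)
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                outputs[name] = "ERROR: %s" % exc
        if len(set(outputs.values())) > 1:
            rows.append({"title": title, **outputs})
    return rows


# Stand-in converters for illustration only; swap in the real tools as needed.
converters = {
    "identity": lambda s: s,
    "no_dollar": lambda s: s.replace("$", ""),
}
rows = compare_converters(["plain title", "at $x$ TeV"], converters)
```

Only titles on which the tools disagree are kept, which keeps the list short enough to review by hand.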

svenkreiss / unicodeit

Remove unnecessary LaTeX encodings à la pylatexenc.latex2text #25